2024-12-17

Title: Personalized and Sequential Text-to-Image Generation

Authors: Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, Craig Boutilier
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2412.10419
Pdf URL: https://arxiv.org/pdf/2412.10419
Copy Paste: [[2412.10419]] Personalized and Sequential Text-to-Image Generation(https://arxiv.org/abs/2412.10419)
Keywords: generation
Abstract: We address the problem of personalized, interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest a personalized and diverse slate of prompt expansions to the user. Our Personalized And Sequential Text-to-image Agent (PASTA) extends T2I models with personalized multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also release our sequential rater dataset and simulated user-rater interactions to support future research in personalized, multi-turn T2I generation.

Title: CAP: Evaluation of Persuasive and Creative Image Generation

Authors: Aysan Aghazadeh, Adriana Kovashka
Subjects: cs.CV, cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10426
Pdf URL: https://arxiv.org/pdf/2412.10426
Copy Paste: [[2412.10426]] CAP: Evaluation of Persuasive and Creative Image Generation(https://arxiv.org/abs/2412.10426)
Keywords: generation
Abstract: We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text consists of implicit messages. We further introduce a simple yet effective approach to enhance T2I models' capabilities in producing images that are better aligned, more creative, and more persuasive.

Title: GPTDrawer: Enhancing Visual Synthesis through ChatGPT

Authors: Kun Li, Xinwei Chen, Tianyou Song, Hansong Zhang, Wenzhe Zhang, Qing Shan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10429
Pdf URL: https://arxiv.org/pdf/2412.10429
Copy Paste: [[2412.10429]] GPTDrawer: Enhancing Visual Synthesis through ChatGPT(https://arxiv.org/abs/2412.10429)
Keywords: generation, generative
Abstract: In the burgeoning field of AI-driven image generation, the quest for precision and relevance in response to textual prompts remains paramount. This paper introduces GPTDrawer, an innovative pipeline that leverages the generative prowess of GPT-based models to enhance the visual synthesis process. Our methodology employs a novel algorithm that iteratively refines input prompts using keyword extraction, semantic analysis, and image-text congruence evaluation. By integrating ChatGPT for natural language processing and Stable Diffusion for image generation, GPTDrawer produces a batch of images that undergo successive refinement cycles, guided by cosine similarity metrics until a threshold of semantic alignment is attained. The results demonstrate a marked improvement in the fidelity of images generated in accordance with user-defined prompts, showcasing the system's ability to interpret and visualize complex semantic constructs. The implications of this work extend to various applications, from creative arts to design automation, setting a new benchmark for AI-assisted creative processes.
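The refinement loop described in the GPTDrawer abstract (generate, measure image-text cosine similarity, rewrite the prompt, and stop at a threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: `embed_text`, `generate_and_embed`, and `rewrite_prompt` are hypothetical callables standing in for a text encoder, an image generator plus image encoder, and an LLM-based prompt rewriter, and the `threshold` and `max_rounds` defaults are arbitrary.

```python
import math
from typing import Callable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def refine_until_aligned(
    prompt: str,
    embed_text: Callable[[str], Sequence[float]],          # hypothetical text encoder
    generate_and_embed: Callable[[str], Sequence[float]],  # hypothetical image generator + image encoder
    rewrite_prompt: Callable[[str], str],                  # hypothetical LLM-based prompt rewriter
    threshold: float = 0.8,                                # assumed stopping threshold
    max_rounds: int = 5,                                   # assumed iteration budget
) -> Tuple[str, float, int]:
    """Return (best prompt, its similarity score, number of rounds used)."""
    target = embed_text(prompt)  # the user's original intent stays the reference
    current = prompt
    best_prompt, best_score = current, -1.0
    for round_idx in range(1, max_rounds + 1):
        score = cosine_similarity(target, generate_and_embed(current))
        if score > best_score:
            best_prompt, best_score = current, score
        if score >= threshold:
            return best_prompt, best_score, round_idx
        current = rewrite_prompt(current)
    return best_prompt, best_score, max_rounds
```

Keeping the original prompt's embedding as the fixed target is a design choice of this sketch; it prevents the rewritten prompts from drifting away from what the user first asked for.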

Title: Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation

Authors: SeungBum Ha, Taehwan Lee, Jiyoun Lim, Sung Whan Yoon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10436
Pdf URL: https://arxiv.org/pdf/2412.10436
Copy Paste: [[2412.10436]] Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation(https://arxiv.org/abs/2412.10436)
Keywords: generation
Abstract: Federated learning (FL) has recently garnered attention as a data-decentralized training framework that enables the learning of deep models from locally distributed samples while keeping data privacy. Built upon the framework, immense efforts have been made to establish FL benchmarks, which provide rigorous evaluation settings that control data heterogeneity across clients. Prior efforts have mainly focused on handling relatively simple classification tasks, where each sample is annotated with a one-hot label, such as MNIST, CIFAR, LEAF benchmark, etc. However, little attention has been paid to demonstrating an FL benchmark that handles complicated semantics, where each sample encompasses diverse semantic information from multiple labels, such as Panoptic Scene Graph Generation (PSG) with objects, subjects, and relations between them. Because the existing benchmark is designed to distribute data in a narrow view of a single semantic, e.g., a one-hot label, managing the complicated semantic heterogeneity across clients when formalizing FL benchmarks is non-trivial. In this paper, we propose a benchmark process to establish an FL benchmark with controllable semantic heterogeneity across clients: two key steps are i) data clustering with semantics and ii) data distributing via controllable semantic heterogeneity across clients. As a proof of concept, we first construct a federated PSG benchmark, demonstrating the efficacy of the existing PSG methods in an FL setting with controllable semantic heterogeneity of scene graphs. We also present the effectiveness of our benchmark by applying robust federated learning algorithms to data heterogeneity to show increased performance. Our code is available at this https URL.

Title: SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion

Authors: Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10437
Pdf URL: https://arxiv.org/pdf/2412.10437
Copy Paste: [[2412.10437]] SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion(https://arxiv.org/abs/2412.10437)
Keywords: generation
Abstract: The generation of Scalable Vector Graphics (SVG) assets from textual data remains a significant challenge, largely due to the scarcity of high-quality vector datasets and the limitations in scalable vector representations required for modeling intricate graphic distributions. This work introduces SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data without reliance on a text-based discrete language model or prolonged SDS optimization. The essence of SVGFusion is to learn a continuous latent space for vector graphics with a popular Text-to-Image framework. Specifically, SVGFusion consists of two modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). VP-VAE takes both the SVGs and corresponding rasterizations as inputs and learns a continuous latent space, whereas VS-DiT learns to generate a latent code within this space based on the text prompt. Based on VP-VAE, a novel rendering sequence modeling strategy is proposed to enable the latent space to embed the knowledge of construction logic in SVGs. This empowers the model to achieve human-like design capabilities in vector graphics, while systematically preventing occlusion in complex graphic compositions. Moreover, SVGFusion's capability can be continuously improved by adding more VS-DiT blocks, leveraging the scalability of VS-DiT. A large-scale SVG dataset is collected to evaluate the effectiveness of our proposed method. Extensive experimentation has confirmed the superiority of SVGFusion over existing SVG generation methods, achieving enhanced quality and generalizability, thereby establishing a novel framework for SVG content creation. Code, model, and data will be released at: this https URL

Title: SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization

Authors: Zhentao Tan, Ben Xue, Jian Jia, Junhao Wang, Wencai Ye, Shaoyun Shi, Mingjie Sun, Wenjin Wu, Quan Chen, Peng Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10443
Pdf URL: https://arxiv.org/pdf/2412.10443
Copy Paste: [[2412.10443]] SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization(https://arxiv.org/abs/2412.10443)
Keywords: generation
Abstract: This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. Firstly, to obtain compact latent representations, we decouple images or videos into spatial-temporal dimensions, translating visual information into learnable querying spatial and temporal tokens through a Cross-attention Query AutoEncoder (CQAE). Secondly, to complement visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings to leverage the rich semantics from language modality. Finally, to enhance training stability and convergence, we also introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only 25% of the tokens used in previous state-of-the-art video tokenizers, and boosts video generation results by 32.9% w.r.t. gFVD. When using the same token number, we significantly improve video and image reconstruction results by 57.1% w.r.t. rFVD on UCF-101 and 37.2% w.r.t. rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
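The codebook quantization step mentioned in the SweetTokenizer abstract follows the general VQ-VAE idea of replacing each continuous token with its nearest codebook entry. A minimal sketch of that generic lookup is given below. It assumes squared Euclidean distance and a toy hand-written codebook; the paper derives its codebook from LLM embeddings and may use a different distance, so the names `squared_distance` and `quantize` are purely illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(tokens: Sequence[Vector], codebook: Sequence[Vector]) -> Tuple[List[int], List[List[float]]]:
    """Map each token to the index of its nearest codebook entry.

    Returns the discrete indices and the corresponding codebook vectors.
    Ties are broken in favour of the lowest index.
    """
    if not codebook:
        raise ValueError("codebook must not be empty")
    indices: List[int] = []
    for token in tokens:
        distances = [squared_distance(token, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy usage: three 2-D tokens quantized against a two-entry codebook.
toy_indices, toy_vectors = quantize(
    [[0.1, 0.0], [0.9, 1.0], [0.4, 0.7]],
    [[0.0, 0.0], [1.0, 1.0]],
)
```

Only the integer indices need to be stored or passed to a downstream generator, which is where the compression comes from.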

Title: Boundary Exploration of Next Best View Policy in 3D Robotic Scanning

Authors: Leihui Li, Xuping Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.10444
Pdf URL: https://arxiv.org/pdf/2412.10444
Copy Paste: [[2412.10444]] Boundary Exploration of Next Best View Policy in 3D Robotic Scanning(https://arxiv.org/abs/2412.10444)
Keywords: generation
Abstract: The Next Best View (NBV) problem is a pivotal challenge in 3D robotic scanning, with the potential to greatly improve the efficiency of object capture and reconstruction. Current methods for determining the NBV often overlook view overlaps, assume a virtual origin point for the camera's focus, and rely on voxel representations of 3D data. To address these issues and improve the practicality of scanning unknown objects, we propose an NBV policy in which the next view explores the boundary of the scanned point cloud, and the overlap is intrinsically considered. The scanning distance or camera working distance is adjustable and flexible. To this end, a model-based approach is proposed where the next sensor positions are searched iteratively based on a reference model. A score is calculated by considering the overlaps between newly scanned and existing data, as well as the final convergence. Additionally, following the boundary exploration idea, a deep learning network, Boundary Exploration NBV network (BENBV-Net), is designed and proposed, which can be used to predict the NBV directly from the scanned data without requiring the reference model. It predicts the scores for given boundaries, and the boundary with the highest score is selected as the target point of the next best view. BENBV-Net improves the speed of NBV generation while maintaining the performance of the model-based approach. Our proposed methods are evaluated and compared with existing approaches on the ShapeNet, ModelNet, and 3D Repository datasets. Experimental results demonstrate that our approach outperforms others in terms of scanning efficiency and overlap, both of which are crucial for practical 3D scanning applications. The related code is released at this http URL.
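The selection rule stated in the NBV abstract (score each candidate boundary, then take the highest-scoring one as the target of the next view) amounts to an argmax over candidates. The sketch below shows only that selection step. The scoring function is passed in as a hypothetical callable `score_fn`, since the abstract does not specify how overlap and convergence are combined, and `select_next_view_target` is an illustrative name rather than anything from the released code.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]


def select_next_view_target(
    boundary_points: Sequence[Point],
    score_fn: Callable[[Point], float],  # hypothetical scorer (model-based or a learned network)
) -> Tuple[Point, float]:
    """Return the boundary point with the highest score, together with that score."""
    if not boundary_points:
        raise ValueError("at least one boundary point is required")
    scored = [(score_fn(p), p) for p in boundary_points]
    # max() keeps the first candidate on ties, so the result is deterministic.
    best_score, best_point = max(scored, key=lambda pair: pair[0])
    return best_point, best_score
```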

Title: Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning

Authors: Shihao Xu, Yiyang Luo, Wei Shi
Subjects: cs.CV, cs.AI, cs.CG
Abstract URL: https://arxiv.org/abs/2412.10455
Pdf URL: https://arxiv.org/pdf/2412.10455
Copy Paste: [[2412.10455]] Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning(https://arxiv.org/abs/2412.10455)
Keywords: generation
Abstract: Geometry mathematics problems pose significant challenges for large language models (LLMs) because they involve visual elements and spatial reasoning. Current methods primarily rely on symbolic character awareness to address these problems. Because geometry problem solving is a relatively nascent field with few suitable datasets, and there is currently almost no work on solid geometry problem solving, we collect a geometry question-answer dataset, referred to as GeoMath, by sourcing geometric data from Chinese high school education websites. It contains solid geometry questions and answers with accurate reasoning steps to complement existing plane geometry datasets. Additionally, we propose a Large Multi-modal Model (LMM) framework named Geo-LLaVA, which incorporates retrieval augmentation with supervised fine-tuning (SFT) in the training stage, called meta-training, and employs in-context learning (ICL) during inference to improve performance. Our fine-tuned model with ICL attains the state-of-the-art performance of 65.25% and 42.36% on selected questions of the GeoQA dataset and GeoMath dataset respectively with proper inference steps. Notably, our model gains an initial ability to solve solid geometry problems and supports the generation of reasonable solid geometry picture descriptions and problem-solving steps. Our research sets the stage for further exploration of LLMs in multi-modal math problem-solving, particularly in geometry math problems.

Title: Motion Generation Review: Exploring Deep Learning for Lifelike Animation with Manifold

Authors: Jiayi Zhao, Dongdong Weng, Qiuxin Du, Zeyu Tian
Subjects: cs.CV, cs.GR, cs.HC
Abstract URL: https://arxiv.org/abs/2412.10458
Pdf URL: https://arxiv.org/pdf/2412.10458
Copy Paste: [[2412.10458]] Motion Generation Review: Exploring Deep Learning for Lifelike Animation with Manifold(https://arxiv.org/abs/2412.10458)
Keywords: generation
Abstract: Human motion generation involves creating natural sequences of human body poses, widely used in gaming, virtual reality, and human-computer interaction. It aims to produce lifelike virtual characters with realistic movements, enhancing virtual agents and immersive experiences. While previous work has focused on motion generation based on signals like movement, music, text, or scene background, the complexity of human motion and its relationships with these signals often results in unsatisfactory outputs. Manifold learning offers a solution by reducing data dimensionality and capturing subspaces of effective motion. In this review, we present a comprehensive overview of manifold applications in human motion generation, one of the first in this domain. We explore methods for extracting manifolds from unstructured data, their application in motion generation, and discuss their advantages and future directions. This survey aims to provide a broad perspective on the field and stimulate new approaches to ongoing challenges.

Title: SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers

Authors: Zehao Chen, Rong Pan
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10488
Pdf URL: https://arxiv.org/pdf/2412.10488
Copy Paste: [[2412.10488]] SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers(https://arxiv.org/abs/2412.10488)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) are essential XML-based formats for versatile graphics, offering resolution independence and scalability. Unlike raster images, SVGs use geometric shapes and support interactivity, animation, and manipulation via CSS and JavaScript. Current SVG generation methods face challenges related to high computational costs and complexity. In contrast, human designers use component-based tools for efficient SVG creation. Inspired by this, SVGBuilder introduces a component-based, autoregressive model for generating high-quality colored SVGs from textual input. It significantly reduces computational overhead and improves efficiency compared to traditional methods. Our model generates SVGs up to 604 times faster than optimization-based approaches. To address the limitations of existing SVG datasets and support our research, we introduce ColorSVG-100K, the first large-scale dataset of colored SVGs, comprising 100,000 graphics. This dataset fills the gap in color information for SVG generation models and enhances diversity in model training. Evaluation against state-of-the-art models demonstrates SVGBuilder's superior performance in practical applications, highlighting its efficiency and quality in generating complex SVG graphics.

Title: CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Authors: Kaifan Zhang, Lihuo He, Xin Jiang, Wen Lu, Di Wang, Xinbo Gao
Subjects: cs.CV, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2412.10489
Pdf URL: https://arxiv.org/pdf/2412.10489
Copy Paste: [[2412.10489]] CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information(https://arxiv.org/abs/2412.10489)
Keywords: generative
Abstract: Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable "beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, we propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Expert Encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space. Using a pretrained generative model, the proposed framework can then reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively. Code: this https URL.

Title: SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Authors: Runtao Liu, Chen I Chieh, Jindong Gu, Jipeng Zhang, Renjie Pi, Qifeng Chen, Philip Torr, Ashkan Khakzar, Fabio Pizzati
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10493
Pdf URL: https://arxiv.org/pdf/2412.10493
Copy Paste: [[2412.10493]] SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation(https://arxiv.org/abs/2412.10493)
Keywords: generation, generative
Abstract: Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO). We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. SafetyDPO consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at this https URL.
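For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO objective can be written as a few lines of Python. Note the assumptions: this is the generic formulation for a preferred ("chosen") and dispreferred ("rejected") sample scored by a trainable policy and a frozen reference model, not the custom DPO strategy the SafetyDPO abstract refers to; the log-probabilities are taken as given scalars, and `beta=0.1` is an arbitrary illustrative value.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature; controls how far the policy may move from the reference
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favours each sample than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss falls below log 2 once the policy prefers the chosen sample more strongly than the reference does, and rises above it in the opposite case. In a safety setting the "chosen" sample would be the safe image and the "rejected" one the harmful image.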

Title: SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Authors: Yushu Wu, Zhixing Zhang, Yanyu Li, Yanwu Xu, Anil Kag, Yang Sui, Huseyin Coskun, Ke Ma, Aleksei Lebedev, Ju Hu, Dimitris Metaxas, Yanzhi Wang, Sergey Tulyakov, Jian Ren
Subjects: cs.CV, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2412.10494
Pdf URL: https://arxiv.org/pdf/2412.10494
Copy Paste: [[2412.10494]] SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device(https://arxiv.org/abs/2412.10494)
Keywords: generation
Abstract: We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by orders of magnitude while delivering on-par quality.

Title: The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

Authors: Changan Chen, Juze Zhang, Shrinidhi K. Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10523
Pdf URL: https://arxiv.org/pdf/2412.10523
Copy Paste: [[2412.10523]] The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion(https://arxiv.org/abs/2412.10523)
Keywords: generation
Abstract: Human communication is inherently multimodal, involving a combination of verbal and non-verbal cues such as speech, facial expressions, and body gestures. Modeling these behaviors is essential for understanding human interaction and for creating virtual characters that can communicate naturally in applications like games, films, and virtual reality. However, existing motion generation models are typically limited to specific input modalities -- either speech, text, or motion data -- and cannot fully leverage the diversity of available data. In this paper, we propose a novel framework that unifies verbal and non-verbal language using multimodal language models for human motion understanding and generation. This model is flexible in taking text, speech, and motion or any combination of them as input. Coupled with our novel pre-training strategy, our model not only achieves state-of-the-art performance on co-speech gesture generation but also requires much less data for training. Our model also unlocks an array of novel tasks such as editable gesture generation and emotion prediction from motion. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications, and language models offer a powerful approach to achieving this goal. Project page: this http URL.

Title: Solving the Inverse Alignment Problem for Efficient RLHF

Authors: Shambhavi Krishna, Aishwarya Sahoo
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.10529
Pdf URL: https://arxiv.org/pdf/2412.10529
Copy Paste: [[2412.10529]] Solving the Inverse Alignment Problem for Efficient RLHF(https://arxiv.org/abs/2412.10529)
Keywords: generation
Abstract: Collecting high-quality preference datasets for reinforcement learning from human feedback (RLHF) is resource-intensive and challenging. As a result, researchers often train reward models on extensive offline datasets which aggregate diverse generation sources and scoring/alignment policies. We hypothesize that this aggregation has an averaging effect on reward model scores, which limits signal and impairs the alignment process. Inspired by the field of inverse RL, we define the 'inverse alignment problem' in language model training, where our objective is to optimize the critic's reward for a fixed actor and a fixed offline preference dataset. We hypothesize that solving the inverse alignment problem will improve reward model quality by providing clearer feedback on the policy's current behavior. To that end, we investigate whether repeatedly fine-tuning a reward model on subsets of the offline preference dataset aligned with a periodically frozen policy during RLHF improves upon vanilla RLHF. Our empirical results demonstrate that this approach facilitates superior alignment and faster convergence compared to using an unaligned or out-of-distribution reward model relative to the LLM policy.
摘要：为强化学习从人类反馈 (RLHF) 收集高质量的偏好数据集需要大量资源且具有挑战性。因此，研究人员经常在大量离线数据集上训练奖励模型，这些数据集聚合了不同的生成源和评分/对齐策略。我们假设这种聚合对奖励模型分数有平均效应，从而限制信号并损害对齐过程。受逆 RL 领域的启发，我们在语言模型训练中定义了“逆对齐问题”，我们的目标是针对固定参与者和固定离线偏好数据集优化评论家的奖励。我们假设解决逆对齐问题将通过提供对策略当前行为的更清晰反馈来提高奖励模型质量。为此，我们研究在 RLHF 期间反复微调与定期冻结策略对齐的离线偏好数据集子集上的奖励模型是否会改进 vanilla RLHF。我们的实证结果表明，与使用相对于 LLM 策略的未对齐或分布外的奖励模型相比，这种方法有利于实现更好的对齐和更快的收敛。
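Note (illustrative sketch, not from the paper): reward models in RLHF are commonly trained on preference pairs with a Bradley-Terry style pairwise loss. The abstract does not say which loss the authors use, so the function below is an assumption that only shows what fine-tuning a reward model on a preference subset would minimize. The names `pairwise_loss` and `mean_loss` and the example scores are hypothetical.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected))."""
    d = r_chosen - r_rejected
    # Numerically stable evaluation of log(1 + exp(-d)).
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def mean_loss(pairs):
    """Average loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)

# Hypothetical reward scores for three preference pairs.
print(mean_loss([(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]))
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and equals log 2 when the two scores tie.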

Title: Towards Using Machine Learning to Generatively Simulate EV Charging in Urban Areas

Authors: Marek Miltner, Jakub Zíka, Daniel Vašata, Artem Bryksa, Magda Friedjungová, Ondřej Štogl, Ram Rajagopal, Oldřich Starý
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10531
Pdf URL: https://arxiv.org/pdf/2412.10531
Copy Paste: [[2412.10531]] Towards Using Machine Learning to Generatively Simulate EV Charging in Urban Areas(https://arxiv.org/abs/2412.10531)
Keywords: generative
Abstract: This study addresses the challenge of predicting electric vehicle (EV) charging profiles in urban locations with limited data. Utilizing a neural network architecture, we aim to uncover latent charging profiles influenced by spatio-temporal factors. Our model focuses on peak power demand and daily load shapes, providing insights into charging behavior. Our results indicate significant impacts from the type of Basic Administrative Units on predicted load curves, which contributes to the understanding and optimization of EV charging infrastructure in urban settings and allows Distribution System Operators (DSO) to more efficiently plan EV charging infrastructure expansion.
摘要：本研究解决了在数据有限的情况下预测城市地区电动汽车 (EV) 充电曲线的挑战。利用神经网络架构，我们旨在发现受时空因素影响的潜在充电曲线。我们的模型侧重于峰值电力需求和每日负载形状，从而深入了解充电行为。我们的结果表明，基本行政单位的类型对预测负载曲线有显著影响，这有助于理解和优化城市环境中的电动汽车充电基础设施，并允许配电系统运营商 (DSO) 更有效地规划电动汽车充电基础设施的扩建。

Title: SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10533
Pdf URL: https://arxiv.org/pdf/2412.10533
Copy Paste: [[2412.10533]] SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner(https://arxiv.org/abs/2412.10533)
Keywords: generation
Abstract: We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. In extensive experiments, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization compared to previous methods, demonstrating the effectiveness of our proposed method.
摘要：我们提出了一种零样本方法 SUGAR，用于主题驱动的视频定制。给定一张输入图像，SUGAR 能够为图像中包含的主题生成视频，并将生成内容与用户输入文本指定的任意视觉属性（例如风格和运动）对齐。与以前的方法不同，以前的方法需要在测试时进行微调，否则无法生成文本对齐的视频，而 SUGAR 无需在测试时增加额外成本即可获得出色的结果。为了实现零样本能力，我们引入了一个可扩展的管道来构建专为主题驱动定制而设计的合成数据集，从而产生了 250 万个图像-视频-文本三元组。此外，我们提出了几种方法来增强我们的模型，包括特殊注意设计、改进的训练策略和改进的采样算法。进行了广泛的实验。与以前的方法相比，SUGAR 在身份保存、视频动态和视频文本对齐方面取得了最先进的结果，用于主题驱动的视频定制，证明了我们提出的方法的有效性。

Title: RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation

Authors: Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
Subjects: cs.LG, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.10543
Pdf URL: https://arxiv.org/pdf/2412.10543
Copy Paste: [[2412.10543]] RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation(https://arxiv.org/abs/2412.10543)
Keywords: generation
Abstract: RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, RAGServe reduces the generation latency by $1.64-2.54\times$ without sacrificing generation quality.
摘要：RAG（检索增强生成）允许 LLM（大型语言模型）利用外部知识生成更好的响应，但使用更多的外部知识通常会以响应延迟为代价来提高生成质量。先前的工作要么减少响应延迟（通过更好地调度 RAG 查询），要么努力最大化质量（这涉及调整 RAG 工作流程），但它们在优化 RAG 响应的延迟和质量之间的权衡方面做得不够。本文介绍了 RAGServe，这是第一个联合调度查询并调整每个查询的关键 RAG 配置（例如检索到的文本块数量和合成方法）的 RAG 系统，以平衡质量优化和响应延迟减少。使用 4 个流行的 RAG-QA 数据集，我们表明，与最先进的 RAG 优化方案相比，RAGServe 在不牺牲生成质量的情况下将生成延迟降低了 1.64-2.54 倍。
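Note (illustrative sketch, not RAGServe's algorithm): the abstract describes choosing per-query configurations such as the number of retrieved chunks and the synthesis method so that quality and delay are balanced. A very simplified version of that decision is picking the highest-quality configuration that fits a delay budget. The `RagConfig` type, the synthesis labels, and all quality and delay numbers below are made-up assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    num_chunks: int       # retrieved text chunks passed to the LLM
    synthesis: str        # hypothetical label for the synthesis method
    est_quality: float    # estimated answer quality in [0, 1] (made-up)
    est_delay_s: float    # estimated response delay in seconds (made-up)

def pick_config(configs, delay_budget_s):
    """Return the highest-quality config within the budget, else the fastest one."""
    feasible = [c for c in configs if c.est_delay_s <= delay_budget_s]
    if feasible:
        # Break quality ties in favour of the lower delay.
        return max(feasible, key=lambda c: (c.est_quality, -c.est_delay_s))
    return min(configs, key=lambda c: c.est_delay_s)

CANDIDATES = [
    RagConfig(2, "stuff", 0.70, 1.0),
    RagConfig(8, "stuff", 0.82, 2.5),
    RagConfig(8, "map_reduce", 0.88, 4.0),
]

print(pick_config(CANDIDATES, delay_budget_s=3.0))
```

A real system would also need to estimate quality and delay per query and account for queueing across concurrent queries, which this sketch ignores.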

Title: Adaptive Sampling to Reduce Epistemic Uncertainty Using Prediction Interval-Generation Neural Networks

Authors: Giorgio Morales, John Sheppard
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.10570
Pdf URL: https://arxiv.org/pdf/2412.10570
Copy Paste: [[2412.10570]] Adaptive Sampling to Reduce Epistemic Uncertainty Using Prediction Interval-Generation Neural Networks(https://arxiv.org/abs/2412.10570)
Keywords: generation
Abstract: Obtaining high certainty in predictive models is crucial for making informed and trustworthy decisions in many scientific and engineering domains. However, extensive experimentation required for model accuracy can be both costly and time-consuming. This paper presents an adaptive sampling approach designed to reduce epistemic uncertainty in predictive models. Our primary contribution is the development of a metric that estimates potential epistemic uncertainty leveraging prediction interval-generation neural networks. This estimation relies on the distance between the predicted upper and lower bounds and the observed data at the tested positions and their neighboring points. Our second contribution is the proposal of a batch sampling strategy based on Gaussian processes (GPs). A GP is used as a surrogate model of the networks trained at each iteration of the adaptive sampling process. Using this GP, we design an acquisition function that selects a combination of sampling locations to maximize the reduction of epistemic uncertainty across the domain. We test our approach on three unidimensional synthetic problems and a multi-dimensional dataset based on an agricultural field for selecting experimental fertilizer rates. The results demonstrate that our method consistently converges faster to minimum epistemic uncertainty levels compared to Normalizing Flows Ensembles, MC-Dropout, and simple GPs.
摘要：在许多科学和工程领域，获得预测模型的高确定性对于做出明智和值得信赖的决策至关重要。然而，模型准确性所需的大量实验既昂贵又耗时。本文介绍了一种自适应采样方法，旨在减少预测模型中的认知不确定性。我们的主要贡献是开发一种利用预测区间生成神经网络来估计潜在认知不确定性的指标。该估计依赖于预测的上限和下限与测试位置及其相邻点的观测数据之间的距离。我们的第二个贡献是基于高斯过程 (GP) 的批量采样策略的提议。GP 用作自适应采样过程每次迭代时训练的网络的替代模型。使用此 GP，我们设计了一个采集函数，该函数选择采样位置的组合，以最大限度地减少整个领域的认知不确定性。我们在三个一维合成问题和一个基于农业领域的多维数据集上测试了我们的方法，以选择实验性施肥率。结果表明，与 Normalizing Flows Ensembles、MC-Dropout 和简单 GP 相比，我们的方法能够持续更快地收敛到最低认知不确定性水平。
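Note (illustrative sketch, not the paper's metric): prediction intervals are usually summarized by their empirical coverage and mean width, and a crude adaptive-sampling heuristic is to sample next where the predicted interval is widest. The paper's metric additionally uses distances to observed data and a GP-based batch acquisition function, neither of which is reproduced here. The function names are hypothetical.

```python
def interval_metrics(lower, upper, observed):
    """Return (coverage, mean_width) for prediction intervals [lower_i, upper_i]."""
    if not (len(lower) == len(upper) == len(observed)) or not lower:
        raise ValueError("inputs must be non-empty and of equal length")
    # Fraction of observations that fall inside their interval.
    covered = sum(lo <= y <= up for lo, up, y in zip(lower, upper, observed))
    total_width = sum(up - lo for lo, up in zip(lower, upper))
    n = len(observed)
    return covered / n, total_width / n

def widest_candidate(candidates, lower, upper):
    """Pick the candidate location whose predicted interval is widest."""
    return max(zip(candidates, lower, upper), key=lambda t: t[2] - t[1])[0]

coverage, mean_width = interval_metrics([0, 1, 2], [2, 3, 4], [1, 5, 3])
print(coverage, mean_width)
print(widest_candidate([0.1, 0.5, 0.9], [0, 0, 0], [1, 3, 2]))
```

Interval width mixes aleatoric and epistemic uncertainty, which is one reason a plain width heuristic is only a rough stand-in for an epistemic-uncertainty metric.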

Title: PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation

Authors: Lojze Žust, Matej Kristan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10589
Pdf URL: https://arxiv.org/pdf/2412.10589
Copy Paste: [[2412.10589]] PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation(https://arxiv.org/abs/2412.10589)
Keywords: generation
Abstract: Panoptic segmentation is a fundamental task in computer vision and a crucial component for perception in autonomous vehicles. Recent mask-transformer-based methods achieve impressive performance on standard benchmarks but face significant challenges with small objects, crowded scenes and scenes exhibiting a wide range of object scales. We identify several fundamental shortcomings of the current approaches: (i) the query proposal generation process is biased towards larger objects, resulting in missed smaller objects, (ii) initially well-localized queries may drift to other objects, resulting in missed detections, (iii) spatially well-separated instances may be merged into a single mask causing inconsistent and false scene interpretations. To address these issues, we rethink the individual components of the network and its supervision, and propose PanSR, a novel method for panoptic segmentation. PanSR effectively mitigates instance merging, enhances small-object detection and increases performance in crowded scenes, delivering a notable +3.4 PQ improvement over the state of the art on the challenging LaRS benchmark, while reaching state-of-the-art performance on Cityscapes. The code and models will be publicly available at this https URL.
摘要：全景分割是计算机视觉中的一项基本任务，也是自动驾驶汽车感知的关键组成部分。最近基于 mask-transformer 的方法在标准基准上取得了令人印象深刻的性能，但在处理小物体、拥挤场景和具有广泛物体尺度的场景时面临重大挑战。我们发现当前方法存在几个根本缺陷：(i) 查询提议生成过程偏向于较大的物体，导致错过较小的物体，(ii) 最初定位良好的查询可能会漂移到其他物体，导致错过检测，(iii) 空间上分离良好的实例可能会合并到单个 mask 中，从而导致不一致和错误的场景解释。为了解决这些问题，我们重新考虑了网络的各个组件及其监督，并提出了一种用于全景分割 PanSR 的新方法。PanSR 有效地缓解了实例合并，增强了小物体检测并提高了拥挤场景中的性能，在具有挑战性的 LaRS 基准上实现了比最新技术显着 +3.4 PQ 的改进，同时在 Cityscapes 上达到了最先进的性能。代码和模型将在此 https URL 上公开提供。
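Note (illustrative sketch): the "+3.4 PQ" figure refers to Panoptic Quality, which is conventionally reported on a 0-100 scale. Its standard definition divides the summed IoU of matched segment pairs by TP + 0.5·FP + 0.5·FN. The helper below assumes matching has already been done (a predicted and a ground-truth segment match when their IoU exceeds 0.5); the function name and the example numbers are hypothetical.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum(IoU over matched pairs) / (TP + 0.5 * FP + 0.5 * FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # no segments predicted or annotated
    return sum(matched_ious) / denom

# Two matched segments, one false positive, one missed segment.
print(100 * panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1))
```

Missed small objects count as false negatives and merged instances typically produce both a false negative and a lower IoU, so both failure modes listed in the abstract lower PQ.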

Title: Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

Authors: Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10594
Pdf URL: https://arxiv.org/pdf/2412.10594
Copy Paste: [[2412.10594]] Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics(https://arxiv.org/abs/2412.10594)
Keywords: generative
Abstract: Human perception of similarity across uni- and multimodal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related, tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of the UniSim-Bench tasks. This approach yields the highest average performance, and in some cases, even surpasses task-specific models. Nevertheless, these models still struggle with generalization to unseen tasks, highlighting the ongoing challenge of learning a robust, unified perceptual similarity metric capable of capturing the human notion of similarity. The code and models are available at this https URL.
摘要：人类对单模态和多模态输入之间的相似性的感知非常复杂，因此开发能够准确模仿它的自动化指标具有挑战性。通用视觉语言模型，例如 CLIP 和大型多模态模型 (LMM)，可以用作零样本感知指标，最近的几项研究已经开发出专门用于狭窄感知任务的模型。然而，现有感知指标与人类感知的一致程度仍不清楚。为了研究这个问题，我们引入了 UniSim-Bench，这是一个包含 7 个多模态感知相似性任务的基准，总共有 25 个数据集。我们的评估表明，虽然通用模型的平均表现相当不错，但它们在单个任务上的表现往往落后于专门的模型。相反，针对特定任务进行微调的指标无法很好地推广到看不见但相关的任务。作为实现统一多任务感知相似性指标的第一步，我们在 UniSim-Bench 任务的子集上微调了基于编码器和生成的视觉语言模型。这种方法可实现最高的平均性能，在某些情况下甚至超越特定任务模型。然而，这些模型仍然难以推广到未见过的任务，这凸显了学习能够捕捉人类相似性概念的强大、统一的感知相似性度量的持续挑战。代码和模型可在此 https URL 上找到。

Title: EvalGIM: A Library for Evaluating Generative Image Models

Authors: Melissa Hall, Oscar Mañas, Reyhane Askari, Mark Ibrahim, Candace Ross, Pietro Astolfi, Tariq Berrada Ifriqi, Marton Havasi, Yohann Benchetrit, Karen Ullrich, Carolina Braga, Abhishek Charnalia, Maeve Ryan, Mike Rabbat, Michal Drozdzal, Jakob Verbeek, Adriana Romero Soriano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10604
Pdf URL: https://arxiv.org/pdf/2412.10604
Copy Paste: [[2412.10604]] EvalGIM: A Library for Evaluating Generative Image Models(https://arxiv.org/abs/2412.10604)
Keywords: generative
Abstract: As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. However, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced "EvalGym"), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce "Evaluation Exercises" that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at this https URL.
摘要：随着文本到图像生成模型的使用增加，用于评估的自动基准测试方法的采用也随之增加。然而，虽然指标和数据集比比皆是，但很少有统一的基准测试库能够提供跨许多数据集和指标执行评估的框架。此外，日益强大的基准测试方法的快速引入要求评估库能够灵活地适应新的数据集和指标。最后，在综合评估以提供有关模型性能的可行结论方面仍然存在差距。为了实现统一、灵活和可操作的评估，我们引入了 EvalGIM（发音为“EvalGym”），这是一个用于评估生成图像模型的库。EvalGIM 广泛支持用于衡量文本到图像生成模型的质量、多样性和一致性的数据集和指标。此外，EvalGIM 的设计以用户自定义的灵活性为首要任务，并包含允许即插即用添加新数据集和指标的结构。为了获得可操作的评估见解，我们引入了“评估练习”，重点介绍了特定评估问题的要点。评估练习包含两种最先进的文本到图像生成模型评估方法的易于使用和可重复的实现：一致性-多样性-现实性帕累托前沿和跨组绩效差异的分解测量。EvalGIM 还包含评估练习，介绍了两种新的文本到图像生成模型分析方法：模型排名的稳健性分析和不同提示样式之间的平衡评估。我们鼓励使用 EvalGIM 进行文本到图像模型探索，并邀请在此 https URL 上做出贡献。

Title: UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

Authors: Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai, Xian-Sheng Hua
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2412.10680
Pdf URL: https://arxiv.org/pdf/2412.10680
Copy Paste: [[2412.10680]] UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval(https://arxiv.org/abs/2412.10680)
Keywords: generation
Abstract: Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen domains and classes without semantic labels, ensuring robust generalization. Existing methods commonly employ prompt tuning with pre-trained vision-language models but are inherently limited by static prompts, reducing adaptability. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation through a two-phase training strategy. First, Source Adapter Learning integrates class semantics with domain-specific visual knowledge using a Learnable Textual Semantic Template and optimizes Class and Domain Prompts via momentum updates and dual loss functions for robust alignment. Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts to evolving data distributions, enhancing both flexibility and generalization. During inference, only the image branch and generated prompts are used, eliminating reliance on textual inputs for highly efficient retrieval. Extensive benchmark experiments show that UCDR-Adapter consistently outperforms ProS in most cases and other state-of-the-art methods on UCDR, U(c)CDR, and U(d)CDR settings.
摘要：通用跨域检索 (UCDR) 从没有语义标签的未知域和类中检索相关图像，确保稳健的泛化。现有方法通常采用对预训练的视觉语言模型进行提示调整，但本质上受到静态提示的限制，从而降低了适应性。我们提出了 UCDR-Adapter，它通过两阶段训练策略使用适配器和动态提示生成增强了预训练模型。首先，源适配器学习使用可学习的文本语义模板将类语义与领域特定的视觉知识相结合，并通过动量更新和双损失函数优化类和域提示以实现稳健对齐。其次，目标提示生成通过关注掩码源提示来创建动态提示，从而实现对未知域和类的无缝适应。与之前的方法不同，UCDR-Adapter 可以动态适应不断变化的数据分布，从而增强灵活性和泛化能力。在推理过程中，仅使用图像分支和生成的提示，从而无需依赖文本输入即可实现高效检索。大量的基准实验表明，UCDR-Adapter 在大多数情况下始终优于 ProS，并且在 UCDR、U(c)CDR 和 U(d)CDR 设置上优于其他最先进的方法。

Title: Control of Overfitting with Physics

Authors: Sergei V. Kozyrev, Ilya A Lopatin, Alexander N Pechen
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.10716
Pdf URL: https://arxiv.org/pdf/2412.10716
Copy Paste: [[2412.10716]] Control of Overfitting with Physics(https://arxiv.org/abs/2412.10716)
Keywords: generative
Abstract: While there are many works on the applications of machine learning, few of them try to understand the theoretical justifications that explain their efficiency. In this work, overfitting control (or generalization property) in machine learning is explained using analogies from physics and biology. For stochastic gradient Langevin dynamics, we show that the Eyring formula of kinetic theory allows one to control overfitting in the algorithmic stability approach, when wide minima of the risk function with low free energy correspond to low overfitting. For the generative adversarial network (GAN) model, we establish an analogy between GAN and the predator-prey model in biology. An application of this analogy allows us to explain the selection of wide likelihood maxima and overfitting reduction for GANs.
摘要：虽然有许多关于机器学习应用的研究，但其中很少有人试图理解解释其效率的理论依据。在这项工作中，使用物理学和生物学的类比来解释机器学习中的过度拟合控制（或泛化特性）。对于随机梯度朗之万动力学，我们表明动力学理论的艾林公式允许在算法稳定性方法中控制过度拟合 - 当具有低自由能的风险函数的宽最小值对应于低过度拟合时。对于生成对抗网络 (GAN) 模型，我们在 GAN 和生物学中的捕食者-猎物模型之间建立了类比。这种类比的应用使我们能够解释 GAN 的宽似然最大值的选择和过度拟合减少。
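Note (illustrative sketch, not from the paper): a Langevin update moves the parameter along the negative gradient of a potential U and adds Gaussian noise, theta <- theta - (step/2)·grad U(theta) + sqrt(step)·xi. The toy below uses the exact gradient of U(theta) = theta²/2, whose target distribution is a standard normal; the stochastic-gradient variant discussed in the abstract would replace `grad_u` with a minibatch estimate. Function and variable names are hypothetical.

```python
import math
import random

def langevin_samples(grad_u, theta0, step, n_steps, seed=0):
    """Run the unadjusted Langevin recursion and return the visited states."""
    rng = random.Random(seed)
    theta = theta0
    out = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient drift plus injected Gaussian noise.
        theta = theta - 0.5 * step * grad_u(theta) + math.sqrt(step) * noise
        out.append(theta)
    return out

# U(theta) = theta^2 / 2, so grad U(theta) = theta.
samples = langevin_samples(lambda t: t, theta0=3.0, step=0.1, n_steps=20000)
kept = samples[1000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(round(mean, 2), round(var, 2))
```

With a finite step size the chain's stationary variance is slightly above 1, so the printed variance is only expected to be close to, not exactly, the target value.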

Title: GRID: Visual Layout Generation

Authors: Cong Wan, Xiangyang Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Yuhang He, Yihong Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10718
Pdf URL: https://arxiv.org/pdf/2412.10718
Copy Paste: [[2412.10718]] GRID: Visual Layout Generation(https://arxiv.org/abs/2412.10718)
Keywords: generation
Abstract: In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID transforms temporal sequences into grid layouts, enabling image generation models to process visual sequences holistically. To achieve both layout consistency and motion coherence, we develop a parallel flow-matching training strategy that combines layout matching and temporal losses, guided by a coarse-to-fine schedule that evolves from basic layouts to precise motion control. Our approach demonstrates remarkable efficiency, achieving up to 35× faster inference speeds while using 1/1000 of the computational resources compared to specialized models. Extensive experiments show that GRID exhibits exceptional versatility across diverse visual generation tasks, from Text-to-Video to 3D Editing, while maintaining its foundational image generation capabilities. This dual strength in both expanded applications and preserved core competencies establishes GRID as an efficient and versatile omni-solution for visual generation.
摘要：在本文中，我们介绍了 GRID，这是一种新颖的范式，它将广泛的视觉生成任务重新定义为排列网格的问题，类似于电影胶片。从本质上讲，GRID 将时间序列转换为网格布局，使图像生成模型能够整体处理视觉序列。为了实现布局一致性和运动连贯性，我们开发了一种并行流匹配训练策略，该策略结合了布局匹配和时间损失，由粗到细的计划引导，该计划从基本布局发展到精确的运动控制。我们的方法表现出卓越的效率，与专用模型相比，它实现了高达 35 倍的推理速度，同时使用的计算资源仅为其 1/1000。大量实验表明，GRID 在从文本到视频到 3D 编辑等各种视觉生成任务中表现出卓越的多功能性，同时保持了其基础的图像生成功能。在扩展应用程序和保留核心竞争力方面的双重优势使 GRID 成为一种高效且多功能的视觉生成全方位解决方案。
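Note (illustrative sketch, not the authors' code): the core idea of turning a temporal sequence into a grid layout can be pictured as tiling frames row by row into one larger image, like a film strip folded into rows. The helper below works on frames stored as nested lists of pixel values; the function name, the zero padding for unused cells, and the tiny example frames are assumptions.

```python
def tile_frames(frames, cols, pad=0):
    """Arrange equally sized 2D frames row-major into a single 2D grid image."""
    if not frames or cols < 1:
        raise ValueError("need at least one frame and cols >= 1")
    h, w = len(frames[0]), len(frames[0][0])
    rows = -(-len(frames) // cols)  # ceiling division
    grid = [[pad] * (cols * w) for _ in range(rows * h)]
    for idx, frame in enumerate(frames):
        r, c = divmod(idx, cols)
        for y in range(h):
            # Copy one row of the frame into its cell of the grid.
            grid[r * h + y][c * w:(c + 1) * w] = frame[y]
    return grid

# Four 1x2 "frames" laid out as a 2x2 grid of cells.
frames = [[[t, t]] for t in range(4)]
for row in tile_frames(frames, cols=2):
    print(row)
```

Reading the grid cells back in row-major order recovers the original frame order, which is what lets a single image model reason over the whole sequence at once.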

Title: OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving

Authors: Lianqing Zheng, Long Yang, Qunshu Lin, Wenjin Ai, Minghao Liu, Shouyi Lu, Jianan Liu, Hongze Ren, Jingyue Mo, Xiaokai Bai, Jie Bai, Zhixiong Ma, Xichan Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10734
Pdf URL: https://arxiv.org/pdf/2412.10734
Copy Paste: [[2412.10734]] OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving(https://arxiv.org/abs/2412.10734)
Keywords: generation
Abstract: The rapid advancement of deep learning has intensified the need for comprehensive data for use by autonomous driving algorithms. High-quality datasets are crucial for the development of effective data-driven autonomous driving solutions. Next-generation autonomous driving datasets must be multimodal, incorporating data from advanced sensors that feature extensive data coverage, detailed annotations, and diverse scene representation. To address this need, we present OmniHD-Scenes, a large-scale multimodal dataset that provides comprehensive omnidirectional high-definition data. The OmniHD-Scenes dataset combines data from 128-beam LiDAR, six cameras, and six 4D imaging radar systems to achieve full environmental perception. The dataset comprises 1501 clips, each approximately 30-s long, totaling more than 450K synchronized frames and more than 5.85 million synchronized sensor data points. We also propose a novel 4D annotation pipeline. To date, we have annotated 200 clips with more than 514K precise 3D bounding boxes. These clips also include semantic segmentation annotations for static scene elements. Additionally, we introduce a novel automated pipeline for generation of the dense occupancy ground truth, which effectively leverages information from non-key frames. Alongside the proposed dataset, we establish comprehensive evaluation metrics, baseline models, and benchmarks for 3D detection and semantic occupancy prediction. These benchmarks utilize surround-view cameras and 4D imaging radar to explore cost-effective sensor solutions for autonomous driving applications. Extensive experiments demonstrate the effectiveness of our low-cost sensor configuration and its robustness under adverse conditions. Data will be released at this https URL.
摘要：深度学习的快速发展加剧了自动驾驶算法对全面数据的需求。高质量的数据集对于开发有效的数据驱动的自动驾驶解决方案至关重要。下一代自动驾驶数据集必须是多模态的，结合来自先进传感器的数据，这些传感器具有广泛的数据覆盖范围、详细的注释和多样化的场景表示。为了满足这一需求，我们提出了 OmniHD-Scenes，这是一个提供全面全向高清数据的大规模多模态数据集。OmniHD-Scenes 数据集结合了来自 128 光束 LiDAR、六个摄像头和六个 4D 成像雷达系统的数据，以实现完整的环境感知。该数据集包含 1501 个剪辑，每个剪辑大约 30 秒长，总计超过 450K 个同步帧和超过 585 万个同步传感器数据点。我们还提出了一种新颖的 4D 注释流程。到目前为止，我们已经注释了 200 个剪辑，其中包含超过 514K 个精确的 3D 边界框。这些剪辑还包括静态场景元素的语义分割注释。此外，我们引入了一种新颖的自动化管道来生成密集占用地面实况，该管道有效地利用了非关键帧的信息。除了提出的数据集之外，我们还建立了全面的评估指标、基线模型和基准，用于 3D 检测和语义占用预测。这些基准利用环视摄像头和 4D 成像雷达来探索用于自动驾驶应用的经济高效的传感器解决方案。大量实验证明了我们的低成本传感器配置的有效性及其在恶劣条件下的稳健性。数据将在此 https URL 上发布。

Title: NeuralPLexer3: Physio-Realistic Biomolecular Complex Structure Prediction with Flow Models

Authors: Zhuoran Qiao, Feizhi Ding, Thomas Dresselhaus, Mia A. Rosenfeld, Xiaotian Han, Owen Howell, Aniketh Iyengar, Stephen Opalenski, Anders S. Christensen, Sai Krishna Sirumalla, Frederick R. Manby, Thomas F. Miller III, Matthew Welborn
Subjects: cs.LG, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2412.10743
Pdf URL: https://arxiv.org/pdf/2412.10743
Copy Paste: [[2412.10743]] NeuralPLexer3: Physio-Realistic Biomolecular Complex Structure Prediction with Flow Models(https://arxiv.org/abs/2412.10743)
Keywords: generative
Abstract: Structure determination is essential to a mechanistic understanding of diseases and the development of novel therapeutics. Machine-learning-based structure prediction methods have made significant advancements by computationally predicting protein and bioassembly structures from sequences and molecular topology alone. Despite substantial progress in the field, challenges remain to deliver structure prediction models to real-world drug discovery. Here, we present NeuralPLexer3 -- a physics-inspired flow-based generative model that achieves state-of-the-art prediction accuracy on key biomolecular interaction types and improves training and sampling efficiency compared to its predecessors and alternative methodologies. Examined through newly developed benchmarking strategies, NeuralPLexer3 excels in vital areas that are crucial to structure-based drug design, such as physical validity and ligand-induced conformational changes.
摘要：结构测定对于了解疾病的机制和开发新型疗法至关重要。基于机器学习的结构预测方法通过仅从序列和分子拓扑计算预测蛋白质和生物组装结构取得了重大进展。尽管该领域取得了实质性进展，但将结构预测模型应用于现实世界的药物发现仍然存在挑战。在这里，我们介绍了 NeuralPLexer3——一种受物理启发的基于流的生成模型，它在关键的生物分子相互作用类型上实现了最先进的预测精度，并且与其前身和其他方法相比提高了训练和采样效率。通过新开发的基准测试策略进行测试，NeuralPLexer3 在对基于结构的药物设计至关重要的关键领域表现出色，例如物理有效性和配体诱导的构象变化。

Title: VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Authors: Saksham Singh Kushwaha, Yapeng Tian
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.10768
Pdf URL: https://arxiv.org/pdf/2412.10768
Copy Paste: [[2412.10768]] VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation(https://arxiv.org/abs/2412.10768)
Keywords: generation
Abstract: Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, T2A or V2A methods cannot generate holistic sounds (onscreen and offscreen). This is because T2A cannot generate sounds aligning with onscreen objects, while V2A cannot generate semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with text and video. Previous approaches for joint text and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni-modal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. Our source code and pre-trained models will be released. Demo is available at: this https URL.
摘要：音频生成领域的最新进展主要集中在文本转音频 (T2A) 和视频转音频 (V2A) 任务上。然而，T2A 或 V2A 方法无法生成整体声音（屏幕上和屏幕外）。这是因为 T2A 无法生成与屏幕上对象一致的声音，而 V2A 无法生成语义完整（缺少屏幕外声音）。在这项工作中，我们解决了整体音频生成的任务：给定一个视频和一个文本提示，我们的目标是生成与视频在时间上同步并与文本和视频在语义上一致的屏幕上和屏幕外的声音。以前的文本和视频转音频联合生成方法经常受到模态偏差的影响，偏向一种模态而不是另一种模态。为了克服这一限制，我们引入了 VinTAGe，这是一种基于流的转换器模型，它联合考虑文本和视频来指导音频生成。我们的框架包含两个关键组件：视觉文本编码器和联合 VT-SiT 模型。为了减少模态偏差并提高生成质量，我们采用预训练的单模态文本转音频和视频转音频生成模型来提供额外指导。由于缺乏合适的基准，我们还引入了 VinTAGe-Bench，这是一个包含 636 个视频-文本-音频对的数据集，其中包含屏幕内和屏幕外的声音。我们在 VinTAGe-Bench 上进行的全面实验表明，联合文本和视觉交互对于整体音频生成必不可少。此外，VinTAGe 在 VGGSound 基准上取得了最先进的结果。我们的源代码和预训练模型即将发布。演示可在以下网址获得：此 https URL。

Title: Video Diffusion Transformers are In-Context Learners

Authors: Zhengcong Fei, Di Qiu, Changqian Yu, Debang Li, Mingyuan Fan, Xiang Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10783
Pdf URL: https://arxiv.org/pdf/2412.10783
Copy Paste: [[2412.10783]] Video Diffusion Transformers are In-Context Learners(https://arxiv.org/abs/2412.10783)
Keywords: generation
Abstract: This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation. Specifically, we propose a simple pipeline to leverage in-context generation: (i) concatenate videos along the spatial or time dimension, (ii) jointly caption multi-scene video clips from one source, and (iii) apply task-specific fine-tuning using carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, it allows for the creation of consistent multi-scene videos exceeding 30 seconds in duration, without additional computational overhead. Importantly, this method requires no modifications to the original models and results in high-fidelity video outputs that better align with prompt specifications and maintain role consistency. Our framework presents a valuable tool for the research community and offers critical insights for advancing product-level controllable video generation systems. The data, code, and model weights are publicly available at: this https URL.
摘要：本文研究了一种解决方案，用于实现视频扩散转换器的上下文功能，只需进行最少的调整即可激活。具体来说，我们提出了一个简单的流程来利用上下文生成：（i）沿空间或时间维度连接视频，（ii）为来自一个来源的多场景视频片段联合添加字幕，以及（iii）使用精心策划的小数据集应用特定于任务的微调。通过一系列不同的可控任务，我们定性地证明了现有的高级文本到视频模型可以有效地执行上下文生成。值得注意的是，它允许创建持续时间超过 30 秒的一致多场景视频，而无需额外的计算开销。重要的是，这种方法不需要对原始模型进行修改，从而产生高保真视频输出，更好地符合提示规范并保持角色一致性。我们的框架为研究界提供了一个有价值的工具，并为推进产品级可控视频生成系统提供了关键见解。数据、代码和模型权重可在以下网址公开获取：此 https URL。

Title: StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer

Authors: Pin-Yen Chiu, Dai-Jie Wu, Po-Hsun Chu, Chia-Hsuan Hsu, Hsiang-Chen Chiu, Chih-Yu Wang, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10785
Pdf URL: https://arxiv.org/pdf/2412.10785
Copy Paste: [[2412.10785]] StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer(https://arxiv.org/abs/2412.10785)
Keywords: generation
Abstract: Kinship face synthesis is a challenging problem due to the scarcity and low quality of the available kinship data. Existing methods often struggle to generate descendants with both high diversity and fidelity while precisely controlling facial attributes such as age and gender. To address these issues, we propose the Style Latent Diffusion Transformer (StyleDiT), a novel framework that integrates the strengths of StyleGAN with the diffusion model to generate high-quality and diverse kinship faces. In this framework, the rich facial priors of StyleGAN enable fine-grained attribute control, while our conditional diffusion model is used to sample a StyleGAN latent aligned with the kinship relationship of conditioning images by utilizing the advantage of modeling complex kinship relationship distribution. StyleGAN then handles latent decoding for final face generation. Additionally, we introduce the Relational Trait Guidance (RTG) mechanism, enabling independent control of influencing conditions, such as each parent's facial image. RTG also enables a fine-grained adjustment between the diversity and fidelity in synthesized faces. Furthermore, we extend the application to an unexplored domain: predicting a partner's facial images using a child's image and one parent's image within the same framework. Extensive experiments demonstrate that our StyleDiT outperforms existing methods by striking an excellent balance between generating diverse and high-fidelity kinship faces.
摘要：由于可用的亲属关系数据稀缺且质量低下，亲属关系人脸合成是一个具有挑战性的问题。现有方法通常难以生成具有高多样性和保真度的后代，同时精确控制年龄和性别等面部属性。为了解决这些问题，我们提出了风格潜在扩散变换器 (StyleDiT)，这是一个新颖的框架，它将 StyleGAN 的优势与扩散模型相结合，以生成高质量和多样化的亲属关系人脸。在这个框架中，StyleGAN 丰富的面部先验支持细粒度的属性控制，而我们的条件扩散模型用于利用建模复杂亲属关系分布的优势，对与条件图像的亲属关系一致的 StyleGAN 潜在数据进行采样。然后，StyleGAN 处理潜在解码以生成最终的人脸。此外，我们引入了关系特征指导 (RTG) 机制，可以独立控制影响条件，例如每个父母的面部图像。RTG 还可以在合成人脸的多样性和保真度之间进行细粒度的调整。此外，我们将该应用扩展到一个未开发的领域：在同一框架内使用孩子图像和父母之一图像预测伴侣的面部图像。大量实验表明，我们的 StyleDiT 通过在生成多样性和高保真度亲属面孔之间取得良好平衡，优于现有方法。

Title: Optimizing Few-Step Sampler for Diffusion Probabilistic Model

Authors: Jen-Yuan Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10786
Pdf URL: https://arxiv.org/pdf/2412.10786
Copy Paste: [[2412.10786]] Optimizing Few-Step Sampler for Diffusion Probabilistic Model(https://arxiv.org/abs/2412.10786)
Keywords: generation
Abstract: Diffusion Probabilistic Models (DPMs) have demonstrated exceptional capability of generating high-quality and diverse images, but their practical application is hindered by the intensive computational cost during inference. The DPM generation process requires solving a Probability-Flow Ordinary Differential Equation (PF-ODE), which involves discretizing the integration domain into intervals for numerical approximation. This corresponds to the sampling schedule of a diffusion ODE solver, and we notice that the solution from a first-order solver can be expressed as a convex combination of model outputs at all scheduled time-steps. We derive an upper bound for the discretization error of the sampling schedule, which can be efficiently optimized with Monte-Carlo estimation. Building on these theoretical results, we propose a two-phase alternating optimization algorithm. In Phase-1, the sampling schedule is optimized for the pre-trained DPM; in Phase-2, the DPM is further tuned on the selected time-steps. Experiments on a pre-trained DPM for the ImageNet64 dataset demonstrate that the proposed method consistently improves the baseline across various numbers of sampling steps.
摘要：扩散概率模型 (DPM) 已展现出生成高质量和多样化图像的卓越能力，但由于推理过程中计算成本高昂，其实际应用受到阻碍。DPM 生成过程需要求解概率流常微分方程 (PF-ODE)，这涉及将积分域离散化为区间以进行数值近似。这对应于扩散 ODE 求解器的采样计划，我们注意到一阶求解器的解可以表示为所有预定时间步长的模型输出的凸组合。我们推导出采样计划离散化误差的上限，可以使用蒙特卡罗估计进行有效优化。基于这些理论结果，我们提出了一种两阶段交替优化算法。在阶段 1 中，针对预训练的 DPM 优化采样计划；在阶段 2 中，DPM 进一步根据选定的时间步长进行调整。在针对 ImageNet64 数据集的预训练 DPM 上进行的实验表明，该方法能够在不同数量的采样步骤中持续改善基线。

Title: Reliable and superior elliptic Fourier descriptor normalization and its application software ElliShape with efficient image processing

Authors: Hui Wu (1,2,3,4), Jia-Jie Yang (1,3,4), Chao-Qun Li (5), Jin-Hua Ran (2,4,6), Ren-Hua Peng (6,7), Xiao-Quan Wang (1,2,3,4,6) ((1) Big Data and AI Biodiversity Conservation Research Center, Institute of Botany, Chinese Academy of Sciences, Beijing, China (2) State Key Laboratory of Plant Diversity and Specialty Crops and Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China (3) Plant Science Data Center, Chinese Academy of Sciences, Beijing, China (4) China National Botanical Garden, Beijing, China (5) School of Life Sciences, Qilu Normal University, Jinan, China (6) University of Chinese Academy of Sciences, Beijing, China (7) Key Laboratory of Noise and Vibration Control, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China)
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.10795
Pdf URL: https://arxiv.org/pdf/2412.10795
Copy Paste: [[2412.10795]] Reliable and superior elliptic Fourier descriptor normalization and its application software ElliShape with efficient image processing(https://arxiv.org/abs/2412.10795)
Keywords: generation
Abstract: Elliptic Fourier analysis (EFA) is a powerful tool for shape analysis, which is often employed in geometric morphometrics. However, the normalization of elliptic Fourier descriptors has persistently posed challenges in obtaining unique results in basic contour transformations, requiring extensive manual alignment. Additionally, contemporary contour/outline extraction methods often struggle to handle complex digital images. Here, we reformulated the procedure of EFDs calculation to improve computational efficiency and introduced a novel approach for EFD normalization, termed true EFD normalization, which remains invariant under all basic contour transformations. These improvements are crucial for processing large sets of contour curves collected from different platforms with varying transformations. Based on these improvements, we developed ElliShape, a user-friendly software. Particularly, the improved contour/outline extraction employs an interactive approach that combines automatic contour generation for efficiency with manual correction for essential modifications and refinements. We evaluated ElliShape's stability, robustness, and ease of use by comparing it with existing software using standard datasets. ElliShape consistently produced reliable reconstructed shapes and normalized EFD values across different contours and transformations, and it demonstrated superior performance in visualization and efficient processing of various digital images for contour this http URL output annotated images and EFDs could be utilized in deep learning-based data training, thereby advancing artificial intelligence in botany and offering innovative solutions for critical challenges in biodiversity conservation, species classification, ecosystem function assessment, and related critical issues.
摘要：椭圆傅里叶分析 (EFA) 是一种强大的形状分析工具，通常用于几何形态测量。然而，椭圆傅里叶描述符的归一化一直在为在基本轮廓变换中获得唯一结果带来挑战，需要大量的手动对齐。此外，当代轮廓/轮廓提取方法通常难以处理复杂的数字图像。在这里，我们重新制定了 EFD 计算程序以提高计算效率，并引入了一种新的 EFD 归一化方法，称为真正的 EFD 归一化，它在所有基本轮廓变换下保持不变。这些改进对于处理从具有不同变换的不同平台收集的大量轮廓曲线至关重要。基于这些改进，我们开发了用户友好的软件 ElliShape。特别是，改进的轮廓/轮廓提取采用了一种交互式方法，将自动轮廓生成与手动校正相结合以提高效率，从而进行必要的修改和改进。我们通过使用标准数据集将 ElliShape 与现有软件进行比较，评估了它的稳定性、稳健性和易用性。 ElliShape 在不同的轮廓和变换中始终如一地产生可靠的重建形状和归一化的 EFD 值，并且在可视化和高效处理各种数字图像方面表现出卓越的性能，用于轮廓此 http URL 输出注释图像和 EFD 可用于基于深度学习的数据训练，从而推动植物学人工智能的发展，并为生物多样性保护、物种分类、生态系统功能评估和相关关键问题中的关键挑战提供创新解决方案。

Title: Medical Manifestation-Aware De-Identification

Authors: Yuan Tian, Shuo Wang, Guangtao Zhai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10804
Pdf URL: https://arxiv.org/pdf/2412.10804
Copy Paste: [[2412.10804]] Medical Manifestation-Aware De-Identification(https://arxiv.org/abs/2412.10804)
Keywords: generation
Abstract: Face de-identification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, consisting of over 40,000 photo-realistic patient faces. MeMa is re-generated from massive real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy, while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task, by integrating data-driven medical semantic priors into the DeID procedure. Despite its conciseness and simplicity, our approach substantially outperforms previous ones. Dataset is available at this https URL.
摘要：人脸去识别 (DeID) 已在常见场景中得到广泛研究，但在医疗场景中的研究仍不足，主要是因为缺乏大规模患者人脸数据集。在本文中，我们发布了 MeMa，它包含 40,000 多张逼真的患者人脸。MeMa 是根据大量真实患者照片重新生成的。通过精心调整生成和数据过滤程序，MeMa 避免侵犯真实患者隐私，同时确保丰富且合理的医疗表现。我们招募了专业临床医生，使用粗粒度和细粒度标签对 MeMa 进行注释，从而构建了第一个医疗场景 DeID 基准。此外，我们通过将数据驱动的医疗语义先验集成到 DeID 程序中，为这种新的医疗感知 DeID 任务提出了一种基线方法。尽管我们的方法简洁明了，但它的性能远远优于以前的方法。数据集可在此 https URL 上获取。

Title: Diffusion Model from Scratch

Authors: Wang Zhen, Dong Yunyun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10824
Pdf URL: https://arxiv.org/pdf/2412.10824
Copy Paste: [[2412.10824]] Diffusion Model from Scratch(https://arxiv.org/abs/2412.10824)
Keywords: generative
Abstract: Diffusion generative models are currently the most popular generative models. However, their underlying modeling process is quite complex, and starting directly with the seminal paper Denoising Diffusion Probabilistic Models (DDPM) can be challenging. This paper aims to assist readers in building a foundational understanding of generative models by tracing the evolution from VAEs to DDPM through detailed mathematical derivations and a problem-oriented analytical approach. It also explores the core ideas and improvement strategies of current mainstream methodologies, providing guidance for undergraduate and graduate students interested in learning about diffusion models.
摘要：扩散生成模型是目前最流行的生成模型，但其底层建模过程相当复杂，直接从经典论文《去噪扩散概率模型（DDPM）》开始可能具有挑战性。本文旨在通过详细的数学推导和面向问题的分析方法追溯从 VAE 到 DDPM 的演变，帮助读者建立对生成模型的基础理解。它还探讨了当前主流方法论的核心思想和改进策略，为有兴趣学习扩散模型的本科生和研究生提供指导。

Title: Unbiased General Annotated Dataset Generation

Authors: Dengyang Jiang, Haoyu Wang, Lei Zhang, Wei Wei, Guang Dai, Mengmeng Wang, Jingdong Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10831
Pdf URL: https://arxiv.org/pdf/2412.10831
Copy Paste: [[2412.10831]] Unbiased General Annotated Dataset Generation(https://arxiv.org/abs/2412.10831)
Keywords: generation
Abstract: Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks. However, those manually collected images often exhibit bias, which is non-transferable across either categories or domains, thus causing the model's generalization capacity degeneration. To mitigate this problem, we present an unbiased general annotated dataset generation framework (ubGen). Instead of expensive manual collection, we aim at directly generating unbiased images with category annotations. To achieve this goal, we propose to leverage the advantage of a multimodal foundation model (e.g., CLIP), in terms of aligning images in an unbiased semantic space defined by language. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories belonging to the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we further cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated image. By leveraging these two loss functions, we can obtain an unbiased image generation model by simply fine-tuning a pre-trained diffusion model using only all category names in the target dataset as input. Experimental results confirm that, compared with the manually labeled dataset or other synthetic datasets, the utilization of our generated unbiased datasets leads to stable generalization capacity enhancement of different backbone networks across various tasks, especially in tasks where the manually labeled samples are scarce.
摘要：在包含大量带有类别注释的手动收集图像的通用带注释数据集（例如 ImageNet）上对骨干网络进行预训练已被证明对于增强下游视觉任务的泛化能力是必不可少的。然而，这些手动收集的图像通常表现出偏见，这种偏见无法跨类别或域传递，从而导致模型的泛化能力下降。为了缓解这个问题，我们提出了一个无偏的通用带注释数据集生成框架 (ubGen)。我们的目标是直接生成带有类别注释的无偏图像，而不是昂贵的手动收集。为了实现这一目标，我们建议利用多模态基础模型（例如 CLIP）的优势，在语言定义的无偏语义空间中对齐图像。具体来说，我们开发了一个双层语义对齐损失，它不仅以对抗学习的方式强制所有生成的图像与属于目标数据集的所有类别的语义分布一致，而且还要求每个生成的图像与其类别名称的语义描述相匹配。此外，我们进一步将现有的图像质量评分模型转化为质量保证损失，以保持生成图像的质量。通过利用这两个损失函数，我们只需使用目标数据集中的所有类别名称作为输入，对预训练的扩散模型进行微调，即可获得无偏图像生成模型。实验结果证实，与手动标记的数据集或其他合成数据集相比，利用我们生成的无偏数据集可以稳定地增强不同骨干网络在各种任务中的泛化能力，尤其是在手动标记样本稀缺的任务中。

Title: RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices

Authors: Wonkyo Choe, Yangfeng Ji, Felix Lin
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2412.10856
Pdf URL: https://arxiv.org/pdf/2412.10856
Copy Paste: [[2412.10856]] RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices(https://arxiv.org/abs/2412.10856)
Keywords: generation
Abstract: Non-transformer LLMs have achieved major breakthroughs toward deploying LLMs on resource-constrained platforms such as mobile robotics and wearables. Recently, a novel RNN-based LLM family, the Receptance Weighted Key Value (RWKV) models, has shown promising results in text generation on resource-constrained devices thanks to its computational efficiency. However, these models remain too large to be deployed on embedded devices due to their high parameter count. In this paper, we propose an efficient suite of compression techniques, tailored to the RWKV architecture. These techniques include low-rank approximation, sparsity predictors, and clustering head, designed to align with the model size. Our methods compress the RWKV models by 4.95--3.8x with only 2.95pp loss in accuracy.
摘要：为了在资源受限的平台（例如移动机器人和可穿戴设备）上部署 LLM，非 Transformer LLM 取得了重大突破。最近，一种基于 RNN 的新型 LLM 系列 Repentance Weighted Key Value (RWKV) 模型由于其计算效率而在资源受限的设备上文本生成中表现出良好的效果。然而，由于参数数量较多，这些模型仍然太大而无法部署在嵌入式设备上。在本文中，我们提出了一套高效的压缩技术，专门针对 RWKV 架构。这些技术包括低秩近似、稀疏预测器和聚类头，旨在与模型大小保持一致。我们的方法将 RWKV 模型压缩了 4.95--3.8 倍，而准确度仅损失 2.95pp。
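The low-rank approximation named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the matrix sizes, the values, and the helper names `matmul` and `param_count` are assumptions made for illustration. The idea shown is that a dense d x d projection of rank r can be stored as two thin factors, which reduces the parameter count from d*d to 2*d*r while producing the same output.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def param_count(*matrices):
    """Total number of stored entries across the given matrices."""
    return sum(len(m) * len(m[0]) for m in matrices)


# Hypothetical sizes: a 4 x 4 projection that happens to have rank 1.
A = [[1.0], [2.0], [3.0], [4.0]]   # d x r factor
B = [[0.5, -1.0, 0.0, 2.0]]        # r x d factor
W = matmul(A, B)                   # equivalent dense d x d matrix

x = [[1.0, 0.0, -1.0, 2.0]]        # one input row vector
dense_out = matmul(x, W)                   # uses d*d parameters
factored_out = matmul(matmul(x, A), B)     # uses 2*d*r parameters
```

In practice the factors would come from a truncated decomposition of a trained weight matrix, so the factored output would only approximate the dense one; here the matrix is exactly rank 1, so the two outputs coincide.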

Title: Zigzag Diffusion Sampling: The Path to Success Is Zigzag

Authors: Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, Zeke Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10891
Pdf URL: https://arxiv.org/pdf/2412.10891
Copy Paste: [[2412.10891]] Zigzag Diffusion Sampling: The Path to Success Is Zigzag(https://arxiv.org/abs/2412.10891)
Keywords: generation, generative
Abstract: Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we make three main contributions in this paper. First, we theoretically and empirically demonstrate that the conditional guidance gap between the denoising and inversion processes captures prompt-related semantic information. Second, motivated by theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel sampling method that leverages the guidance gap to accumulate semantic information step-by-step throughout the entire generation process, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. For example, Z-Sampling can even make DreamShaper achieve an HPSv2 winning rate higher than 94% over the original results. Moreover, Z-Sampling can further enhance existing diffusion models combined with other orthogonal methods, including Diffusion-DPO.
摘要：扩散模型是目前最流行的生成范式，它可以将条件信息注入生成路径，以引导潜在信息朝着期望的方向发展。然而，现有的文本到图像的扩散模型往往无法对于那些具有挑战性的提示保持高图像质量和高提示图像对齐度。为了缓解这个问题并增强现有的预训练扩散模型，我们在本文中主要做出了三项贡献。首先，我们从理论和经验上证明了去噪和反演过程之间的条件指导差距可以捕获与提示相关的语义信息。其次，在理论分析的启发下，我们推导出锯齿形扩散采样（Z-Sampling），这是一种新颖的采样方法，它利用指导差距在整个生成过程中逐步积累语义信息，从而获得更好的采样结果。此外，作为一种即插即用的方法，Z-Sampling 可以普遍应用于各种扩散模型（例如，加速模型和基于 Transformer 的模型），并且编码和计算成本非常有限。第三，我们进行了大量的实验，结果表明 Z-Sampling 可以普遍显著地提高各种基准数据集、扩散模型和性能评估指标的生成质量。例如，Z-Sampling 甚至可以使 DreamShaper 的 HPSv2 胜率高于原始结果的 94%。此外，Z-Sampling 可以结合其他正交方法（包括 Diffusion-DPO）进一步增强现有的扩散模型。

Title: Multi-Class and Multi-Task Strategies for Neural Directed Link Prediction

Authors: Claudio Moroni, Claudio Borile, Carolina Mattsson, Michele Starnini, André Panisson
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.10895
Pdf URL: https://arxiv.org/pdf/2412.10895
Copy Paste: [[2412.10895]] Multi-Class and Multi-Task Strategies for Neural Directed Link Prediction(https://arxiv.org/abs/2412.10895)
Keywords: generation
Abstract: Link Prediction is a foundational task in Graph Representation Learning, supporting applications like link recommendation, knowledge graph completion and graph generation. Graph Neural Networks have shown the most promising results in this domain and are currently the de facto standard approach to learning from graph data. However, a key distinction exists between Undirected and Directed Link Prediction: the former just predicts the existence of an edge, while the latter must also account for edge directionality and bidirectionality. This translates to Directed Link Prediction (DLP) having three sub-tasks, each defined by how training, validation and test sets are structured. Most research on DLP overlooks this trichotomy, focusing solely on the "existence" sub-task, where training and test sets are random, uncorrelated samples of positive and negative directed edges. Even in the works that recognize the aforementioned trichotomy, models fail to perform well across all three sub-tasks. In this study, we experimentally demonstrate that training Neural DLP (NDLP) models only on the existence sub-task, using methods adapted from Neural Undirected Link Prediction, results in parameter configurations that fail to capture directionality and bidirectionality, even after rebalancing edge classes. To address this, we propose three strategies that handle the three tasks simultaneously. Our first strategy, the Multi-Class Framework for Neural Directed Link Prediction (MC-NDLP) maps NDLP to a Multi-Class training objective. The second and third approaches adopt a Multi-Task perspective, either with a Multi-Objective (MO-DLP) or a Scalarized (S-DLP) strategy. Our results show that these methods outperform traditional approaches across multiple datasets and models, achieving equivalent or superior performance in addressing the three DLP sub-tasks.
摘要：链接预测是图表示学习中的一项基础任务，支持链接推荐、知识图谱补全和图谱生成等应用。图神经网络已在此领域显示出最有希望的结果，并且目前是从图数据中学习的事实标准方法。然而，无向链接预测和有向链接预测之间存在一个关键区别：前者仅预测边的存在，而后者还必须考虑边的方向性和双向性。这意味着有向链接预测 (DLP) 有三个子任务，每个子任务都由训练、验证和测试集的结构定义。大多数关于 DLP 的研究都忽略了这种三分法，仅关注“存在”子任务，其中训练集和测试集是正向和负向边的随机、不相关样本。即使在认识到上述三分法的研究中，模型也无法在这三个子任务中表现良好。在本研究中，我们通过实验证明，仅在存在子任务上训练神经 DLP (NDLP) 模型，使用改编自神经无向链接预测的方法，会导致参数配置无法捕捉方向性和双向性，即使在重新平衡边缘类别之后也是如此。为了解决这个问题，我们提出了三种同时处理这三个任务的策略。我们的第一种策略，神经定向链接预测多类框架 (MC-NDLP) 将 NDLP 映射到多类训练目标。第二种和第三种方法采用多任务视角，采用多目标 (MO-DLP) 或标量化 (S-DLP) 策略。我们的结果表明，这些方法在多个数据集和模型中的表现优于传统方法，在解决三个 DLP 子任务时实现了同等或更优异的性能。
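One way to picture a multi-class view of directed link prediction is to give every unordered node pair a single label that covers existence, direction, and bidirectionality together. The sketch below is an assumption made for illustration, not the MC-NDLP formulation itself: the four class names and the helpers `pair_class` and `label_pairs` are hypothetical, and the paper may define its classes differently.

```python
from itertools import combinations

# Hypothetical class ids for an unordered pair {u, v} with u < v.
NO_EDGE, FORWARD, REVERSE, BIDIRECTIONAL = 0, 1, 2, 3


def pair_class(u, v, edge_set):
    """Label the pair by which of the directed edges u->v and v->u exist."""
    fwd = (u, v) in edge_set
    rev = (v, u) in edge_set
    if fwd and rev:
        return BIDIRECTIONAL
    if fwd:
        return FORWARD
    if rev:
        return REVERSE
    return NO_EDGE


def label_pairs(nodes, edges):
    """Map every unordered node pair to one of the four classes."""
    edge_set = set(edges)
    return {
        (u, v): pair_class(u, v, edge_set)
        for u, v in combinations(sorted(nodes), 2)
    }


# Toy directed graph: 0 <-> 1, 1 -> 2, 3 -> 0.
edges = [(0, 1), (1, 0), (1, 2), (3, 0)]
labels = label_pairs(range(4), edges)
```

A classifier trained on such labels has to distinguish direction and reciprocity, not only whether some edge exists, which is the gap the abstract attributes to existence-only training.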

Title: Video Representation Learning with Joint-Embedding Predictive Architectures

Authors: Katrina Drozdov, Ravid Shwartz-Ziv, Yann LeCun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10925
Pdf URL: https://arxiv.org/pdf/2412.10925
Copy Paste: [[2412.10925]] Video Representation Learning with Joint-Embedding Predictive Architectures(https://arxiv.org/abs/2412.10925)
Keywords: generative
Abstract: Video representation learning is an increasingly important topic in machine learning research. We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning that employs variance and covariance regularization to avoid representation collapse. We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data. Specifically, they outperform representations obtained from a generative baseline on downstream tasks that require understanding of the underlying dynamics of moving objects in the videos. Additionally, we explore different ways to incorporate latent variables into the VJ-VCR framework that capture information about uncertainty in the future in non-deterministic settings.
摘要：视频表征学习是机器学习研究中一个越来越重要的主题。我们提出了具有方差-协方差正则化的视频 JEPA (VJ-VCR)：一种用于自监督视频表征学习的联合嵌入预测架构，它采用方差和协方差正则化来避免表征崩溃。我们表明，来自 VJ-VCR 的隐藏表征包含有关输入数据的抽象高级信息。具体而言，它们在需要了解视频中移动物体的底层动态的下游任务上优于从生成基线获得的表征。此外，我们探索了将潜在变量纳入 VJ-VCR 框架的不同方法，这些框架可捕获有关非确定性设置中未来不确定性的信息。
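Variance and covariance regularization can be sketched as two penalties on a batch of embeddings. The version below follows the commonly used VICReg-style formulation and is an assumption about the general technique, not the VJ-VCR code: the function name `vc_regularization`, the target `gamma`, and the constant `eps` are illustrative choices.

```python
import math


def vc_regularization(z, gamma=1.0, eps=1e-4):
    """Return (variance_loss, covariance_loss) for a batch of embeddings.

    z is a list of n embedding vectors, each of dimension d.
    The variance term pushes each dimension's standard deviation up to gamma.
    The covariance term pushes off-diagonal covariances toward zero.
    """
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    var_loss = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    cov_loss = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return var_loss, cov_loss


# A collapsed batch (all embeddings identical) is penalized by the variance term.
collapsed = [[1.0, 1.0] for _ in range(4)]
# A spread-out, decorrelated batch incurs no penalty.
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
# A batch whose two dimensions move together is penalized by the covariance term.
correlated = [[2.0, 2.0], [-2.0, -2.0], [1.0, 1.0], [-1.0, -1.0]]

collapsed_loss = vc_regularization(collapsed)
spread_loss = vc_regularization(spread)
correlated_loss = vc_regularization(correlated)
```

The collapsed batch is the failure mode the abstract refers to as representation collapse: every input maps to the same embedding, so the per-dimension variance is zero.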

Title: Progressive Compression with Universally Quantized Diffusion Models

Authors: Yibo Yang, Justus C. Will, Stephan Mandt
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.10935
Pdf URL: https://arxiv.org/pdf/2412.10935
Copy Paste: [[2412.10935]] Progressive Compression with Universally Quantized Diffusion Models(https://arxiv.org/abs/2412.10935)
Keywords: generation, generative
Abstract: Diffusion probabilistic models have achieved mainstream success in many generative modeling tasks, from image generation to inverse problem solving. A distinct feature of these models is that they correspond to deep hierarchical latent variable models optimizing a variational evidence lower bound (ELBO) on the data likelihood. Drawing on a basic connection between likelihood modeling and compression, we explore the potential of diffusion models for progressive coding, resulting in a sequence of bits that can be incrementally transmitted and decoded with progressively improving reconstruction quality. Unlike prior work based on Gaussian diffusion or conditional diffusion models, we propose a new form of diffusion model with uniform noise in the forward process, whose negative ELBO corresponds to the end-to-end compression cost using universal quantization. We obtain promising first results on image compression, achieving competitive rate-distortion and rate-realism results on a wide range of bit-rates with a single model, bringing neural codecs a step closer to practical deployment.
摘要：扩散概率模型已在许多生成建模任务中取得了主流成功，从图像生成到逆问题求解。这些模型的一个显着特征是它们对应于深度分层潜变量模型，优化数据似然的变分证据下限 (ELBO)。利用似然建模和压缩之间的基本联系，我们探索了扩散模型用于渐进编码的潜力，从而产生可以逐步传输和解码的比特序列，并逐步提高重建质量。与基于高斯扩散或条件扩散模型的先前工作不同，我们提出了一种在前向过程中具有均匀噪声的新型扩散模型，其负 ELBO 对应于使用通用量化的端到端压缩成本。我们在图像压缩方面获得了有希望的初步结果，使用单一模型在各种比特率上实现了具有竞争力的速率失真和速率真实性结果，使神经编解码器更接近实际部署。
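The link between uniform noise and universal quantization can be seen in a small scalar sketch. It assumes the standard subtractive-dither construction, in which encoder and decoder share a random offset `u`; the function names and the test value are hypothetical, and this is not the paper's codec.

```python
import random


def universal_quantize(x, u):
    """Encoder side: add the shared dither u, then round to an integer index."""
    return round(x + u)


def universal_dequantize(k, u):
    """Decoder side: subtract the same shared dither from the integer index."""
    return k - u


rng = random.Random(0)   # stands in for randomness shared by both sides
x = 3.14159              # arbitrary value to transmit
errors = []
for _ in range(10000):
    u = rng.uniform(-0.5, 0.5)
    k = universal_quantize(x, u)   # only this integer needs to be coded
    errors.append(universal_dequantize(k, u) - x)

mean_error = sum(errors) / len(errors)
max_abs_error = max(abs(e) for e in errors)
```

Each reconstruction error equals `round(x + u) - (x + u)`, so it stays within half a quantization step, and over many dither draws it behaves like uniform noise added to `x`. That is why a forward process with uniform noise can be matched to an actual coding procedure.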

Title: A Staged Deep Learning Approach to Spatial Refinement in 3D Temporal Atmospheric Transport

Authors: M. Giselle Fernández-Godino, Wai Tong Chung, Akshay A. Gowardhan, Matthias Ihme, Qingkai Kong, Donald D. Lucas, Stephen C. Myers
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.10945
Pdf URL: https://arxiv.org/pdf/2412.10945
Copy Paste: [[2412.10945]] A Staged Deep Learning Approach to Spatial Refinement in 3D Temporal Atmospheric Transport(https://arxiv.org/abs/2412.10945)
Keywords: super-resolution
Abstract: High-resolution spatiotemporal simulations effectively capture the complexities of atmospheric plume dispersion in complex terrain. However, their high computational cost makes them impractical for applications requiring rapid responses or iterative processes, such as optimization, uncertainty quantification, or inverse modeling. To address this challenge, this work introduces the Dual-Stage Temporal Three-dimensional UNet Super-resolution (DST3D-UNet-SR) model, a highly efficient deep learning model for plume dispersion prediction. DST3D-UNet-SR is composed of two sequential modules: the temporal module (TM), which predicts the transient evolution of a plume in complex terrain from low-resolution temporal data, and the spatial refinement module (SRM), which subsequently enhances the spatial resolution of the TM predictions. We train DST3D-UNet-SR using a comprehensive dataset derived from high-resolution large eddy simulations (LES) of plume transport. We propose the DST3D-UNet-SR model to significantly accelerate LES simulations of three-dimensional plume dispersion by three orders of magnitude. Additionally, the model demonstrates the ability to dynamically adapt to evolving conditions through the incorporation of new observational data, substantially improving prediction accuracy in high-concentration regions near the source. Keywords: Atmospheric sciences, Geosciences, Plume transport, 3D temporal sequences, Artificial intelligence, CNN, LSTM, Autoencoder, Autoregressive model, U-Net, Super-resolution, Spatial Refinement.
摘要：高分辨率时空模拟有效地捕捉了复杂地形中大气羽流扩散的复杂性。然而，它们的高计算成本使其不适用于需要快速响应或迭代过程的应用，例如优化、不确定性量化或逆向建模。为了应对这一挑战，这项工作引入了双阶段时间三维 UNet 超分辨率 (DST3D-UNet-SR) 模型，这是一种用于羽流扩散预测的高效深度学习模型。DST3D-UNet-SR 由两个连续模块组成：时间模块 (TM)，它从低分辨率时间数据预测复杂地形中羽流的瞬态演变，以及空间细化模块 (SRM)，它随后增强 TM 预测的空间分辨率。我们使用来自羽流传输高分辨率大涡模拟 (LES) 的综合数据集来训练 DST3DUNet-SR。我们提出了 DST3D-UNet-SR 模型，以显著加快三维羽流扩散的 LES 模拟三个数量级。此外，该模型还展示了通过结合新的观测数据动态适应不断变化的条件的能力，大大提高了源头附近高浓度区域的预测精度。关键词：大气科学、地球科学、羽流输送、3D 时间序列、人工智能、CNN、LSTM、自动编码器、自回归模型、U-Net、超分辨率、空间细化。

Title: SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Authors: Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10958
Pdf URL: https://arxiv.org/pdf/2412.10958
Copy Paste: [[2412.10958]] SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer(https://arxiv.org/abs/2412.10958)
Keywords: generation, generative
Abstract: Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiment demonstrates that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.
摘要：高效且高压缩比的图像标记化仍然是训练生成模型的关键挑战。我们提出了 SoftVQ-VAE，这是一种连续图像标记器，它利用软分类后验将多个代码字聚合到每个潜在标记中，从而大大增加了潜在空间的表示容量。当应用于基于 Transformer 的架构时，我们的方法使用少至 32 或 64 个一维标记来压缩 256x256 和 512x512 图像。SoftVQ-VAE 不仅表现出一致且高质量的重建，更重要的是，它还在不同的基于去噪的生成模型中实现了最先进且速度更快的图像生成结果。值得注意的是，SoftVQ-VAE 将生成 256x256 图像的推理吞吐量提高了 18 倍，将生成 512x512 图像的推理吞吐量提高了 55 倍，同时实现了 SiT-XL 的竞争性 FID 分数 1.78 和 2.21。它还通过将训练迭代次数减少 2.3 倍来提高生成模型的训练效率，同时保持相当的性能。凭借其完全可微分的设计和语义丰富的潜在空间，我们的实验表明 SoftVQ-VQE 实现了高效的标记化，而不会影响生成质量，为更高效的生成模型铺平了道路。代码和模型已发布。
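The idea of aggregating several codewords into one latent token through a soft categorical posterior can be sketched as a softmax-weighted average over a codebook. The distance-based logits, the `temperature` parameter, the toy codebook, and the name `soft_assign` are assumptions made for illustration; the actual SoftVQ-VAE parameterization may differ.

```python
import math


def soft_assign(z, codebook, temperature=1.0):
    """Return (posterior, token) for one encoder output z.

    posterior: softmax over negative squared distances to each codeword.
    token: posterior-weighted average of the codewords (a continuous latent).
    """
    logits = [
        -sum((a - b) ** 2 for a, b in zip(z, c)) / temperature for c in codebook
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = [sum(p * c[j] for p, c in zip(probs, codebook)) for j in range(len(z))]
    return probs, token


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy codebook of 3 codewords
z = [0.9, 0.1]

# Moderate temperature: the token blends several codewords.
probs, token = soft_assign(z, codebook, temperature=0.5)
# Very low temperature: the posterior becomes one-hot, recovering hard VQ.
hard_probs, hard_token = soft_assign(z, codebook, temperature=1e-3)
```

Because the token is a differentiable mixture and not a single selected codeword, gradients flow to every codeword with non-zero weight, which is consistent with the fully-differentiable design the abstract mentions.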

Title: FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction

Authors: Alex Morehead, Jianlin Cheng
Subjects: cs.LG, cs.AI, q-bio.BM, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.10966
Pdf URL: https://arxiv.org/pdf/2412.10966
Copy Paste: [[2412.10966]] FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction(https://arxiv.org/abs/2412.10966)
Keywords: generative
Abstract: Powerful generative models of protein-ligand structure have recently been proposed, but few of these methods support both flexible protein-ligand docking and affinity estimation. Of those that do, none can directly model multiple binding ligands concurrently or have been rigorously benchmarked on pharmacologically relevant drug targets, hindering their widespread adoption in drug discovery efforts. In this work, we propose FlowDock, a deep geometric generative model based on conditional flow matching that learns to directly map unbound (apo) structures to their bound (holo) counterparts for an arbitrary number of binding ligands. Furthermore, FlowDock provides predicted structural confidence scores and binding affinity values with each of its generated protein-ligand complex structures, enabling fast virtual screening of new (multi-ligand) drug targets. For the commonly-used PoseBusters Benchmark dataset, FlowDock achieves a 51% blind docking success rate using unbound (apo) protein input structures and without any information derived from multiple sequence alignments, and for the challenging new DockGen-E dataset, FlowDock matches the performance of single-sequence Chai-1 for binding pocket generalization. Additionally, in the ligand category of the 16th community-wide Critical Assessment of Techniques for Structure Prediction (CASP16), FlowDock ranked among the top-5 methods for pharmacological binding affinity estimation across 140 protein-ligand complexes, demonstrating the efficacy of its learned representations in virtual screening. Source code, data, and pre-trained models are available at this https URL.
摘要：最近提出了强大的蛋白质-配体结构生成模型，但这些方法中很少有既支持灵活的蛋白质-配体对接又支持亲和力估计的方法。在那些可以同时支持多种结合配体的方法中，没有一个能够直接同时建模多种结合配体，也没有一个能够严格地针对药理学相关的药物靶标进行基准测试，这阻碍了它们在药物发现工作中的广泛应用。在这项工作中，我们提出了 FlowDock，这是一种基于条件流匹配的深度几何生成模型，它可以学习将任意数量的结合配体的未结合 (apo) 结构直接映射到其结合 (holo) 对应物。此外，FlowDock 为其生成的每种蛋白质-配体复合物结构提供预测的结构置信度得分和结合亲和力值，从而能够快速虚拟筛选新的 (多配体) 药物靶标。对于常用的 PoseBusters Benchmark 数据集，FlowDock 使用未结合 (apo) 蛋白质输入结构，无需任何来自多序列比对的信息，实现了 51% 的盲对接成功率，而对于具有挑战性的新 DockGen-E 数据集，FlowDock 在结合口袋泛化方面的表现堪比单序列 Chai-1。此外，在第 16 届社区范围的结构预测技术关键评估 (CASP16) 的配体类别中，FlowDock 在 140 种蛋白质-配体复合物的药理学结合亲和力估计方法中名列前五，证明了其学习表征在虚拟筛选中的有效性。源代码、数据和预训练模型可在此 https URL 上找到。

Title: Towards Context-aware Convolutional Network for Image Restoration

Authors: Fangwei Hao, Ji Du, Weiyun Liang, Jing Xu, Xiaoxuan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11008
Pdf URL: https://arxiv.org/pdf/2412.11008
Copy Paste: [[2412.11008]] Towards Context-aware Convolutional Network for Image Restoration(https://arxiv.org/abs/2412.11008)
Keywords: restoration
Abstract: Image restoration (IR) is a long-standing task to recover a high-quality image from its corrupted observation. Recently, transformer-based algorithms and some attention-based convolutional neural networks (CNNs) have presented promising results on several IR tasks. However, existing convolutional residual building modules for IR have limited ability to map inputs into high-dimensional and non-linear feature spaces, and their local receptive fields have difficulty capturing long-range context information as Transformers do. Besides, CNN-based attention modules for IR either rely on abundant static parameters or have limited receptive fields. To address the first issue, we propose an efficient residual star module (ERSM) that includes a context-aware "star operation" (element-wise multiplication) to contextually map features into exceedingly high-dimensional and non-linear feature spaces, which greatly enhances representation learning. To address the second issue and further boost the extraction of contextual information, we propose a large dynamic integration module (LDIM) which possesses an extremely large receptive field. Thus, LDIM can dynamically and efficiently integrate more contextual information, which helps to further significantly improve the reconstruction performance. Integrating ERSM and LDIM into a U-shaped backbone, we propose a context-aware convolutional network (CCNet) with powerful learning ability for contextual high-dimensional mapping and abundant contextual information. Extensive experiments show that our CCNet with low model complexity achieves superior performance compared to other state-of-the-art IR methods on several IR tasks, including image dehazing, image motion deblurring, and image desnowing.
摘要：图像恢复 (IR) 是一项长期存在的任务，目的是从损坏的观察中恢复高质量的图像。最近，基于 Transformer 的算法和一些基于注意力的卷积神经网络 (CNN) 在多个 IR 任务上取得了有希望的结果。然而，现有的 IR 卷积残差构建模块在将输入映射到高维和非线性特征空间的能力有限，并且它们的局部感受野难以像 Transformer 一样捕获长距离上下文信息。此外，基于 CNN 的 IR 注意力模块要么面临静态丰富的参数，要么感受野有限。为了解决第一个问题，我们提出了一个高效的残差星模块 (ERSM)，其中包括上下文感知的“星号操作”(逐元素乘法)，以将特征上下文映射到极高维和非线性的特征空间，从而大大增强了表示学习。为了进一步促进上下文信息的提取，对于第二个问题，我们提出了一个具有极大感受野的大型动态集成模块 (LDIM)。因此，LDIM 可以动态高效地整合更多上下文信息，有助于进一步显著提高重建性能。我们将 ERSM 和 LDIM 集成到 U 形主干中，提出了一种上下文感知卷积网络 (CCNet)，该网络具有强大的上下文高维映射学习能力和丰富的上下文信息。大量实验表明，我们的低模型复杂度 CCNet 在图像去雾、图像运动去模糊和图像去雪等多个 IR 任务上与其他最先进的 IR 方法相比取得了优异的性能。
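
Note: the "star operation" named in the abstract is an element-wise multiplication of two feature branches. The sketch below is a minimal illustration, not the ERSM layer itself: the two linear projections, the helper names `linear` and `star`, and the toy weights are all assumptions made for the example. It shows why the product is non-linear: each output is a sum of pairwise products of the input entries.

```python
def linear(x, weight):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def star(x, w1, w2):
    """Star operation: element-wise product of two linear projections of x.

    Hypothetical helper; the real ERSM branch design is not given in the abstract.
    """
    return [a * b for a, b in zip(linear(x, w1), linear(x, w2))]


x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]   # branch 1: identity projection
w2 = [[1.0, 1.0], [1.0, -1.0]]  # branch 2: sum and difference of the inputs
print(star(x, w1, w2))          # each entry is quadratic in the entries of x
```

With these toy weights the first output equals x0 * (x0 + x1) and the second equals x1 * (x0 - x1), so a single multiplication already yields second-order features.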

Title: PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation

Authors: Zhendong Mi, Renming Zheng, Haowen Zhong, Yue Sun, Shaoyi Huang
Subjects: cs.LG, cs.AI, cs.AR, cs.SE
Abstract URL: https://arxiv.org/abs/2412.11014
Pdf URL: https://arxiv.org/pdf/2412.11014
Copy Paste: [[2412.11014]] PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation(https://arxiv.org/abs/2412.11014)
Keywords: generation, generative
Abstract: Recent advances in agentic LLMs have demonstrated remarkable automated Verilog code generation capabilities. However, existing approaches either demand substantial computational resources or rely on LLM-assisted single-agent prompt learning techniques, which, as we observe for the first time, suffer from a degeneration issue characterized by deteriorating generative performance and diminished error detection and correction capabilities. This paper proposes a novel multi-agent prompt learning framework to address these limitations and enhance code generation quality. We show for the first time that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities, resulting in higher-quality Verilog code generation. Experimental results show that the proposed method achieves 96.4% and 96.5% pass@10 scores on the VerilogEval Machine and Human benchmarks, respectively, while attaining 100% Syntax and 99.9% Functionality pass@5 metrics on the RTLLM benchmark.
摘要：代理 LLM 的最新进展已经展示了卓越的自动化 Verilog 代码生成能力。然而，现有的方法要么需要大量的计算资源，要么依赖于 LLM 辅助的单代理提示学习技术，我们首次观察到这些技术存在退化问题——其特点是生成性能下降以及错误检测和纠正能力下降。本文提出了一种新颖的多代理提示学习框架来解决这些限制并提高代码生成质量。我们首次证明多代理架构可以有效地降低退化风险，同时提高代码纠错能力，从而生成更高质量的 Verilog 代码。实验结果表明，所提出的方法在 VerilogEval Machine 和 Human 基准上分别可以达到 96.4% 和 96.5% 的 pass@10 分数，同时在 RTLLM 基准上达到 100% 的语法和 99.9% 的功能 pass@5 指标。
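
Note: the abstract reports pass@10 and pass@5 scores but does not say how they were computed. A common choice in code-generation evaluation is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The sketch below implements that estimator; that this paper uses it is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark-level score: average the per-problem estimates.
per_problem = [pass_at_k(20, c, 10) for c in (0, 1, 20)]
print(sum(per_problem) / len(per_problem))
```

The benchmark score is the mean of the per-problem values, so a problem with no passing sample contributes 0 regardless of k.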

Title: Exploring Diffusion and Flow Matching Under Generator Matching

Authors: Zeeshan Patel, James DeLoye, Lance Mathias
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.11024
Pdf URL: https://arxiv.org/pdf/2412.11024
Copy Paste: [[2412.11024]] Exploring Diffusion and Flow Matching Under Generator Matching(https://arxiv.org/abs/2412.11024)
Keywords: generative
Abstract: In this paper, we present a comprehensive theoretical comparison of diffusion and flow matching under the Generator Matching framework. Despite their apparent differences, both diffusion and flow matching can be viewed under the unified framework of Generator Matching. By recasting both diffusion and flow matching under the same generative Markov framework, we provide theoretical insights into why flow matching models can be more robust empirically and how novel model classes can be constructed by mixing deterministic and stochastic components. Our analysis offers a fresh perspective on the relationships between state-of-the-art generative modeling paradigms.
摘要：在本文中，我们在生成器匹配框架下对扩散和流匹配进行了全面的理论比较。尽管它们之间存在明显的差异，但扩散和流匹配都可以在生成器匹配的统一框架下查看。通过在同一个生成马尔可夫框架下重新塑造扩散和流匹配，我们提供了理论见解，说明为什么流匹配模型在经验上更具鲁棒性，以及如何通过混合确定性和随机性成分来构建新的模型类。我们的分析为最先进的生成建模范式之间的关系提供了新的视角。

Title: From Simple to Professional: A Combinatorial Controllable Image Captioning Agent

Authors: Xinran Wang, Muxi Diao, Baoteng Li, Haiwen Zhang, Kongming Liang, Zhanyu Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11025
Pdf URL: https://arxiv.org/pdf/2412.11025
Copy Paste: [[2412.11025]] From Simple to Professional: A Combinatorial Controllable Image Captioning Agent(https://arxiv.org/abs/2412.11025)
Keywords: generation
Abstract: The Controllable Image Captioning Agent (CapAgent) is an innovative system designed to bridge the gap between user simplicity and professional-level outputs in image captioning tasks. CapAgent automatically transforms user-provided simple instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as object detection tools and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and showcases its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at this https URL.
摘要：可控图像字幕代理 (CapAgent) 是一种创新系统，旨在弥补图像字幕任务中用户简单性和专业级输出之间的差距。CapAgent 会自动将用户提供的简单指令转换为详细的专业指令，从而实现精确且具有上下文感知的字幕生成。通过利用多模态大型语言模型 (MLLM) 和外部工具（例如对象检测工具和搜索引擎），系统可确保字幕符合指定的准则，包括情绪、关键字、焦点和格式。CapAgent 透明地控制字幕过程的每个步骤，并在每一步展示其推理和工具使用情况，从而培养用户的信任和参与度。项目代码可在此 https URL 上找到。

Title: SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation

Authors: Hang Zhang, Zhuoling Li, Jun Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11026
Pdf URL: https://arxiv.org/pdf/2412.11026
Copy Paste: [[2412.11026]] SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation(https://arxiv.org/abs/2412.11026)
Keywords: generation
Abstract: Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video's spatio-temporal information. To further improve the LLM's ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM's reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
摘要：动态场景包含复杂的时空信息，这对于移动机器人、无人机和自动驾驶系统做出明智的决策至关重要。由于时空复杂性不断波动，将这些场景解析为语义三元组 <主语-谓语-宾语> 以进行准确的场景图生成 (SGG) 极具挑战性。受大型语言模型 (LLM) 推理能力的启发，我们提出了 SceneLLM，这是一个新颖的框架，它利用 LLM 作为动态 SGG 的强大场景分析器。我们的框架引入了一个视频到语言 (V2L) 映射模块，可将视频帧转换为语言信号（场景标记），使输入更易于 LLM 理解。为了更好地编码空间信息，我们设计了一种空间信息聚合 (SIA) 方案，该方案受到汉字结构的启发，将空间数据编码为标记。使用最佳传输 (OT)，我们从帧级标记序列生成隐式语言信号，该序列捕获视频的时空信息。为了进一步提高 LLM 处理这种隐式语言输入的能力，我们应用低秩自适应 (LoRA) 来微调模型。最后，我们使用基于转换器的 SGG 预测器来解码 LLM 的推理并预测语义三元组。我们的方法在动作基因组 (AG) 基准上取得了最先进的结果，大量实验表明 SceneLLM 在理解和生成准确的动态场景图方面的有效性。

Title: AURORA: Automated Unleash of 3D Room Outlines for VR Applications

Authors: Huijun Han, Yongqing Liang, Yuanlong Zhou, Wenping Wang, Edgar J. Rojas-Munoz, Xin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11033
Pdf URL: https://arxiv.org/pdf/2412.11033
Copy Paste: [[2412.11033]] AURORA: Automated Unleash of 3D Room Outlines for VR Applications(https://arxiv.org/abs/2412.11033)
Keywords: generation
Abstract: Creating realistic VR experiences is challenging due to the labor-intensive process of accurately replicating real-world details into virtual scenes, highlighting the need for automated methods that maintain spatial accuracy and provide design flexibility. In this paper, we propose AURORA, a novel method that leverages RGB-D images to automatically generate both purely virtual reality (VR) scenes and VR scenes combined with real-world elements. This approach can benefit designers by streamlining the process of converting real-world details into virtual scenes. AURORA integrates advanced techniques in image processing, segmentation, and 3D reconstruction to efficiently create realistic and detailed interior designs from real-world environments. The design of this integration ensures optimal performance and precision, addressing key challenges in automated indoor design generation by uniquely combining and leveraging the strengths of foundation models. We demonstrate the effectiveness of our approach through experiments, both on self-captured data and public datasets, showcasing its potential to enhance virtual reality (VR) applications by providing interior designs that conform to real-world positioning.
摘要：创建逼真的 VR 体验具有挑战性，因为将现实世界的细节准确地复制到虚拟场景中是一个劳动密集型的过程，这凸显了对保持空间准确性和提供设计灵活性的自动化方法的需求。在本文中，我们提出了 AURORA，这是一种利用 RGB-D 图像自动生成纯虚拟现实 (VR) 场景和结合现实世界元素的 VR 场景的新方法。这种方法可以简化将现实世界细节转换为虚拟场景的过程，从而使设计师受益。AURORA 集成了图像处理、分割和 3D 重建方面的先进技术，可从现实世界环境中高效地创建逼真且详细的室内设计。这种集成的设计确保了最佳性能和精度，通过独特地结合和利用基础模型的优势，解决了自动室内设计生成中的关键挑战。我们通过对自捕获数据和公共数据集的实验证明了我们方法的有效性，展示了它通过提供符合现实世界定位的室内设计来增强虚拟现实 (VR) 应用的潜力。

Title: Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

Authors: Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11044
Pdf URL: https://arxiv.org/pdf/2412.11044
Copy Paste: [[2412.11044]] Understanding and Mitigating Memorization in Diffusion Models for Tabular Data(https://arxiv.org/abs/2412.11044)
Keywords: generation
Abstract: Tabular data generation has attracted significant research interest in recent years, with tabular diffusion models greatly improving the quality of synthetic data. However, while memorization, where models inadvertently replicate exact or near-identical training data, has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with the number of training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation.
摘要：近年来，表格数据生成引起了广泛的研究兴趣，表格扩散模型极大地提高了合成数据的质量。然而，虽然记忆（模型无意中复制完全相同或几乎相同的训练数据）已在图像和文本生成中得到彻底研究，但它对表格数据的影响仍未得到充分探索。在本文中，我们首次全面研究了表格数据扩散模型中的记忆现象。我们的实证分析表明，记忆出现在表格扩散模型中，并且随着训练周期的增加而增加。我们进一步研究了数据集大小、特征维度和不同扩散模型等因素对记忆的影响。此外，我们还从理论角度解释了为什么记忆会出现在表格扩散模型中。为了解决这个问题，我们提出了 TabCutMix，这是一种简单而有效的数据增强技术，可在随机的同类训练样本对之间交换随机选择的特征段。在此基础上，我们引入了 TabCutMixPlus，这是一种增强方法，它基于特征相关性对特征进行聚类，并确保在增强过程中同一聚类内的特征一起交换。这种聚类机制通过保持特征一致性来缓解分布不均 (OOD) 生成问题。跨各种数据集和扩散模型的实验结果表明，TabCutMix 可有效缓解记忆，同时保持高质量的数据生成。
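
Note: the abstract describes TabCutMix as exchanging randomly selected feature segments between random same-class training sample pairs. The sketch below is one possible reading, not the authors' implementation: the per-feature Bernoulli mask, the `swap_prob` parameter, and the function name `tab_cut_mix` are assumptions made for illustration.

```python
import random


def tab_cut_mix(rows, labels, rng, swap_prob=0.5):
    """Return one augmented row and its label.

    Picks two distinct rows of the same class and copies a random subset of
    feature positions from the second row into the first (assumed mask scheme).
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Only classes with at least two rows can form a same-class pair.
    eligible = [ids for ids in by_class.values() if len(ids) >= 2]
    if not eligible:
        raise ValueError("need at least one class with two or more rows")
    i, j = rng.sample(rng.choice(eligible), 2)
    mask = [rng.random() < swap_prob for _ in rows[i]]
    mixed = [b if m else a for a, b, m in zip(rows[i], rows[j], mask)]
    return mixed, labels[i]


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
labels = [0, 0, 1, 1]
print(tab_cut_mix(rows, labels, random.Random(0)))
```

Because both source rows share a label, the mixed row keeps that label. TabCutMixPlus, as described in the abstract, would additionally force correlated features to be swapped together; that clustering step is not shown here.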

Title: RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models

Authors: Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11050
Pdf URL: https://arxiv.org/pdf/2412.11050
Copy Paste: [[2412.11050]] RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models(https://arxiv.org/abs/2412.11050)
Keywords: generation
Abstract: Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs' ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text has increased by 5.22%. The F1-score in ROUGE-L has increased by 39.91%, the Precision has increased by 55.80%, and the Recall has increased by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.
摘要：理解和解决极端情况对于确保自动驾驶系统的安全性和可靠性至关重要。视觉语言模型 (VLM) 在增强场景理解方面发挥着至关重要的作用，但它们面临着重大挑战，例如幻觉和现实世界基础不足，这会影响它们在关键驾驶场景中的表现。在这项工作中，我们提出了 RAC3，这是一个旨在提高 VLM 有效处理极端情况的能力的新框架。该框架集成了检索增强生成 (RAG)，通过动态整合特定于上下文的外部知识来缓解幻觉。RAC3 的基石是其跨模态对齐微调，它利用对比学习将图像文本对嵌入到统一的语义空间中，从而实现对类似场景的稳健检索。我们使用精选的极端情况数据集通过大量实验评估 RAC3，展示了其增强语义对齐、改善幻觉缓解和实现卓越性能指标（例如余弦相似度和 ROUGE-L 分数）的能力。例如，对于 LLaVA-v1.6-34B VLM，生成文本与参考文本之间的余弦相似度增加了 5.22%。ROUGE-L 中的 F1 分数增加了 39.91%，准确率增加了 55.80%，召回率增加了 13.74%。这项工作强调了检索增强 VLM 在提高复杂环境中自动驾驶的稳健性和安全性方面的潜力。

Title: DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces

Authors: Jacob F. Pettit, Chak Shing Lee, Jiachen Yang, Alex Ho, Daniel Faissol, Brenden Petersen, Mikel Landajuela
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.11051
Pdf URL: https://arxiv.org/pdf/2412.11051
Copy Paste: [[2412.11051]] DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces(https://arxiv.org/abs/2412.11051)
Keywords: generative
Abstract: We consider the challenge of black-box optimization within hybrid discrete-continuous and variable-length spaces, a problem that arises in various applications, such as decision tree learning and symbolic regression. We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel approach that uses a generative model to learn a joint distribution over discrete and continuous design variables to sample new hybrid designs. In contrast to standard decoupled approaches, in which the discrete and continuous variables are optimized separately, our joint optimization approach uses fewer objective function evaluations, is robust against non-differentiable objectives, and learns from prior samples to guide the search, leading to significant improvement in performance and sample efficiency. Our experiments on a diverse set of optimization tasks demonstrate that the advantages of DisCo-DSO become increasingly evident as the complexity of the problem increases. In particular, we illustrate DisCo-DSO's superiority over the state-of-the-art methods for interpretable reinforcement learning with decision trees.
摘要：我们考虑了离散-连续和可变长度混合空间中的黑盒优化挑战，这是决策树学习和符号回归等各种应用中都会出现的问题。我们提出了 DisCo-DSO（离散-连续深度符号优化），这是一种新颖的方法，它使用生成模型来学习离散和连续设计变量的联合分布，以对新的混合设计进行采样。与分别优化离散和连续变量的标准解耦方法相比，我们的联合优化方法使用更少的目标函数评估，对不可微分目标具有鲁棒性，并从先前的样本中学习以指导搜索，从而显着提高性能和样本效率。我们在一系列不同的优化任务上进行的实验表明，随着问题复杂性的增加，DisCo-DSO 的优势变得越来越明显。特别是，我们说明了 DisCo-DSO 优于最先进的决策树可解释强化学习方法。

Title: Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track

Authors: Deepak Gupta, Dina Demner-Fushman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11056
Pdf URL: https://arxiv.org/pdf/2412.11056
Copy Paste: [[2412.11056]] Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track(https://arxiv.org/abs/2412.11056)
Keywords: generation
Abstract: One of the key goals of artificial intelligence (AI) is the development of a multimodal system that facilitates communication with the visual world (image and video) using a natural language query. Earlier works on medical question answering primarily focused on textual and visual (image) modalities, which may be inefficient in answering questions requiring demonstration. In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. Improvements have been made in numerous vision-and-language tasks, such as visual captioning, visual question answering, and natural language video localization. Most of the existing work on language vision focused on creating datasets and developing solutions for open-domain applications. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. With increasing interest in AI to support clinical decision-making and improve patient engagement, there is a need to explore such challenges and develop efficient algorithms for medical language-video understanding and generation. Toward this, we introduced new tasks to foster research toward designing systems that can understand medical videos to provide visual answers to natural language questions, and are equipped with multimodal capability to generate instruction steps from the medical video. These tasks have the potential to support the development of sophisticated downstream applications that can benefit the public and medical professionals.
摘要：人工智能 (AI) 的主要目标之一是开发一种多模态系统，该系统使用自然语言查询来促进与视觉世界（图像和视频）的交流。早期关于医学问答的研究主要集中在文本和视觉（图像）模态上，这可能在回答需要演示的问题时效率低下。近年来，由于引入了大规模语言视觉数据集，并开发了有效的深度神经技术来弥合语言和视觉理解之间的差距，取得了重大进展。许多视觉和语言任务都得到了改进，例如视觉字幕、视觉问答和自然语言视频定位。现有的大多数语言视觉研究都集中在创建数据集和开发开放域应用程序的解决方案上。我们相信医学视频可以为许多急救、医疗紧急情况和医学教育问题提供最佳答案。随着人们对人工智能支持临床决策和提高患者参与度的兴趣日益浓厚，有必要探索此类挑战并开发有效的医学语言视频理解和生成算法。为此，我们引入了新任务，以促进研究设计能够理解医学视频的系统，以便为自然语言问题提供视觉答案，并配备多模式功能，可以从医学视频中生成指导步骤。这些任务有可能支持开发复杂的下游应用程序，使公众和医疗专业人员受益。

Title: HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation

Authors: Tengfei Liu, Jiapu Wang, Yongli Hu, Mingjie Li, Junfei Yi, Xiaojun Chang, Junbin Gao, Baocai Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11070
Pdf URL: https://arxiv.org/pdf/2412.11070
Copy Paste: [[2412.11070]] HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation(https://arxiv.org/abs/2412.11070)
Keywords: generation
Abstract: Radiology report generation (RRG) models typically focus on individual exams, often overlooking the integration of historical visual or textual data, which is crucial for patient follow-ups. Traditional methods usually struggle with long sequence dependencies when incorporating historical information, but large language models (LLMs) excel at in-context learning, making them well-suited for analyzing longitudinal medical data. In light of this, we propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for RRG, empowering LLMs with longitudinal report generation capabilities by constraining the consistency and differences between longitudinal images and their corresponding reports. Specifically, our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Then, we ensure consistent representation by applying intra-modality similarity constraints and aligning various features across modalities with multimodal contrastive and structural constraints. These combined constraints effectively guide the LLMs in generating diagnostic reports that accurately reflect the progression of the disease, achieving state-of-the-art results on the Longitudinal-MIMIC dataset. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models, enhancing its versatility.
摘要：放射学报告生成 (RRG) 模型通常侧重于单个检查，往往忽略了历史视觉或文本数据的整合，而这对于患者随访至关重要。传统方法在整合历史信息时通常会遇到长序列依赖性问题，但大型语言模型 (LLM) 擅长上下文学习，因此非常适合分析纵向医学数据。鉴于此，我们为 RRG 提出了一种新颖的历史约束大型语言模型 (HC-LLM) 框架，通过约束纵向图像与其相应报告之间的一致性和差异，为 LLM 提供纵向报告生成功能。具体而言，我们的方法从纵向胸部 X 光片和诊断报告中提取时间共享和时间特定特征以捕捉疾病进展。然后，我们通过应用模态内相似性约束并使用多模态对比和结构约束跨模态对齐各种特征来确保一致的表示。这些组合约束有效地指导 LLM 生成准确反映疾病进展的诊断报告，在 Longitudinal-MIMIC 数据集上取得最佳结果。值得注意的是，我们的方法即使在测试期间没有历史数据也能表现良好，并且可以轻松适应其他多模式大型模型，从而增强其多功能性。

Title: Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model

Authors: Yujun Li, Hongyuan Zhang, Yuan Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11075
Pdf URL: https://arxiv.org/pdf/2412.11075
Copy Paste: [[2412.11075]] Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model(https://arxiv.org/abs/2412.11075)
Keywords: generation
Abstract: Graph contrastive learning (GCL) aims to learn representations from unlabeled graph data in a self-supervised manner and has developed rapidly in recent years. However, edge-level contrasts are not well explored by most existing GCL methods. Most studies in GCL only regard edges as auxiliary information while updating node features. One of the primary obstacles of edge-based GCL is the heavy computation burden. To tackle this issue, we propose a model that can efficiently learn edge features for GCL, namely Augmentation-Free Edge Contrastive Learning (AFECL), to achieve edge-edge contrast. AFECL depends on no augmentation and consists of two parts. Firstly, we design a novel edge feature generation method, where edge features are computed by concatenating the embeddings of their connected nodes. Secondly, an edge contrastive learning scheme is developed, where edges connecting the same nodes are defined as positive pairs, and other edges are defined as negative pairs. Experimental results show that compared with recent state-of-the-art GCL methods or even some supervised GNNs, AFECL achieves SOTA performance on link prediction and semi-supervised node classification with extremely scarce labels. The source code is available at this https URL.
摘要：图对比学习（GCL）旨在以自监督的方式从未标记的图数据中学习表示，近年来发展迅速。然而，现有的大多数 GCL 方法并没有很好地探索边缘级对比。GCL 中的大多数研究在更新节点特征时仅将边作为辅助信息。基于边的 GCL 的主要障碍之一是计算负担过重。为了解决这个问题，我们提出了一个可以有效学习 GCL 边缘特征的模型，即无增强边缘对比学习（AFECL）来实现边边对比。AFECL 不依赖于由两部分组成的增强。首先，我们设计了一种新颖的边特征生成方法，其中边特征通过嵌入其连接节点的串联来计算。其次，开发了一种边对比学习方案，其中连接相同节点的边定义为正对，其他边定义为负对。实验结果表明，与最近最先进的 GCL 方法甚至一些监督 GNN 相比，AFECL 在极其稀缺标签的链接预测和半监督节点分类方面取得了 SOTA 性能。源代码可在此 https URL 上找到。
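
Note: the two parts of AFECL described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the released code: node embeddings are plain Python lists, the function names are hypothetical, and "edges connecting the same nodes" is read here as "edges that share at least one endpoint", which is only one possible interpretation of the abstract.

```python
def edge_features(node_emb, edges):
    """One feature vector per edge: the two endpoint embeddings concatenated."""
    return [node_emb[u] + node_emb[v] for u, v in edges]


def positive_pairs(edges):
    """Index pairs of edges that share an endpoint (assumed positive-pair rule)."""
    return [
        (a, b)
        for a in range(len(edges))
        for b in range(a + 1, len(edges))
        if set(edges[a]) & set(edges[b])
    ]


node_emb = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0], 3: [7.0, 8.0]}
edges = [(0, 1), (1, 2), (2, 3)]
print(edge_features(node_emb, edges))
print(positive_pairs(edges))  # every remaining index pair would be a negative
```

No graph augmentation is involved: both the features and the pair labels come directly from the original edge list.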

Title: DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Authors: Jinxiu Liu, Shaoheng Lin, Yinxiao Li, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11100
Pdf URL: https://arxiv.org/pdf/2412.11100
Copy Paste: [[2412.11100]] DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes(https://arxiv.org/abs/2412.11100)
Keywords: generation
Abstract: The increasing demand for immersive AR/VR applications and spatial intelligence has heightened the need to generate high-quality scene-level and 360° panoramic video. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose DynamicScaler, which addresses these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce an Offset Shifting Denoiser, which facilitates efficient, synchronous, and coherent denoising of panoramic dynamic scenes via a fixed-resolution diffusion model applied through a seamlessly rotating window. This ensures seamless boundary transitions and consistency across the entire panoramic space while accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate that our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with constant VRAM consumption regardless of the output video resolution. Our project page is available at this https URL.
摘要：对沉浸式 AR/VR 应用和空间智能的需求不断增长，这提高了生成高质量场景级和 360° 全景视频的需求。然而，大多数视频扩散模型受到分辨率和宽高比的限制，这限制了它们在场景级动态内容合成中的适用性。在这项工作中，我们提出了 DynamicScaler，通过实现空间可扩展和全景动态场景合成来解决这些挑战，该合成可以在任意大小的全景场景中保持连贯性。具体来说，我们引入了偏移移位降噪器，通过无缝旋转窗口通过具有固定分辨率的扩散模型实现高效、同步和连贯的全景动态场景降噪，从而确保整个全景空间的无缝边界过渡和一致性，适应不同的分辨率和宽高比。此外，我们采用了全局运动引导机制来确保局部细节保真度和全局运动连续性。大量实验表明，我们的方法在全景场景级视频生成中实现了卓越的内容和运动质量，提供了一种无需训练、高效且可扩展的解决方案，用于沉浸式动态场景创建，无论输出视频分辨率如何，VRAM 消耗都保持不变。我们的项目页面位于 \url{此 https URL}。

Title: Empowering LLMs to Understand and Generate Complex Vector Graphics

Authors: Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, Qian Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11102
Pdf URL: https://arxiv.org/pdf/2412.11102
Copy Paste: [[2412.11102]] Empowering LLMs to Understand and Generate Complex Vector Graphics(https://arxiv.org/abs/2412.11102)
Keywords: generation
Abstract: The unprecedented advancements in Large Language Models (LLMs) have profoundly impacted natural language processing but have yet to fully embrace the realm of scalable vector graphics (SVG) generation. While LLMs encode partial knowledge of SVG data from web pages during training, recent findings suggest that semantically ambiguous and tokenized representations within LLMs may result in hallucinations in vector primitive predictions. Additionally, LLM training typically lacks modeling and understanding of the rendering sequence of vector paths, which can lead to occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. LLM4SVG facilitates a deeper understanding of SVG components through learnable semantic tokens, which precisely encode these tokens and their corresponding properties to generate semantically aligned SVG outputs. Using a series of learnable semantic tokens, a structured dataset for instruction following is developed to support comprehension and generation across two primary tasks. Our method introduces a modular architecture to existing large language models, integrating semantic tags, vector instruction encoders, fine-tuned commands, and powerful LLMs to tightly combine geometric, appearance, and language information. To overcome the scarcity of SVG-text instruction data, we developed an automated data generation pipeline that collected a massive dataset of more than 250k SVG data and 580k SVG-text instructions, which facilitated the adoption of the two-stage training strategy popular in LLM development. By exploring various training strategies, we developed LLM4SVG, which significantly moves beyond optimized rendering-based approaches and language-model-based baselines to achieve remarkable results in human evaluation tasks.
摘要：大型语言模型 (LLM) 的空前进步对自然语言处理产生了深远影响，但尚未完全涵盖可缩放矢量图形 (SVG) 生成领域。虽然 LLM 在训练期间对来自网页的 SVG 数据的部分知识进行编码，但最近的研究结果表明，LLM 中语义模糊和标记化的表示可能会导致向量基元预测出现幻觉。此外，LLM 训练通常缺乏对向量路径渲染序列的建模和理解，这可能导致输出向量基元之间的遮挡。在本文中，我们介绍了 LLM4SVG，这是弥合这一差距的初步但实质性的一步，它使 LLM 能够更好地理解和生成矢量图形。LLM4SVG 通过可学习的语义标记促进对 SVG 组件的更深入理解，这些标记精确地编码这些标记及其相应的属性以生成语义对齐的 SVG 输出。使用一系列可学习的语义标记，开发了一个用于遵循指令的结构化数据集，以支持跨两个主要任务的理解和生成。我们的方法为现有的大型语言模型引入了模块化架构，集成了语义标签、矢量指令编码器、微调命令和强大的 LLM，以紧密结合几何、外观和语言信息。为了克服 SVG 文本指令数据的稀缺性，我们开发了一个自动数据生成管道，收集了超过 250k SVG 数据和 580k SVG 文本指令的海量数据集，这有助于采用 LLM 开发中流行的两阶段训练策略。通过探索各种训练策略，我们开发了 LLM4SVG，它大大超越了基于优化渲染的方法和基于语言模型的基线，在人工评估任务中取得了显著的成果。

Title: A Comprehensive Survey of Action Quality Assessment: Method and Benchmark

Authors: Kanglei Zhou, Ruizhi Cai, Liyuan Wang, Hubert P. H. Shum, Xiaohui Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11149
Pdf URL: https://arxiv.org/pdf/2412.11149
Copy Paste: [[2412.11149]] A Comprehensive Survey of Action Quality Assessment: Method and Benchmark(https://arxiv.org/abs/2412.11149)
Keywords: quality assessment
Abstract: Action Quality Assessment (AQA) quantitatively evaluates the quality of human actions, providing automated assessments that reduce biases in human judgment. Its applications span domains such as sports analysis, skill assessment, and medical care. Recent advances in AQA have introduced innovative methodologies, but similar methods often intertwine across different domains, highlighting the fragmented nature that hinders systematic reviews. In addition, the lack of a unified benchmark and limited computational comparisons hinder consistent evaluation and fair assessment of AQA approaches. In this work, we address these gaps by systematically analyzing over 150 AQA-related papers to develop a hierarchical taxonomy, construct a unified benchmark, and provide an in-depth analysis of current trends, challenges, and future directions. Our hierarchical taxonomy categorizes AQA methods based on input modalities (video, skeleton, multi-modal) and their specific characteristics, highlighting the evolution and interrelations across various approaches. To promote standardization, we present a unified benchmark, integrating diverse datasets to evaluate the assessment precision and computational efficiency. Finally, we review emerging task-specific applications and identify under-explored challenges in AQA, providing actionable insights into future research directions. This survey aims to deepen understanding of AQA progress, facilitate method comparison, and guide future innovations. The project web page can be found at this https URL.
摘要：动作质量评估 (AQA) 定量评估人类动作的质量，提供自动化评估以减少人类判断的偏见。其应用涵盖体育分析、技能评估和医疗保健等领域。AQA 的最新进展引入了创新方法，但类似的方法往往交织在不同领域，突出了阻碍系统评价的碎片化性质。此外，缺乏统一的基准和有限的计算比较阻碍了对 AQA 方法的一致评估和公平评估。在这项工作中，我们通过系统分析 150 多篇与 AQA 相关的论文来解决这些差距，以开发分层分类法，构建统一的基准，并对当前趋势、挑战和未来方向进行深入分析。我们的分层分类法根据输入模式（视频、骨架、多模式）及其特定特征对 AQA 方法进行分类，突出了各种方法之间的演变和相互关系。为了促进标准化，我们提出了一个统一的基准，整合了不同的数据集来评估评估精度和计算效率。最后，我们回顾了新兴的任务特定应用，并确定了 AQA 中尚未充分探索的挑战，为未来的研究方向提供了可行的见解。这项调查旨在加深对 AQA 进展的理解，促进方法比较，并指导未来的创新。项目网页可在此 https URL 中找到。

Title: OTLRM: Orthogonal Learning-based Low-Rank Metric for Multi-Dimensional Inverse Problems

Authors: Xiangming Wang, Haijin Zeng, Jiaoyang Chen, Sheng Liu, Yongyong Chen, Guoqing Chao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11165
Pdf URL: https://arxiv.org/pdf/2412.11165
Copy Paste: [[2412.11165]] OTLRM: Orthogonal Learning-based Low-Rank Metric for Multi-Dimensional Inverse Problems(https://arxiv.org/abs/2412.11165)
Keywords: restoration, generative
Abstract: In real-world scenarios, complex data such as multispectral images and multi-frame videos inherently exhibit a robust low-rank property. This property is vital for multi-dimensional inverse problems, such as tensor completion, spectral imaging reconstruction, and multispectral image denoising. Existing tensor singular value decomposition (t-SVD) definitions rely on hand-designed or pre-given transforms, which lack flexibility for defining the tensor nuclear norm (TNN). The TNN-regularized optimization problem is solved by the singular value thresholding (SVT) operator, which leverages the t-SVD framework to obtain the low-rank tensor. However, it is quite complicated to introduce SVT into deep neural networks due to numerical instability when computing the derivatives of the eigenvectors. In this paper, we introduce a novel data-driven generative low-rank t-SVD model based on a learnable orthogonal transform, which can be naturally solved under its representation. Prompted by the linear algebra theorem of the Householder transformation, our learnable orthogonal transform is achieved by constructing an endogenously orthogonal matrix adaptable to neural networks, which can be optimized to represent arbitrary orthogonal matrices. Additionally, we propose a low-rank solver as a generalization of SVT, which utilizes an efficient representation of generative networks to obtain low-rank structures. Extensive experiments highlight its significant restoration enhancements.
摘要：在现实场景中，多光谱图像和多帧视频等复杂数据本身就表现出鲁棒的低秩属性。此属性对于多维逆问题至关重要，例如张量补全、光谱成像重建和多光谱图像去噪。现有的张量奇异值分解（t-SVD）定义依赖于手工设计或预先给定的变换，缺乏定义张量核范数（TNN）的灵活性。TNN 正则化的优化问题由奇异值阈值（SVT）算子解决，该算子利用 t-SVD 框架来获得低秩张量。然而，由于求解特征向量导数的数值不稳定性问题，将 SVT 引入深度神经网络相当复杂。在本文中，我们介绍了一种基于可学习正交变换的新型数据驱动的生成低秩 t-SVD 模型，该模型可以在其表示下自然求解。受 Householder 变换的线性代数定理启发，我们通过构建一个适用于神经网络的内生正交矩阵，将其优化为任意正交矩阵，实现了可学习的正交变换。此外，我们提出了一个低秩求解器作为 SVT 的泛化，它利用生成网络的有效表示来获得低秩结构。大量实验凸显了其显著的恢复增强效果。
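
The abstract's appeal to Householder transformations can be illustrated with a minimal sketch. It is not the authors' construction, only the standard fact that a product of Householder reflections is orthogonal by construction, so the unconstrained vectors can be optimized freely. The function name `householder_orthogonal` and the random `params` are hypothetical and chosen for illustration.

```python
import numpy as np

def householder_orthogonal(vectors):
    """Multiply the reflections H_k = I - 2 v_k v_k^T / (v_k^T v_k), one per row of `vectors`."""
    n = vectors.shape[1]
    q = np.eye(n)
    for v in vectors:
        # Each reflection is orthogonal, so the running product stays orthogonal.
        h = np.eye(n) - 2.0 * np.outer(v, v) / np.dot(v, v)
        q = q @ h
    return q

rng = np.random.default_rng(0)
params = rng.standard_normal((4, 4))  # hypothetical unconstrained parameters, one vector per row
Q = householder_orthogonal(params)

# Q^T Q should equal the identity up to floating-point error.
orthogonality_error = np.abs(Q.T @ Q - np.eye(4)).max()
print(orthogonality_error)
```

Because orthogonality holds for any nonzero vectors, a gradient-based optimizer could update `params` without a separate projection or re-orthogonalization step, which is presumably what makes this parameterization convenient inside a neural network.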

Title: Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Authors: Yujie Zhang, Bingyang Cui, Qi Yang, Zhu Li, Yiling Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11170
Pdf URL: https://arxiv.org/pdf/2412.11170
Copy Paste: [[2412.11170]] Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation(https://arxiv.org/abs/2412.11170)
Keywords: generation, quality assessment
Abstract: Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging for two reasons: i) Existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions. ii) Previous evaluation metrics only focus on a single aspect (e.g., text-3D alignment) and fail to perform multi-dimensional quality assessment. To address these problems, we first propose a comprehensive benchmark named MATE-3D. The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes. We have conducted a large-scale subjective experiment from four different evaluation dimensions and collected 107,520 annotations, followed by detailed analyses of the results. Based on MATE-3D, we propose a novel quality evaluator named HyperScore. Utilizing hypernetwork to generate specified mapping functions for each evaluation dimension, our metric can effectively perform multi-dimensional quality assessment. HyperScore presents superior performance over existing metrics on MATE-3D, making it a promising metric for assessing and improving text-to-3D generation. The project is available at this https URL.
摘要：近年来，文本到 3D 生成取得了显著进展，但评估这些方法仍然具有挑战性，原因有二：i）现有基准缺乏对不同提示类别和评估维度的细粒度评估。ii）以前的评估指标仅关注单一方面（例如，文本-3D 对齐），无法执行多维质量评估。为了解决这些问题，我们首先提出了一个名为 MATE-3D 的综合基准。该基准包含八个精心设计的提示类别，涵盖单个和多个对象生成，最终生成了 1,280 个纹理网格。我们从四个不同的评估维度进行了一项大规模主观实验，收集了 107,520 条注释，然后对结果进行了详细分析。基于 MATE-3D，我们提出了一种名为 HyperScore 的新型质量评估器。利用超网络为每个评估维度生成指定的映射函数，我们的指标可以有效地执行多维质量评估。 HyperScore 在 MATE-3D 上的表现优于现有指标，使其成为评估和改进文本到 3D 生成的有前途的指标。该项目可在此 https URL 上找到。
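
The benchmark sizes quoted in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that meshes are spread evenly over prompt categories and annotations evenly over meshes and dimensions; the abstract does not state this, so the per-item figures are only illustrative.

```python
num_categories = 8         # prompt categories (from the abstract)
num_meshes = 1280          # generated textured meshes (from the abstract)
num_dimensions = 4         # evaluation dimensions (from the abstract)
num_annotations = 107_520  # collected annotations (from the abstract)

# Assumption: an even split across categories, meshes, and dimensions.
meshes_per_category = num_meshes / num_categories
annotations_per_mesh = num_annotations / num_meshes
annotations_per_mesh_per_dimension = annotations_per_mesh / num_dimensions

print(meshes_per_category)                 # 160.0
print(annotations_per_mesh)                # 84.0
print(annotations_per_mesh_per_dimension)  # 21.0
```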

Title: OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation

Authors: Bohan Li, Xin Jin, Jianan Wang, Yukai Shi, Yasheng Sun, Xiaofeng Wang, Zhuang Ma, Baao Xie, Chao Ma, Xiaokang Yang, Wenjun Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11183
Pdf URL: https://arxiv.org/pdf/2412.11183
Copy Paste: [[2412.11183]] OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation(https://arxiv.org/abs/2412.11183)
Keywords: generation
Abstract: Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, acting as a data augmenter to generate synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect. OccScene generates new and consistent 3D realistic scenes only depending on text prompts, guided with semantic occupancy in a joint-training diffusion framework. To align the occupancy with the diffusion latent, a Mamba-based Dual Alignment module is introduced to incorporate fine-grained semantics and geometry as perception priors. Within OccScene, the perception module can be effectively improved with customized and diverse generated scenes, while the perception priors in return enhance the generation performance for mutual benefits. Extensive experiments show that OccScene achieves realistic 3D scene generation in broad indoor and outdoor scenarios, while concurrently boosting the perception models to achieve substantial performance improvements in the 3D perception task of semantic occupancy prediction.
摘要：最近的扩散模型在 3D 场景生成和感知任务中都表现出色。然而，现有的方法通常将这两个过程分开，充当数据增强器来生成合成数据用于下游感知任务。在这项工作中，我们提出了 OccScene，一种新颖的相互学习范式，它将细粒度的 3D 感知和高质量生成集成在一个统一的框架中，实现跨任务的双赢效果。OccScene 仅根据文本提示生成新的、一致的 3D 逼真场景，并在联合训练扩散框架中使用语义占用率进行引导。为了将占用率与扩散潜变量对齐，引入了基于 Mamba 的双重对齐模块，将细粒度的语义和几何作为感知先验。在 OccScene 中，可以通过定制和多样化的生成场景有效地改进感知模块，而感知先验反过来又提高了生成性能，实现互利互惠。大量实验表明，OccScene 在广泛的室内和室外场景中实现了逼真的 3D 场景生成，同时同时增强了感知模型，在语义占用预测的 3D 感知任务中实现了显着的性能提升。

Title: Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation

Authors: Ling-An Zeng, Guohong Huang, Gaojie Wu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11193
Pdf URL: https://arxiv.org/pdf/2412.11193
Copy Paste: [[2412.11193]] Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation(https://arxiv.org/abs/2412.11193)
Keywords: generation
Abstract: Despite the significant role text-to-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs. To address this, we aim to design a lightweight model to reduce usage costs. First, unlike existing works that focus solely on global information modeling, we recognize the importance of local information modeling in the T2M task by reconsidering the intrinsic properties of human motion, leading us to propose a lightweight Local Information Modeling Module. Second, we introduce Mamba to the T2M task, reducing the number of parameters and GPU memory demands, and we have designed a novel Pseudo-bidirectional Scan to replicate the effects of a bidirectional scan without increasing parameter count. Moreover, we propose a novel Adaptive Textual Information Injector that more effectively integrates textual information into the motion during generation. By integrating the aforementioned designs, we propose a lightweight and fast model named Light-T2M. Compared to the state-of-the-art method, MoMask, our Light-T2M model features just 10% of the parameters (4.48M vs 44.85M) and achieves a 16% faster inference time (0.152s vs 0.180s), while surpassing MoMask with an FID of 0.040 (vs. 0.045) on HumanML3D dataset and 0.161 (vs. 0.228) on KIT-ML dataset. The code is available at this https URL.
摘要：尽管文本转运动 (T2M) 生成在各种应用中发挥着重要作用，但当前的方法涉及大量参数，并且推理速度慢，导致使用成本高昂。为了解决这个问题，我们旨在设计一个轻量级模型来降低使用成本。首先，与仅关注全局信息建模的现有研究不同，我们通过重新考虑人体运动的内在属性，认识到局部信息建模在 T2M 任务中的重要性，这促使我们提出了一个轻量级的局部信息建模模块。其次，我们将 Mamba 引入 T2M 任务，减少了参数数量和 GPU 内存需求，并且我们设计了一种新颖的伪双向扫描来复制双向扫描的效果而不增加参数数量。此外，我们提出了一种新颖的自适应文本信息注入器，可以在生成过程中更有效地将文本信息集成到运动中。通过整合上述设计，我们提出了一个名为 Light-T2M 的轻量级快速模型。与最先进的方法 MoMask 相比，我们的 Light-T2M 模型仅具有 10% 的参数（4.48M vs 44.85M），推理时间却快了 16%（0.152 秒 vs 0.180 秒），同时在 HumanML3D 数据集上的 FID 为 0.040（vs. 0.045），在 KIT-ML 数据集上的 FID 为 0.161（vs. 0.228），均超过了 MoMask。代码可从此 https URL 获取。
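
The headline percentages in the abstract follow directly from the raw figures it quotes. The sketch below recomputes them; the variable names are illustrative, and reading "16% faster" as a relative reduction in inference time is an interpretation, not something the abstract spells out.

```python
# Figures quoted in the abstract (parameters in millions, time in seconds).
light_params_m, momask_params_m = 4.48, 44.85
light_time_s, momask_time_s = 0.152, 0.180

param_ratio = light_params_m / momask_params_m                   # about 0.0999, i.e. roughly 10%
time_reduction = (momask_time_s - light_time_s) / momask_time_s  # about 0.156, i.e. roughly 16%

# Relative FID improvement (lower FID is better).
fid_gain_humanml3d = (0.045 - 0.040) / 0.045  # about 0.111
fid_gain_kit_ml = (0.228 - 0.161) / 0.228     # about 0.294

print(f"parameters: {param_ratio:.1%} of MoMask")
print(f"inference time reduced by {time_reduction:.1%}")
print(f"FID reduced by {fid_gain_humanml3d:.1%} (HumanML3D) and {fid_gain_kit_ml:.1%} (KIT-ML)")
```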

Title: GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Authors: Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, Alexandre Alahi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11198
Pdf URL: https://arxiv.org/pdf/2412.11198
Copy Paste: [[2412.11198]] GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control(https://arxiv.org/abs/2412.11198)
Keywords: generation
Abstract: We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights. Pseudo-labels are used to get depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show GEM excels at generating diverse, controllable scenarios and temporal consistency over long generations. Code, models, and datasets are fully open-sourced.
摘要：我们提出了 GEM，这是一种可推广的自我视觉多模态世界模型，它使用参考帧、稀疏特征、人体姿势和自我轨迹来预测未来帧。因此，我们的模型可以精确控制物体动态、自我代理运动和人体姿势。GEM 生成成对的 RGB 和深度输出，以实现更丰富的空间理解。我们引入了自回归噪声计划，以实现稳定的长期生成。我们的数据集包含 4000 多个小时的多模态数据，涉及自动驾驶、以自我为中心的人类活动和无人机飞行等领域。伪标签用于获取深度图、自我轨迹和人体姿势。我们使用一个全面的评估框架，包括一个新的控制对象操纵 (COM) 指标，来评估可控性。实验表明，GEM 擅长生成多样化、可控的场景和长期的时序一致性。代码、模型和数据集都是完全开源的。

Title: GenLit: Reformulating Single-Image Relighting as Video Generation

Authors: Shrisha Bharadwaj, Haiwen Feng, Victoria Abrevaya, Michael J. Black
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.11224
Pdf URL: https://arxiv.org/pdf/2412.11224
Copy Paste: [[2412.11224]] GenLit: Reformulating Single-Image Relighting as Video Generation(https://arxiv.org/abs/2412.11224)
Keywords: generation
Abstract: Manipulating the illumination within a single image represents a fundamental challenge in computer vision and graphics. This problem has been traditionally addressed using inverse rendering techniques, which require explicit 3D asset reconstruction and costly ray tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be practical and possible: one that replaces explicit physical models with networks that are trained on massive amounts of image and video data. In this paper, we explore the potential of exploiting video diffusion models, and in particular Stable Video Diffusion (SVD), in understanding the physical world to perform relighting tasks given a single image. Specifically, we introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate the results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset (270 objects) is able to generalize to real images, enabling single-image relighting with realistic ray tracing effects and cast shadows. These results reveal the ability of video foundation models to capture rich information about lighting, material, and shape. Our findings suggest that such models, with minimal training, can be used for physically-based rendering without explicit physical asset reconstruction and complex ray tracing. This further suggests the potential of such models for controllable and physically accurate image synthesis tasks.
摘要：在单幅图像中操纵照明是计算机视觉和图形学中的一项基本挑战。这个问题传统上是使用逆向渲染技术来解决的，这需要明确的 3D 资产重建和昂贵的光线追踪模拟。同时，视觉基础模型的最新进展表明，一种新的范式可能很快就会变得实用和可行——用在大量图像和视频数据上训练的网络取代明确的物理模型。在本文中，我们探索了利用视频扩散模型，特别是稳定视频扩散 (SVD)，在理解物理世界以给定单幅图像执行重新照明任务方面的潜力。具体来说，我们引入了 GenLit，这是一个将图形引擎执行光线操纵的能力提炼到视频生成模型中的框架，使用户能够直接在给定图像中的 3D 世界中插入和操纵点光源，并将结果直接生成为视频序列。我们发现，仅对小型合成数据集（270 个对象）进行微调的模型能够推广到真实图像，从而实现具有逼真的光线追踪效果和投射阴影的单幅图像重新照明。这些结果揭示了视频基础模型能够捕获有关照明、材质和形状的丰富信息。我们的研究结果表明，此类模型只需进行最少的训练，即可用于基于物理的渲染，而无需显式物理资产重建和复杂的光线追踪。这进一步表明此类模型具有可控且物理精确的图像合成任务的潜力。

Title: On the Generalizability of Iterative Patch Selection for Memory-Efficient High-Resolution Image Classification

Authors: Max Riffi-Aslett, Christina Fell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11237
Pdf URL: https://arxiv.org/pdf/2412.11237
Copy Paste: [[2412.11237]] On the Generalizability of Iterative Patch Selection for Memory-Efficient High-Resolution Image Classification(https://arxiv.org/abs/2412.11237)
Keywords: generation
Abstract: Classifying large images with small or tiny regions of interest (ROI) is challenging due to computational and memory constraints. Weakly supervised memory-efficient patch selectors have achieved results comparable with strongly supervised methods. However, low signal-to-noise ratios and low entropy attention still cause overfitting. We explore these issues using a novel testbed on a memory-efficient cross-attention transformer with Iterative Patch Selection (IPS) as the patch selection module. Our testbed extends the megapixel MNIST benchmark to four smaller O2I (object-to-image) ratios ranging from 0.01% to 0.14% while keeping the canvas size fixed and introducing a noise generation component based on Bézier curves. Experimental results generalize the observations made on CNNs to IPS whereby the O2I threshold below which the classifier fails to generalize is affected by the training dataset size. We further observe that the magnitude of this interaction differs for each task of the Megapixel MNIST. For tasks "Maj" and "Top", the rate is at its highest, followed by tasks "Max" and "Multi" where in the latter, this rate is almost at 0. Moreover, results show that in a low data setting, tuning the patch size to be smaller relative to the ROI improves generalization, resulting in an improvement of + 15% for the megapixel MNIST and + 5% for the Swedish traffic signs dataset compared to the original object-to-patch ratios in IPS. Further outcomes indicate that the similarity between the thickness of the noise component and the digits in the megapixel MNIST gradually causes IPS to fail to generalize, contributing to previous suspicions.
摘要：由于计算和内存限制，对具有较小或极小感兴趣区域 (ROI) 的大图像进行分类具有挑战性。弱监督的内存高效补丁选择器已实现与强监督方法相当的结果。但是，低信噪比和低熵注意力仍然会导致过度拟合。我们使用一种新颖的测试平台来探索这些问题，该测试平台基于内存高效的交叉注意力转换器，以迭代补丁选择 (IPS) 作为补丁选择模块。我们的测试平台将百万像素 MNIST 基准扩展到四个较小的 O2I（对象与图像）比率，范围从 0.01% 到 0.14%，同时保持画布大小固定并引入基于贝塞尔曲线的噪声生成组件。实验结果将对 CNN 的观察结果推广到 IPS，其中分类器无法推广的 O2I 阈值受训练数据集大小的影响。我们进一步观察到，这种相互作用的幅度对于百万像素 MNIST 的每个任务都不同。对于任务“Maj”和“Top”，该比率最高，其次是任务“Max”和“Multi”，后者的比率几乎为 0。此外，结果显示，在低数据设置下，将补丁大小调整为相对于 ROI 更小可提高泛化能力，与 IPS 中原始对象与补丁比率相比，百万像素 MNIST 的改进为 + 15%，瑞典交通标志数据集的改进为 + 5%。进一步的结果表明，噪声成分的厚度与百万像素 MNIST 中的数字之间的相似性逐渐导致 IPS 无法泛化，从而加剧了之前的怀疑。

Title: Wasserstein Bounds for generative diffusion models with Gaussian tail targets

Authors: Xixian Wang, Zhongjian Wang
Subjects: cs.LG, math.AP, math.NA
Abstract URL: https://arxiv.org/abs/2412.11251
Pdf URL: https://arxiv.org/pdf/2412.11251
Copy Paste: [[2412.11251]] Wasserstein Bounds for generative diffusion models with Gaussian tail targets(https://arxiv.org/abs/2412.11251)
Keywords: generation, generative
Abstract: We present an estimate of the Wasserstein distance between the data distribution and the generation of score-based generative models, assuming an $\epsilon$-accurate approximation of the score and a Gaussian-type tail behavior of the data distribution. The complexity bound in dimension is $O(\sqrt{d})$, with a logarithmic constant. Such a Gaussian tail assumption applies to the distribution of a compactly supported target with an early stopping technique and to the Bayesian posterior with a bounded observation operator. Corresponding convergence and complexity bounds are derived. The crux of the analysis lies in the Lipschitz bound of the score, which is related to the Hessian estimate of a viscous Hamilton-Jacobi equation (vHJ). The latter is demonstrated by employing a dimension-independent kernel estimate. Consequently, our complexity bound scales linearly (up to a logarithmic constant) with the square root of the trace of the covariance operator, which relates to the invariant distribution of the forward process. Our analysis also extends to the probabilistic flow ODE as the sampling process.
摘要：我们给出了数据分布与基于分数的生成模型生成之间的 Wasserstein 距离的估计，假设分数的近似值为 $\epsilon$ 精确度，且数据分布的尾部行为为高斯型。复杂度关于维度的界限为 $O(\sqrt{d})$，并带有一个对数常数。这种高斯尾部假设适用于具有早期停止技术的紧凑支持目标分布和具有有界观测算子的贝叶斯后验。推导出相应的收敛和复杂度界限。分析的关键在于分数的 Lipschitz 界限，它与粘性 Hamilton-Jacobi 方程 (vHJ) 的 Hessian 估计有关。后者通过采用与维度无关的核估计来证明。因此，我们的复杂度界限与协方差算子的迹的平方根线性相关（相差一个对数常数），这与前向过程的不变分布有关。我们的分析还扩展到概率流 ODE，作为采样过程。

Title: Detecting Daily Living Gait Amid Huntington's Disease Chorea using a Foundation Deep Learning Model

Authors: Dafna Schwartz, Lori Quinn, Nora E. Fritz, Lisa M. Muratori, Jeffery M. Hausdorff, Ran Gilad Bachrach
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11286
Pdf URL: https://arxiv.org/pdf/2412.11286
Copy Paste: [[2412.11286]] Detecting Daily Living Gait Amid Huntington's Disease Chorea using a Foundation Deep Learning Model(https://arxiv.org/abs/2412.11286)
Keywords: generative
Abstract: Wearable sensors offer a non-invasive way to collect physical activity (PA) data, with walking as a key component. Existing models often struggle to detect gait bouts in individuals with neurodegenerative diseases (NDDs) involving involuntary movements. We developed J-Net, a deep learning model inspired by U-Net, which uses a pre-trained self-supervised foundation model fine-tuned with Huntington's disease (HD) in-lab data and paired with a segmentation head for gait detection. J-Net processes wrist-worn accelerometer data to detect gait during daily living. We evaluated J-Net on in-lab and daily-living data from HD, Parkinson's disease (PD), and controls. J-Net achieved a 10-percentage-point improvement in ROC-AUC for HD over existing methods, reaching 0.97 for in-lab data. In daily-living environments, J-Net estimates showed no significant differences in median daily walking time between HD and controls (p = 0.23), in contrast to other models, which indicated counterintuitive results (p < 0.005). Walking time measured by J-Net correlated with the UHDRS-TMS clinical severity score (r=-0.52; p=0.02), confirming its clinical relevance. Fine-tuning J-Net on PD data also improved gait detection over current methods. J-Net's architecture effectively addresses the challenges of gait detection in severe chorea and offers robust performance in daily living. The dataset and J-Net model are publicly available, providing a resource for further research into NDD-related gait impairments.
摘要：可穿戴传感器提供了一种非侵入式的方式来收集身体活动 (PA) 数据，其中步行是关键组成部分。现有模型通常难以检测患有神经退行性疾病 (NDD) 且涉及不自主运动的个体的步态发作。我们开发了 J-Net，这是一种受 U-Net 启发的深度学习模型，它使用预先训练的自监督基础模型，该模型使用亨廷顿氏病 (HD) 实验室内数据进行微调，并与分割头配对以进行步态检测。J-Net 处理腕戴式加速度计数据以检测日常生活中的步态。我们根据来自亨廷顿氏病、帕金森氏病 (PD) 和对照组的实验室内和日常生活数据对 J-Net 进行了评估。与现有方法相比，J-Net 将亨廷顿氏病的 ROC-AUC 提高了 10 个百分点，实验室内数据达到 0.97。在日常生活环境中，J-Net 估计显示 HD 患者和对照组之间的每日平均步行时间没有显著差异（p = 0.23），而其他模型则得出了违反直觉的结果（p < 0.005）。J-Net 测量的步行时间与 UHDRS-TMS 临床严重程度评分相关（r=-0.52；p=0.02），证实了其临床相关性。在 PD 数据上对 J-Net 进行微调也比当前方法改进了步态检测。J-Net 的架构有效地解决了严重舞蹈症中的步态检测挑战，并在日常生活中提供了强大的性能。数据集和 J-Net 模型是公开的，为进一步研究与 NDD 相关的步态障碍提供了资源。

Title: Grassmannian Geometry Meets Dynamic Mode Decomposition in DMD-GEN: A New Metric for Mode Collapse in Time Series Generative Models

Authors: Amime Mohamed Aboussalah, Yassine Abbahaddou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11292
Pdf URL: https://arxiv.org/pdf/2412.11292
Copy Paste: [[2412.11292]] Grassmannian Geometry Meets Dynamic Mode Decomposition in DMD-GEN: A New Metric for Mode Collapse in Time Series Generative Models(https://arxiv.org/abs/2412.11292)
Keywords: generation, generative
Abstract: Generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) often fail to capture the full diversity of their training data, leading to mode collapse. While this issue is well-explored in image generation, it remains underinvestigated for time series data. We introduce a new definition of mode collapse specific to time series and propose a novel metric, DMD-GEN, to quantify its severity. Our metric utilizes Dynamic Mode Decomposition (DMD), a data-driven technique for identifying coherent spatiotemporal patterns, and employs Optimal Transport between DMD eigenvectors to assess discrepancies between the underlying dynamics of the original and generated data. This approach not only quantifies the preservation of essential dynamic characteristics but also provides interpretability by pinpointing which modes have collapsed. We validate DMD-GEN on both synthetic and real-world datasets using various generative models, including TimeGAN, TimeVAE, and DiffusionTS. The results demonstrate that DMD-GEN correlates well with traditional evaluation metrics for static data while offering the advantage of applicability to dynamic data. This work offers for the first time a definition of mode collapse for time series, improving understanding, and forming the basis of our tool for assessing and improving generative models in the time series domain.
摘要：生成对抗网络 (GAN) 和变分自编码器 (VAE) 等生成模型通常无法捕获训练数据的全部多样性，从而导致模式崩溃。虽然这个问题在图像生成中得到了充分研究，但对于时间序列数据，它仍然没有得到充分研究。我们引入了特定于时间序列的模式崩溃的新定义，并提出了一种新颖的指标 DMD-GEN 来量化其严重程度。我们的指标利用动态模式分解 (DMD)，这是一种用于识别连贯时空模式的数据驱动技术，并采用 DMD 特征向量之间的最佳传输来评估原始数据和生成数据的底层动态之间的差异。这种方法不仅可以量化基本动态特征的保存，还可以通过精确定位哪些模式已经崩溃来提供可解释性。我们使用各种生成模型（包括 TimeGAN、TimeVAE 和 DiffusionTS）在合成和真实世界数据集上验证 DMD-GEN。结果表明，DMD-GEN 与传统的静态数据评估指标具有良好的相关性，同时具有适用于动态数据的优势。这项工作首次提出了时间序列模式崩溃的定义，提高了理解能力，并为我们评估和改进时间序列领域生成模型的工具奠定了基础。
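
Dynamic Mode Decomposition, the building block named in the abstract, can be sketched in a few lines. This is the textbook exact-DMD procedure applied to a toy linear system, not the DMD-GEN metric itself (which additionally compares DMD eigenvectors via Optimal Transport). The function `exact_dmd`, the matrix `A`, and the toy trajectory are assumptions made for illustration.

```python
import numpy as np

def exact_dmd(snapshots, rank):
    """Exact DMD for a (features x time) snapshot matrix; returns eigenvalues and modes."""
    x, y = snapshots[:, :-1], snapshots[:, 1:]     # pairs (x_t, x_{t+1})
    u, s, vh = np.linalg.svd(x, full_matrices=False)
    u, s, vh = u[:, :rank], s[:rank], vh[:rank]    # truncate to the chosen rank
    y_v_sinv = (y @ vh.conj().T) / s               # Y V S^{-1} (column-wise division)
    a_tilde = u.conj().T @ y_v_sinv                # reduced operator U^H Y V S^{-1}
    eigvals, w = np.linalg.eig(a_tilde)
    modes = y_v_sinv @ w                           # exact DMD modes
    return eigvals, modes

# Toy data: a decaying rotation x_{t+1} = A x_t with eigenvalues 0.9 * exp(+-0.3i).
theta = 0.3
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(19):
    states.append(A @ states[-1])
snapshots = np.stack(states, axis=1)               # shape (2, 20)

eigvals, modes = exact_dmd(snapshots, rank=2)
print(np.abs(eigvals), np.angle(eigvals))
```

On this noise-free linear trajectory the recovered eigenvalues match those of `A`, with magnitude 0.9 and phase ±0.3. A comparison between real and generated series would run this decomposition on each and then measure how far apart the resulting modes are.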

Title: One-Shot Multilingual Font Generation Via ViT

Authors: Zhiheng Wang, Jiarui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11342
Pdf URL: https://arxiv.org/pdf/2412.11342
Copy Paste: [[2412.11342]] One-Shot Multilingual Font Generation Via ViT(https://arxiv.org/abs/2412.11342)
Keywords: generation
Abstract: Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean (CJK), where thousands of unique characters must be individually crafted. This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation, effectively addressing the complexities of both logographic and alphabetic scripts. By leveraging ViT and pretraining with a strong visual pretext task (Masked Autoencoding, MAE), our model eliminates the need for complex design components in prior frameworks while achieving comprehensive results with enhanced generalizability. Remarkably, it can generate high-quality fonts across multiple languages for unseen, unknown, and even user-crafted characters. Additionally, we integrate a Retrieval-Augmented Guidance (RAG) module to dynamically retrieve and adapt style references, improving scalability and real-world applicability. We evaluated our approach in various font generation tasks, demonstrating its effectiveness, adaptability, and scalability.
摘要：字体设计对中文、日文和韩文 (CJK) 等表意文字语言提出了独特的挑战，因为这些语言中必须单独制作数千个独特的字符。本文介绍了一种基于 Vision Transformer (ViT) 的新型多语言字体生成模型，有效地解决了表意文字和字母脚本的复杂性。通过利用 ViT 并使用强大的视觉借口任务 (Masked Autoencoding, MAE) 进行预训练，我们的模型消除了先前框架中对复杂设计组件的需求，同时实现了全面的结果和增强的通用性。值得注意的是，它可以为多种语言中看不见的、未知的甚至用户制作的字符生成高质量的字体。此外，我们集成了一个检索增强指导 (RAG) 模块来动态检索和调整样式参考，从而提高了可扩展性和现实世界的适用性。我们在各种字体生成任务中评估了我们的方法，证明了它的有效性、适应性和可扩展性。

Title: Adapting Segment Anything Model (SAM) to Experimental Datasets via Fine-Tuning on GAN-based Simulation: A Case Study in Additive Manufacturing

Authors: Anika Tabassum, Amirkoushyar Ziabari
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.11381
Pdf URL: https://arxiv.org/pdf/2412.11381
Copy Paste: [[2412.11381]] Adapting Segment Anything Model (SAM) to Experimental Datasets via Fine-Tuning on GAN-based Simulation: A Case Study in Additive Manufacturing(https://arxiv.org/abs/2412.11381)
Keywords: generative
Abstract: Industrial X-ray computed tomography (XCT) is a powerful tool for non-destructive characterization of materials and manufactured components. XCT is commonly accompanied by advanced image analysis and computer vision algorithms to extract relevant information from the images. Traditional computer vision models often struggle due to noise, resolution variability, and complex internal structures, particularly in scientific imaging applications. State-of-the-art foundational models, like the Segment Anything Model (SAM), which is designed for general-purpose image segmentation, have revolutionized image segmentation across various domains, yet their application in specialized fields like materials science remains under-explored. In this work, we explore the application and limitations of SAM for industrial X-ray CT inspection of additive manufacturing components. We demonstrate that while SAM shows promise, it struggles with out-of-distribution data, multiclass segmentation, and computational efficiency during fine-tuning. To address these issues, we propose a fine-tuning strategy utilizing parameter-efficient techniques, specifically Conv-LoRa, to adapt SAM for material-specific datasets. Additionally, we leverage generative adversarial network (GAN)-generated data to enhance the training process and improve the model's segmentation performance on complex X-ray CT data. Our experimental results highlight the importance of tailored segmentation models for accurate inspection, showing that fine-tuning SAM on domain-specific scientific imaging data significantly improves performance. However, despite improvements, the model's ability to generalize across diverse datasets remains limited, highlighting the need for further research into robust, scalable solutions for domain-specific segmentation tasks.
摘要：工业 X 射线计算机断层扫描 (XCT) 是一种强大的工具，可用于对材料和制造的组件进行无损表征。XCT 通常伴随着先进的图像分析和计算机视觉算法，以从图像中提取相关信息。传统的计算机视觉模型通常会因噪声、分辨率变化和复杂的内部结构而出现问题，尤其是在科学成像应用中。最先进的基础模型，如专为通用图像分割而设计的 Segment Anything 模型 (SAM)，已经彻底改变了各个领域的图像分割，但它们在材料科学等专业领域的应用仍未得到充分探索。在这项工作中，我们探索了 SAM 在增材制造组件的工业 X 射线 CT 检查中的应用和局限性。我们表明，虽然 SAM 很有前途，但它在微调过程中面临着分布不均的数据、多类分割和计算效率的问题。为了解决这些问题，我们提出了一种微调策略，利用参数高效的技术，特别是 Conv-LoRa，使 SAM 适应特定于材料的数据集。此外，我们利用生成对抗网络 (GAN) 生成的数据来增强训练过程并提高模型在复杂 X 射线 CT 数据上的分割性能。我们的实验结果强调了定制分割模型对于准确检查的重要性，表明对特定领域的科学成像数据微调 SAM 可显著提高性能。然而，尽管有所改进，但该模型在不同数据集中的推广能力仍然有限，这凸显了需要进一步研究针对特定领域分割任务的稳健、可扩展的解决方案。

Title: Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

Authors: Antonio Carlos Rivera, Anthony Moore, Steven Robinson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11396
Pdf URL: https://arxiv.org/pdf/2412.11396
Copy Paste: [[2412.11396]] Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes(https://arxiv.org/abs/2412.11396)
Keywords: generative
Abstract: Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our ablation studies highlight the importance of retrieval-augmented tags and contrastive learning, while human evaluations confirm VRAP's ability to generate accurate, detailed, and contextually relevant responses. Notably, VRAP achieves a 40% reduction in inference latency by eliminating runtime retrieval. These results demonstrate that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.
摘要：视觉语言任务中的对象感知推理对当前模型提出了重大挑战，特别是在处理看不见的物体、减少幻觉和捕捉复杂视觉场景中的细粒度关系方面。为了解决这些限制，我们提出了视觉感知检索增强提示 (VRAP) 框架，这是一种生成方法，通过将检索增强对象标签集成到其提示中来增强大型视觉语言模型 (LVLM)。VRAP 引入了一种新颖的管道，其中使用预训练的视觉编码器和场景图解析器提取结构化标签，包括对象、属性和关系。这些标签通过外部知识丰富并纳入 LLM 的输入，从而实现详细而准确的推理。我们在多个视觉语言基准（包括 VQAv2、GQA、VizWiz 和 COCO）上评估 VRAP，在细粒度推理和多模态理解方面取得了最先进的性能。此外，我们的消融研究强调了检索增强标签和对比学习的重要性，而人工评估则证实了 VRAP 能够生成准确、详细且与上下文相关的响应。值得注意的是，VRAP 通过消除运行时检索将推理延迟减少了 40%。这些结果表明，VRAP 是一个强大而高效的框架，可用于推进对象感知多模态推理。

Title: Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Authors: Xiaochong Dong, Jun Dan, Yingyun Sun, Yang Liu, Xuemin Zhang, Shengwei Mei
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2412.11399
Pdf URL: https://arxiv.org/pdf/2412.11399
Copy Paste: [[2412.11399]] Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model(https://arxiv.org/abs/2412.11399)
Keywords: super-resolution, generation, generative
Abstract: Driven by global climate change and the ongoing energy transition, the coupling between power supply capabilities and meteorological factors has become increasingly significant. Over the long term, accurately quantifying the power generation capacity of renewable energy under the influence of climate change is essential for the development of sustainable power systems. However, due to interdisciplinary differences in data requirements, climate data often lacks the necessary hourly resolution to capture the short-term variability and uncertainties of renewable energy resources. To address this limitation, a super-resolution recurrent diffusion model (SRDM) has been developed to enhance the temporal resolution of climate data and model the short-term uncertainty. The SRDM incorporates a pre-trained decoder and a denoising network, that generates long-term, high-resolution climate data through a recurrent coupling mechanism. The high-resolution climate data is then converted into power value using the mechanism model, enabling the simulation of wind and photovoltaic (PV) power generation capacity on future long-term scales. Case studies were conducted in the Ejina region of Inner Mongolia, China, using fifth-generation reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM outperforms existing generative models in generating super-resolution climate data. For the Ejina region, under a high-emission pathway, the annual utilization hours of wind power are projected to decrease by 2.82 hours/year, while those for PV power are projected to decrease by 0.26 hours/year. Furthermore, the research highlights the estimation biases introduced when low-resolution climate data is used for power conversion.
摘要：在全球气候变化和持续的能源转型推动下，电力供应能力与气象因素的耦合关系日益重要。长期来看，准确量化气候变化影响下的可再生能源发电能力对于可持续电力系统的发展至关重要。然而，由于跨学科对数据要求的差异，气候数据往往缺乏必要的小时分辨率来捕捉可再生能源资源的短期变化和不确定性。为了解决这一限制，开发了一种超分辨率递归扩散模型 (SRDM)，以提高气候数据的时间分辨率并模拟短期不确定性。SRDM 结合了预先训练的解码器和去噪网络，通过递归耦合机制生成长期、高分辨率的气候数据。然后使用机制模型将高分辨率气候数据转换为功率值，从而能够模拟未来长期尺度上的风能和光伏 (PV) 发电能力。案例研究在中国内蒙古额济纳地区进行，使用第五代再分析（ERA5）和耦合模式比对项目（CMIP6）数据，在两种气候路径下进行：SSP126 和 SSP585。结果表明，SRDM 在生成超分辨率气候数据方面优于现有的生成模型。对于额济纳地区，在高排放路径下，预计风电年利用小时数将减少 2.82 小时/年，而光伏发电年利用小时数将减少 0.26 小时/年。此外，研究还强调了低分辨率气候数据用于电力转换时引入的估计偏差。

Title: Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks

Authors: Naoki Sato, Koshiro Izumi, Hideaki Iiduka
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.11400
Pdf URL: https://arxiv.org/pdf/2412.11400
Copy Paste: [[2412.11400]] Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks(https://arxiv.org/abs/2412.11400)
Keywords: generative
Abstract: A scaled conjugate gradient method that accelerates existing adaptive methods utilizing stochastic gradients is proposed for solving nonconvex optimization problems with deep neural networks. It is shown theoretically that, whether with constant or diminishing learning rates, the proposed method can obtain a stationary point of the problem. Additionally, its rate of convergence with diminishing learning rates is verified to be superior to that of the conjugate gradient method. The proposed method is shown to minimize training loss functions faster than the existing adaptive methods in practical applications of image and text classification. Furthermore, in the training of generative adversarial networks, one version of the proposed method achieved the lowest Fréchet inception distance score among those of the adaptive methods.
摘要：提出了一种缩放共轭梯度法，该方法可加速现有的利用随机梯度的自适应方法，用于解决深度神经网络的非凸优化问题。从理论上证明，无论学习率是恒定的还是递减的，所提出的方法都可以获得问题的驻点。此外，还证明了其在学习率递减的情况下的收敛速度优于共轭梯度法。在图像和文本分类的实际应用中，所提出的方法比现有的自适应方法更快地最小化训练损失函数。此外，在生成对抗网络的训练中，所提出方法的一个版本在所有自适应方法中获得了最低的 Frechet 初始距离得分。
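
Note (illustrative, not from the paper): for readers unfamiliar with the baseline the abstract compares against, the sketch below shows the classical deterministic conjugate gradient update with a Fletcher-Reeves coefficient on a small convex quadratic. It is not the authors' scaled, stochastic variant, whose scaling rule is not given in the abstract; the matrix, vector, and function names are assumptions chosen for the example.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, iterations):
    """Minimize f(x) = 0.5 * x^T A x - b^T x for a symmetric positive-definite A."""
    x = list(x0)
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient of f at x
    d = [-gi for gi in g]                              # first direction: steepest descent
    for _ in range(iterations):
        gg = dot(g, g)
        if gg < 1e-24:                                 # already at a stationary point
            break
        Ad = matvec(A, d)
        alpha = gg / dot(d, Ad)                        # exact line search (valid for quadratics)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g = [gi + alpha * adi for gi, adi in zip(g, Ad)]
        beta = dot(g, g) / gg                          # Fletcher-Reeves coefficient
        d = [-gi + beta * di for gi, di in zip(g, d)]  # new direction mixes in the old one
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b, [0.0, 0.0], iterations=2)
print(solution)
```

On an n-dimensional quadratic with exact line search, this update reaches the minimizer in at most n steps, which is why two iterations suffice here. Deep-network training replaces the exact gradient and line search with stochastic gradients and a learning rate.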

Title: An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds

Authors: TianZhu Liu, BangYan Hu, YanFeng Gu, Xian Li, Aleksandra Pižurica
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.11407
Pdf URL: https://arxiv.org/pdf/2412.11407
Copy Paste: [[2412.11407]] An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds(https://arxiv.org/abs/2412.11407)
Keywords: generation
Abstract: A multispectral point cloud (MPC) captures 3D spatial-spectral information from the observed scene, which can be used for scene understanding and has a wide range of applications. However, most of the existing classification methods were extensively tested on indoor datasets, and when applied to outdoor datasets they still face problems including sparse labeled targets, differences in land-cover scales, and long-tailed distributions. To address the above issues, an enhanced classification method based on adaptive multi-scale fusion for MPCs with long-tailed distributions is proposed. In the training set generation stage, a grid-balanced sampling strategy is designed to reliably generate training samples from sparse labeled datasets. In the feature learning stage, a multi-scale feature fusion module is proposed to fuse shallow features of land covers at different scales, addressing the issue of losing fine features due to scale variations in land covers. In the classification stage, an adaptive hybrid loss module is devised to utilize multi-classification heads with adaptive weights to balance the learning ability of different classes, improving the classification performance of small classes that arise from the varying scales and long-tailed distributions of land covers. Experimental results on three MPC datasets demonstrate the effectiveness of the proposed method compared with the state-of-the-art methods.
摘要：多光谱点云（MPC）可以捕获被观测场景的三维空间光谱信息，可用于场景理解，具有广泛的应用范围。然而，现有的分类方法大多是在室内数据集上进行大量测试的，应用于室外数据集时仍然存在包括稀疏标记目标、地表覆盖尺度差异、长尾分布等问题。针对上述问题，提出了一种基于自适应多尺度融合的长尾分布MPC增强分类方法。在训练集生成阶段，设计一种网格平衡采样策略，从稀疏标记数据集中可靠地生成训练样本。在特征学习阶段，提出一种多尺度特征融合模块，融合不同尺度地表覆盖的浅层特征，解决地表覆盖尺度变化导致精细特征丢失的问题。在分类阶段，设计了一种自适应混合损失模块，利用具有自适应权重的多分类头来平衡不同类别的学习能力，提高由于土地覆盖的多尺度和长尾分布而产生的小类别的分类性能。在三个 MPC 数据集上的实验结果表明，与最新方法相比，所提方法的有效性。
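
Note (illustrative, not from the paper): the abstract does not specify how the grid-balanced sampling strategy works. One generic reading is to bucket labeled points into spatial grid cells and cap the number drawn from each cell, so that densely labeled regions do not swamp sparse ones. The sketch below follows that reading; the function name, the 2D grid, and all parameters are assumptions.

```python
import random
from collections import defaultdict


def grid_balanced_sample(points, cell_size, per_cell, seed=0):
    """Draw at most `per_cell` labeled points from each occupied 2D grid cell.

    `points` is a list of (x, y, label) tuples. This is a hypothetical sampler,
    not the strategy described in the paper.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y, label in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y, label))
    sample = []
    for key in sorted(cells):  # deterministic cell order
        bucket = cells[key]
        sample.extend(rng.sample(bucket, min(per_cell, len(bucket))))
    return sample


# One dense cell of "road" points and one sparse cell of "tree" points.
dense = [(i * 0.01, i * 0.01, "road") for i in range(100)]
sparse = [(5.1, 5.2, "tree"), (5.3, 5.4, "tree"), (5.5, 5.6, "tree")]

sample = grid_balanced_sample(dense + sparse, cell_size=1.0, per_cell=5)
tree_count = sum(1 for _, _, label in sample if label == "tree")
print(len(sample), tree_count)
```

With a cap of five per cell, the rare class keeps all three of its points while the dense class is cut from 100 to 5, which moves the class ratio from roughly 33:1 to 5:3.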

Title: Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Authors: Rui Liu, Shuwei He, Yifan Hu, Haizhou Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.11409
Pdf URL: https://arxiv.org/pdf/2412.11409
Copy Paste: [[2412.11409]] Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech(https://arxiv.org/abs/2412.11409)
Keywords: generation
Abstract: Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information are crucial for understanding the spatial environment, which previous works have ignored. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal aims to take both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale seeks to model the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation. The code and audio samples are available at: this https URL.
摘要：视觉文本转语音 (VTTS) 旨在以环境图像为提示，为口语内容合成混响语音。这项任务的挑战在于从图像中理解空间环境。人们已经进行了许多尝试，试图从空间图像的 RGB 空间中提取全局空间视觉信息。然而，局部和深度图像信息对于理解空间环境至关重要，而以前的研究却忽略了这一点。为了解决这些问题，我们提出了一种新颖的多模态和多尺度空间环境理解方案来实现沉浸式 VTTS，称为 M2SE-VTTS。多模态旨在同时利用空间图像的 RGB 和深度空间来学习更全面的空间信息，而多尺度则寻求同时对局部和全局空间知识进行建模。具体来说，我们首先将 RGB 和深度图像分成块，并采用 Gemini 生成的环境字幕来指导局部空间理解。之后，多模态和多尺度特征被局部感知的全局空间理解所整合。通过这种方式，M2SE-VTTS 有效地模拟了多模态空间环境中局部和全局空间上下文之间的交互。客观和主观评估表明，我们的模型在环境语音生成方面优于高级基线。代码和音频示例可在以下网址获取：此 https URL。

Title: Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models

Authors: Namhyuk Ahn, KiYoon Yoo, Wonhyuk Ahn, Daesik Kim, Seung-Hun Nam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11423
Pdf URL: https://arxiv.org/pdf/2412.11423
Copy Paste: [[2412.11423]] Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models(https://arxiv.org/abs/2412.11423)
Keywords: generation
Abstract: Recent advancements in diffusion models revolutionize image generation but pose risks of misuse, such as replicating artworks or generating deepfakes. Existing image protection methods, though effective, struggle to balance protection efficacy, invisibility, and latency, thus limiting practical use. We introduce perturbation pre-training to reduce latency and propose a mixture-of-perturbations approach that dynamically adapts to input images to minimize performance degradation. Our novel training strategy computes protection loss across multiple VAE feature spaces, while adaptive targeted protection at inference enhances robustness and invisibility. Experiments show comparable protection performance with improved invisibility and drastically reduced inference time. The code and demo are available at this https URL.
摘要：扩散模型的最新进展彻底改变了图像生成，但也存在滥用风险，例如复制艺术品或生成深度伪造。现有的图像保护方法虽然有效，但难以平衡保护效果、不可见性和延迟，从而限制了实际使用。我们引入了扰动预训练来减少延迟，并提出了一种混合扰动方法，可以动态适应输入图像以最大限度地减少性能下降。我们新颖的训练策略计算多个 VAE 特征空间的保护损失，而推理时的自适应目标保护则增强了鲁棒性和不可见性。实验表明，保护性能相当，不可见性得到改善，推理时间大大缩短。代码和演示可在 \url{此 https URL} 处找到

Title: Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges

Authors: Chandan K Reddy, Parshin Shojaee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11427
Pdf URL: https://arxiv.org/pdf/2412.11427
Copy Paste: [[2412.11427]] Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges(https://arxiv.org/abs/2412.11427)
Keywords: generative
Abstract: Scientific discovery is a complex cognitive process that has driven human knowledge and technological progress for centuries. While artificial intelligence (AI) has made significant advances in automating aspects of scientific reasoning, simulation, and experimentation, we still lack integrated AI systems capable of performing autonomous long-term scientific research and discovery. This paper examines the current state of AI for scientific discovery, highlighting recent progress in large language models and other AI techniques applied to scientific tasks. We then outline key challenges and promising research directions toward developing more comprehensive AI systems for scientific discovery, including the need for science-focused AI agents, improved benchmarks and evaluation metrics, multimodal scientific representations, and unified frameworks combining reasoning, theorem proving, and data-driven modeling. Addressing these challenges could lead to transformative AI tools to accelerate progress across disciplines towards scientific discovery.
摘要：科学发现是一个复杂的认知过程，几个世纪以来一直推动着人类知识和技术进步。虽然人工智能 (AI) 在科学推理、模拟和实验的自动化方面取得了重大进展，但我们仍然缺乏能够进行自主长期科学研究和发现的集成 AI 系统。本文探讨了科学发现的 AI 现状，重点介绍了大型语言模型和其他应用于科学任务的 AI 技术的最新进展。然后，我们概述了开发更全面的科学发现 AI 系统的关键挑战和有希望的研究方向，包括对以科学为中心的 AI 代理的需求、改进的基准和评估指标、多模态科学表示以及结合推理、定理证明和数据驱动建模的统一框架。应对这些挑战可能会带来变革性的 AI 工具，以加速跨学科的科学发现进程。

Title: Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On

Authors: Delong Zhang, Qiwei Huang, Yuanliu Liu, Yang Sun, Wei-Shi Zheng, Pengfei Xiong, Wei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11435
Pdf URL: https://arxiv.org/pdf/2412.11435
Copy Paste: [[2412.11435]] Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On(https://arxiv.org/abs/2412.11435)
Keywords: generation
Abstract: Image-based virtual try-on is challenging since the generated image should fit the garment to model images in various poses and keep the characteristics and details of the garment simultaneously. A popular research stream first warps the garment image to reduce the burden of the generation stage, which relies highly on the performance of the warping module. Other methods without explicit warping often lack sufficient guidance to fit the garment to the model images. In this paper, we propose FIA-VTON, which leverages the implicit warp feature by adopting a Flow Infused Attention module on virtual try-on. The dense warp flow map is projected as indirect guidance attention to enhance the feature map warping in the generation process implicitly, which is less sensitive to the warping estimation accuracy than an explicit warp of the garment image. To further enhance implicit warp guidance, we incorporate high-level spatial attention to complement the dense warp. On the VTON-HD and DressCode datasets, FIA-VTON significantly outperforms state-of-the-art methods, demonstrating that it is effective and robust for virtual try-on.
摘要：基于图像的虚拟试穿具有挑战性，因为生成的图像应适合服装以各种姿势建模图像，同时保留服装的特征和细节。一种流行的研究流首先对服装图像进行扭曲，以减轻生成阶段的负担，这在很大程度上依赖于扭曲模块的性能。没有显式扭曲的其他方法通常缺乏足够的指导来将服装与模型图像相适应。在本文中，我们提出了 FIA-VTON，它通过在虚拟试穿中采用流注入注意力模块来利用隐式扭曲特征。密集扭曲流图被投射为间接指导注意力，以隐式增强生成过程中的特征图扭曲，这与服装图像的显式扭曲相比对扭曲估计精度的敏感度较低。为了进一步增强隐式扭曲指导，我们结合了高级空间注意力来补充密集扭曲。在 VTON-HD 和 DressCode 数据集上的实验结果明显优于最先进的方法，表明 FIA-VTON 对于虚拟试穿是有效且稳健的。

Title: Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces

Authors: Nianze Tao
Subjects: cs.LG, cs.AI, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2412.11439
Pdf URL: https://arxiv.org/pdf/2412.11439
Copy Paste: [[2412.11439]] Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces(https://arxiv.org/abs/2412.11439)
Keywords: generation
Abstract: Generating novel molecules with higher properties than the training space, namely out-of-distribution generation, is important for de novo drug design. However, it is not easy for distribution learning-based models, for example diffusion models, to solve this challenge, as these methods are designed to fit the distribution of training data as closely as possible. In this paper, we show that the Bayesian flow network is capable of effortlessly generating high-quality out-of-distribution samples that meet several scenarios. We introduce a semi-autoregressive training/sampling method that helps to enhance the model performance and surpass the state-of-the-art models.
摘要：生成具有比训练空间更高属性的新分子，即分布外生成，对于从头药物设计非常重要。然而，基于分布学习的模型（例如扩散模型）很难解决这一挑战，因为这些方法旨在尽可能接近训练数据的分布。在本文中，我们展示了贝叶斯流网络能够毫不费力地生成满足多种场景的高质量分布外样本。我们引入了一种半自回归训练/采样方法，有助于提高模型性能并超越最先进的模型。

Title: FedCAR: Cross-client Adaptive Re-weighting for Generative Models in Federated Learning

Authors: Minjun Kim, Minjee Kim, Jinhoon Jeong
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11463
Pdf URL: https://arxiv.org/pdf/2412.11463
Copy Paste: [[2412.11463]] FedCAR: Cross-client Adaptive Re-weighting for Generative Models in Federated Learning(https://arxiv.org/abs/2412.11463)
Keywords: generation, generative
Abstract: Generative models trained on multi-institutional datasets can provide an enriched understanding through diverse data distributions. However, training the models on medical images is often challenging due to hospitals' reluctance to share data for privacy reasons. Federated learning (FL) has emerged as a privacy-preserving solution for training on datasets distributed across data centers by aggregating model weights from multiple clients instead of sharing raw data. Previous research has explored the adaptation of FL to generative models, yet effective aggregation algorithms specifically tailored for generative models remain unexplored. We hereby propose a novel algorithm aimed at improving the performance of generative models within FL. Our approach adaptively re-weights the contribution of each client, resulting in well-trained shared parameters. In each round, the server side measures the distribution distance between fake images generated by clients instead of directly comparing the Fréchet Inception Distance per client, thereby enhancing learning efficiency. Experimental results on three public chest X-ray datasets show superior performance in medical image generation, outperforming both centralized learning and conventional FL algorithms. Our code is available at this https URL.
摘要：在多机构数据集上训练的生成模型可以通过不同的数据分布提供丰富的理解。然而，由于医院出于隐私原因不愿共享数据，因此在医学图像上训练模型通常具有挑战性。联邦学习 (FL) 已成为一种隐私保护解决方案，用于通过聚合来自多个客户端的模型权重而不是共享原始数据来训练跨数据中心的分布式数据集。先前的研究已经探索了 FL 对生成模型的适应性，但专门针对生成模型的有效聚合算法仍未开发。我们在此提出了一种旨在提高 FL 内生成模型性能的新算法。我们的方法自适应地重新加权每个客户端的贡献，从而产生训练有素的共享参数。在每一轮中，服务器端测量客户端生成的假图像之间的分布距离，而不是直接比较每个客户端的 Fréchet Inception Distance，从而提高学习效率。在三个公共胸部 X 光数据集上的实验结果显示，它在医学图像生成方面表现出色，优于集中学习和传统的 FL 算法。我们的代码可在此 https URL 上找到。
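
Note (illustrative, not from the paper): the abstract says the server re-weights each client's contribution using a distribution distance between client-generated images, but gives no formula. The sketch below shows one generic way such re-weighting could plug into FedAvg-style aggregation. The mapping from distance to weight (a softmax over negative distances), the temperature, and the function names are all assumptions.

```python
import math


def client_weights(distances, temperature=1.0):
    """Map per-client distances to weights that sum to 1.

    Hypothetical rule: a smaller distance gives a larger weight.
    """
    scores = [math.exp(-d / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def aggregate(client_params, weights):
    """Weighted average of per-client parameter vectors (FedAvg-style)."""
    size = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(size)]


distances = [0.5, 1.0, 2.0]                     # made-up per-client distances
params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up client parameters

weights = client_weights(distances)
shared = aggregate(params, weights)
print(weights, shared)
```

Plain FedAvg corresponds to fixed weights proportional to client dataset sizes; the adaptive variant sketched here recomputes the weights every round from the measured distances.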

Title: HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

Authors: Zijian Gu, Jianwei Ma, Yan Huang, Honghao Wei, Zhanye Chen, Hui Zhang, Wei Hong
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.11489
Pdf URL: https://arxiv.org/pdf/2412.11489
Copy Paste: [[2412.11489]] HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection(https://arxiv.org/abs/2412.11489)
Keywords: generation
Abstract: Millimeter-wave radar plays a vital role in 3D object detection for autonomous driving due to its all-weather and all-lighting-condition capabilities for perception. However, radar point clouds suffer from pronounced sparsity and unavoidable angle estimation errors. To address these limitations, incorporating a camera may partially help mitigate the shortcomings. Nevertheless, the direct fusion of radar and camera data can lead to negative or even opposite effects due to the lack of depth information in images and low-quality image features under adverse lighting conditions. Hence, in this paper, we present the radar-camera fusion network with Hybrid Generation and Synchronization (HGSFusion), designed to better fuse radar potentials and image features for 3D object detection. Specifically, we propose the Radar Hybrid Generation Module (RHGM), which fully considers the Direction-Of-Arrival (DOA) estimation errors in radar signal processing. This module generates denser radar points through different Probability Density Functions (PDFs) with the assistance of semantic information. Meanwhile, we introduce the Dual Sync Module (DSM), comprising spatial sync and modality sync, to enhance image features with radar positional information and facilitate the fusion of distinct characteristics in different modalities. Extensive experiments demonstrate the effectiveness of our approach, outperforming the state-of-the-art methods in the VoD and TJ4DRadSet datasets by 6.53% and 2.03% in RoI AP and BEV AP, respectively. The code is available at this https URL.
摘要：毫米波雷达具有全天候、全光照条件下的感知能力，在自动驾驶的 3D 物体检测中起着至关重要的作用。然而，雷达点云存在明显的稀疏性和不可避免的角度估计误差。为了解决这些限制，加入摄像头可能有助于部分缓解这些缺点。然而，雷达和摄像头数据的直接融合可能会导致负面甚至相反的效果，因为在不利的光照条件下，图像中缺乏深度信息和低质量的图像特征。因此，在本文中，我们提出了具有混合生成和同步 (HGSFusion) 的雷达-摄像头融合网络，旨在更好地融合雷达潜力和图像特征以进行 3D 物体检测。具体来说，我们提出了雷达混合生成模块 (RHGM)，它充分考虑了雷达信号处理中的到达方向 (DOA) 估计误差。该模块在语义信息的帮助下，通过不同的概率密度函数 (PDF) 生成更密集的雷达点。同时，我们引入了双同步模块 (DSM)，包括空间同步和模态同步，以使用雷达位置信息增强图像特征，并促进不同模态中不同特征的融合。大量实验证明了我们方法的有效性，在 VoD 和 TJ4DRadSet 数据集中，RoI AP 和 BEV AP 分别比最先进的方法高出 $6.53\%$ 和 $2.03\%$。代码可在此 https URL 上找到。

Title: IGR: Improving Diffusion Model for Garment Restoration from Person Image

Authors: Le Shen, Rong Huang, Zhijie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11513
Pdf URL: https://arxiv.org/pdf/2412.11513
Copy Paste: [[2412.11513]] IGR: Improving Diffusion Model for Garment Restoration from Person Image(https://arxiv.org/abs/2412.11513)
Keywords: restoration
Abstract: Garment restoration, the inverse of the virtual try-on task, focuses on restoring a standard garment from a person image, requiring accurate capture of garment details. However, existing methods often fail to preserve the identity of the garment or rely on complex processes. To address these limitations, we propose an improved diffusion model for restoring authentic garments. Our approach employs two garment extractors to independently capture low-level features and high-level semantics from the person image. Leveraging a pretrained latent diffusion model, these features are integrated into the denoising process through garment fusion blocks, which combine self-attention and cross-attention layers to align the restored garment with the person image. Furthermore, a coarse-to-fine training strategy is introduced to enhance the fidelity and authenticity of the generated garments. Experimental results demonstrate that our model effectively preserves garment identity and generates high-quality restorations, even in challenging scenarios such as complex garments or those with occlusions.
摘要：服装修复是虚拟试穿任务的逆过程，侧重于从人物图像中恢复标准服装，需要准确捕捉服装细节。然而，现有的方法往往无法保留服装的身份或依赖于复杂的过程。为了解决这些限制，我们提出了一种改进的扩散模型来恢复真实的服装。我们的方法采用两个服装提取器来独立地从人物图像中捕获低级特征和高级语义。利用预训练的潜在扩散模型，这些特征通过服装融合模块集成到去噪过程中，服装融合模块结合了自注意力和交叉注意力层，将恢复的服装与人物图像对齐。此外，还引入了从粗到细的训练策略来增强生成的服装的保真度和真实性。实验结果表明，即使在复杂服装或有遮挡的服装等具有挑战性的场景中，我们的模型也能有效地保留服装身份并生成高质量的修复。

Title: LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model

Authors: Xi Wang, Hongzhen Li, Heng Fang, Yichen Peng, Haoran Xie, Xi Yang, Chuntao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11519
Pdf URL: https://arxiv.org/pdf/2412.11519
Copy Paste: [[2412.11519]] LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model(https://arxiv.org/abs/2412.11519)
Keywords: generation
Abstract: Image rendering from line drawings is vital in design, and image generation technologies reduce costs, yet professional line drawings demand preserving complex details. Text prompts struggle with accuracy, and image translation struggles with consistency and fine-grained control. We present LineArt, a framework that transfers complex appearance onto detailed design drawings, facilitating design and artistic creation. It generates high-fidelity appearance while preserving structural accuracy by simulating hierarchical visual cognition and integrating human artistic experience to guide the diffusion process. LineArt overcomes the limitations of current methods in terms of difficulty in fine-grained control and style degradation in design drawings. It requires no precise 3D modeling, physical property specs, or network training, making it more convenient for design tasks. LineArt consists of two stages: a multi-frequency lines fusion module to supplement the input design drawing with detailed structural information and a two-part painting process for Base Layer Shaping and Surface Layer Coloring. We also present a new design drawing dataset, ProLines, for evaluation. The experiments show that LineArt performs better in accuracy, realism, and material precision compared to SOTAs.
摘要：线条图的图像渲染在设计中至关重要，图像生成技术可以降低成本，但专业的线条图需要保留复杂的细节。文本提示难以准确，图像转换难以一致性和细粒度控制。我们提出了 LineArt，这是一个将复杂外观转移到详细设计图上的框架，可促进设计和艺术创作。它通过模拟分层视觉认知并整合人类艺术经验来指导传播过程，在保留结构准确性的同时生成高保真外观。LineArt 克服了当前方法在细粒度控制困难和设计图中风格退化的局限性。它不需要精确的 3D 建模、物理属性规范或网络训练，使其更方便完成设计任务。LineArt 由两个阶段组成：一个多频线融合模块，用于为输入的设计图补充详细的结构信息，以及一个用于基础层成型和表面层着色的两部分绘画过程。我们还提出了一个新的设计图数据集 ProLines 供评估。实验表明，与 SOTA 相比，LineArt 在准确性、真实感和材料精度方面表现更好。

Title: Sequence Matters: Harnessing Video Models in Super-Resolution

Authors: Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11525
Pdf URL: https://arxiv.org/pdf/2412.11525
Copy Paste: [[2412.11525]] Sequence Matters: Harnessing Video Models in Super-Resolution(https://arxiv.org/abs/2412.11525)
Keywords: super-resolution
Abstract: 3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without involving fine-tuning or generating a 'smooth' trajectory from the trained 3D models over LR images. The experimental results show that the surprisingly simple algorithms can achieve state-of-the-art results on 3D super-resolution tasks on standard benchmark datasets, such as the NeRF-synthetic and MipNeRF-360 datasets. Project page: this https URL
摘要：3D 超分辨率旨在从低分辨率 (LR) 多视图图像重建高保真 3D 模型。早期研究主要集中于单图像超分辨率 (SISR) 模型，以将 LR 图像上采样为高分辨率图像。然而，这些方法通常缺乏视图一致性，因为它们对每张图像独立操作。尽管已经广泛探索了各种后处理技术来缓解这些不一致性，但它们尚未完全解决这些问题。在本文中，我们利用视频超分辨率 (VSR) 模型对 3D 超分辨率进行了全面研究。通过利用 VSR 模型，我们可以确保更高程度的空间一致性，并可以参考周围的空间信息，从而实现更准确和详细的重建。我们的研究结果表明，即使在缺乏精确空间对齐的序列上，VSR 模型也能表现得非常出色。鉴于这一观察结果，我们提出了一种简单而实用的方法来对齐 LR 图像，而无需从 LR 图像上的训练过的 3D 模型进行微调或生成“平滑”轨迹。实验结果表明，令人惊讶的简单算法可以在标准基准数据集（例如 NeRF-synthetic 和 MipNeRF-360 数据集）上实现 3D 超分辨率任务的最优结果。项目页面：此 https URL
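
Note (illustrative, not from the paper): the abstract does not describe how the LR images are arranged into a sequence for the VSR model. One simple heuristic, shown below as an assumption, is a greedy nearest-neighbor ordering of the views by camera position, so that consecutive "frames" are spatially close. The function name and the use of camera centers as the similarity signal are hypothetical.

```python
import math


def greedy_order(positions, start=0):
    """Order views so each next view is the closest unvisited camera position.

    `positions` is a list of (x, y, z) camera centers. Hypothetical heuristic,
    not necessarily the ordering used in the paper.
    """
    remaining = set(range(len(positions)))
    order = [start]
    remaining.remove(start)
    while remaining:
        last = positions[order[-1]]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = min(sorted(remaining), key=lambda i: math.dist(last, positions[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Five cameras along a line, listed in shuffled order.
positions = [
    (0.0, 0.0, 0.0),
    (5.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (4.0, 0.0, 0.0),
    (2.0, 0.0, 0.0),
]
order = greedy_order(positions)
print(order)  # [0, 2, 4, 3, 1]
```

The resulting index list can be used to stack the LR images into a pseudo-video before passing them to a VSR model; greedy ordering is cheap but can leave a large jump near the end of the sequence.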

Title: MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models

Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Zhulin An, Libo Huang, Boyu Diao, Fei Wang, Renshuai Tao, Yongjun Xu, Michele Magno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11549
Pdf URL: https://arxiv.org/pdf/2412.11549
Copy Paste: [[2412.11549]] MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models(https://arxiv.org/abs/2412.11549)
Keywords: generation
Abstract: Diffusion models have received wide attention in generation tasks. However, the expensive computation cost prevents the application of diffusion models in resource-constrained scenarios. Quantization emerges as a practical solution that significantly saves storage and computation by reducing the bit-width of parameters. However, the existing quantization methods for diffusion models still cause severe degradation in performance, especially under extremely low bit-widths (2-4 bit). The primary decrease in performance comes from the significant discretization of activation values at low bit quantization. Too few activation candidates are unfriendly for outlier significant weight channel quantization, and the discretized features prevent stable learning over different time steps of the diffusion model. This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. The proposed MPQ-DM mainly relies on two techniques: (1) To mitigate the quantization error caused by outlier severe weight channels, we propose an Outlier-Driven Mixed Quantization (OMQ) technique that uses kurtosis to quantify outlier salient channels and applies optimized intra-layer mixed-precision bit-width allocation to recover accuracy performance within target efficiency. (2) To robustly learn representations crossing time steps, we construct a Time-Smoothed Relation Distillation (TRD) scheme between the quantized diffusion model and its full-precision counterpart, transferring discrete and continuous latent to a unified relation space to reduce the representation inconsistency. Comprehensive experiments demonstrate that MPQ-DM achieves significant accuracy gains under extremely low bit-widths compared with SOTA quantization methods. MPQ-DM achieves a 58% FID decrease under the W2A4 setting compared with the baseline, whereas all other methods collapse.
摘要：扩散模型在生成任务中受到广泛关注，但昂贵的计算成本阻碍了扩散模型在资源受限场景中的应用。量化作为一种实用的解决方案应运而生，通过减少参数的位宽可以显著节省存储和计算量。然而，现有的扩散模型量化方法仍然会导致性能严重下降，尤其是在极低位宽（2-4 位）下。性能下降的主要原因是低位量化时激活值的显著离散化。激活候选值太少对于异常显著权重通道量化不利，离散化特征阻碍了扩散模型在不同时间步长的稳定学习。本文提出了一种用于扩散模型的混合精度量化方法 MPQ-DM。所提出的 MPQ-DM 主要依赖于两种技术：（1）为了减轻由异常严重权重通道引起的量化误差，我们提出了一种异常值驱动混合量化（OMQ）技术，该技术使用$峰度$量化异常显着通道并应用优化的层内混合精度位宽分配以在目标效率范围内恢复准确度性能。（2）为了稳健地学习跨时间步骤的表示，我们在量化扩散模型及其全精度对应模型之间构建了一个时间平滑关系蒸馏（TRD）方案，将离散和连续潜变量转移到统一的关系空间以减少表示不一致性。全面的实验表明，与 SOTA 量化方法相比，MPQ-DM 在极低位宽下实现了显着的准确度提升。与基线相比，MPQ-DM 在 W2A4 设置下实现了 58％的 FID 降低，而所有其他方法甚至崩溃。
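
Note (illustrative, not from the paper): the abstract states that kurtosis is used to quantify outlier-salient weight channels before mixed-precision bit-widths are allocated. The sketch below computes the standard Pearson kurtosis per channel and gives the highest-kurtosis channels more bits. The allocation rule (a fixed number of high-bit channels), the bit-widths, and the function names are assumptions; the paper's optimized allocation is not described in the abstract.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment divided by the squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


def allocate_bits(channels, base_bits=2, high_bits=4, num_high=1):
    """Give `high_bits` to the `num_high` channels with the largest kurtosis.

    Hypothetical allocation rule for illustration only.
    """
    scores = [kurtosis(c) for c in channels]
    ranked = sorted(range(len(channels)), key=lambda i: scores[i], reverse=True)
    bits = [base_bits] * len(channels)
    for i in ranked[:num_high]:
        bits[i] = high_bits
    return bits


channels = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],  # evenly spread weights
    [0.0, 0.0, 0.0, 0.0, 8.0],    # one large outlier
    [-1.0, 1.0, -1.0, 1.0, 0.0],  # two-sided, no outlier
]
bits = allocate_bits(channels)
print(bits)  # [2, 4, 2]
```

Kurtosis grows with the weight of a distribution's tails, so the channel with a single large outlier scores highest (3.25 here, against 1.7 and 1.25) and receives the wider bit-width.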

Title: StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Authors: Xiaokun Sun, Zeyu Cai, Zhenyu Zhang, Ying Tai, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11586
Pdf URL: https://arxiv.org/pdf/2412.11586
Copy Paste: [[2412.11586]] StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors(https://arxiv.org/abs/2412.11586)
Keywords: generation, generative
Abstract: While a haircut indicates distinct personality, existing avatar generation methods fail to model practical hair due to the general or entangled representation. We propose StrandHead, a novel text-to-3D head avatar generation method capable of generating disentangled 3D hair with strand representation. Without using 3D data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative diffusion models. To this end, we propose a series of reliable priors on shape initialization, geometric primitives, and statistical haircut features, leading to a stable optimization and text-aligned performance. Extensive experiments show that StrandHead achieves state-of-the-art realism and diversity in the generated 3D heads and hair. The generated 3D hair can also be easily imported into Unreal Engine for physical simulation and other applications. The code will be available at this https URL.
摘要：虽然发型可以反映出独特的个性，但现有的头像生成方法由于表示方式一般或纠缠不清而无法对实际的头发进行建模。我们提出了 StrandHead，这是一种新颖的文本到 3D 头部头像生成方法，能够生成具有发束表示的解开的 3D 头发。我们证明，在不使用 3D 数据进行监督的情况下，可以通过提取 2D 生成扩散模型从提示中生成逼真的发束。为此，我们提出了一系列关于形状初始化、几何图元和统计发型特征的可靠先验，从而实现稳定的优化和文本对齐性能。大量实验表明，StrandHead 实现了生成的 3D 头部和头发的最先进的真实感和多样性。生成的 3D 头发也可以轻松实现在虚幻引擎中，用于物理模拟和其他应用。代码将在此 https URL 上提供。

Title: VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Authors: Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11594
Pdf URL: https://arxiv.org/pdf/2412.11594
Copy Paste: [[2412.11594]] VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis(https://arxiv.org/abs/2412.11594)
Keywords: generation, generative
Abstract: Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.
摘要：尽管文本到图像 (T2I) 合成技术取得了快速发展，但实现精确的视觉控制仍然是一项重大挑战。现有研究试图结合多方面控制（文本和草图），旨在增强对生成图像的创作控制。然而，我们的初步研究表明，人类的表达能力远远超过了当前方法的能力。用户希望有一种更加通用的方法来适应他们多样化的创作意图，从控制单个主体到操纵整个场景构图。我们提出了 VersaGen，这是一种生成式 AI 代理，可在 T2I 合成中实现多功能视觉控制。VersaGen 允许四种类型的视觉控制：i) 单个视觉主体；ii) 多个视觉主体；iii) 场景背景；iv) 以上三者的任意组合或根本没有控制。我们在冻结的 T2I 模型上训练一个适配器，以将视觉信息容纳到以文本为主的扩散过程中。我们在 VersaGen 的推理阶段引入了三种优化策略，以改善生成结果并增强用户体验。在 COCO 和 Sketchy 上进行的综合实验验证了 VersaGen 的有效性和灵活性，定性和定量结果都证明了这一点。

Title: MeshArt: Generating Articulated Meshes with Structure-guided Transformers

Authors: Daoyi Gao, Yawar Siddiqui, Lei Li, Angela Dai
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.11596
Pdf URL: https://arxiv.org/pdf/2412.11596
Copy Paste: [[2412.11596]] MeshArt: Generating Articulated Meshes with Structure-guided Transformers(https://arxiv.org/abs/2412.11596)
Keywords: generation
Abstract: Articulated 3D object generation is fundamental for creating realistic, functional, and interactable virtual assets which are not simply static. We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. We approach articulated mesh generation in a part-by-part fashion across two stages. First, we generate a high-level articulation-aware object structure; then, based on this structural information, we synthesize each part's mesh faces. Key to our approach is modeling both articulation structures and part meshes as sequences of quantized triangle embeddings, leading to a unified hierarchical framework with transformers for autoregressive generation. Object part structures are first generated as their bounding primitives and articulation modes; a second transformer, guided by these articulation structures, then generates each part's mesh triangles. To ensure coherency among generated parts, we introduce structure-guided conditioning that also incorporates local part mesh connectivity. MeshArt shows significant improvements over state of the art, with 57.1% improvement in structure coverage and a 209-point improvement in mesh generation FID.
摘要：铰接式 3D 对象生成对于创建逼真、功能齐全且可交互的虚拟资产（而非静态资产）至关重要。我们引入了 MeshArt，这是一种基于分层变换器的方法，用于生成具有干净、紧凑几何形状的铰接式 3D 网格，让人联想到人造 3D 模型。我们以两个阶段逐个部分的方式处理铰接式网格生成。首先，我们生成高级铰接感知对象结构；然后，基于此结构信息，我们合成每个部分的网格面。我们方法的关键是将铰接结构和部分网格建模为量化三角形嵌入序列，从而形成具有用于自回归生成的变换器的统一分层框架。首先生成对象部分结构作为其边界基元和铰接模式；然后，在这些铰接结构的引导下，第二个变换器生成每个部分的网格三角形。为了确保生成的部分之间的一致性，我们引入了结构引导条件，其中还结合了局部部分网格连接。 MeshArt 与现有技术相比有显著的改进，结构覆盖率提高了 57.1%，网格生成 FID 提高了 209 点。

Title: 3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling

Authors: Zichen Tang, Hongyu Yang, Hanchen Zhang, Jiaxin Chen, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11599
Pdf URL: https://arxiv.org/pdf/2412.11599
Copy Paste: [[2412.11599]] 3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling(https://arxiv.org/abs/2412.11599)
Keywords: generation
Abstract: Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multi-view RGB videos. However, current methods that map observation space to canonical space often face challenges in capturing pose-dependent details and generalizing to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D$^2$-Actor, a novel approach featuring a pose-conditioned 3D-aware human modeling pipeline that integrates iterative 2D denoising and 3D rectifying steps. The 2D denoiser, guided by pose cues, generates detailed multi-view images that provide the rich feature set necessary for high-fidelity 3D reconstruction and pose rendering. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose an innovative sampling strategy to ensure smooth temporal continuity across frames in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D$^2$-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses. Code is available at: this https URL.
摘要：神经隐式表示和可微分渲染方面的进步显著提高了从稀疏多视角 RGB 视频中学习可动画 3D 化身的能力。然而，将观察空间映射到规范空间的当前方法在捕捉与姿势相关的细节和推广到新姿势方面往往面临挑战。虽然扩散模型在 2D 图像生成中表现出了卓越的零样本能力，但它们从 2D 输入创建可动画 3D 化身的潜力仍未得到充分探索。在这项工作中，我们引入了 3D$^2$-Actor，这是一种新颖的方法，具有姿势调节的 3D 感知人体建模管道，集成了迭代 2D 去噪和 3D 校正步骤。在姿势提示的引导下，2D 去噪器生成详细的多视角图像，提供高保真 3D 重建和姿势渲染所需的丰富特征集。除此之外，我们基于高斯的 3D 校正器通过两阶段投影策略和新颖的局部坐标表示法，以增强的 3D 一致性渲染图像。此外，我们提出了一种创新的采样策略，以确保视频合成中跨帧的平滑时间连续性。我们的方法有效地解决了传统数值解在处理不适定映射方面的局限性，从而产生了逼真且可动画化的 3D 人体化身。实验结果表明，3D$^2$-Actor 在高保真化身建模方面表现出色，并且可以稳健地推广到新颖的姿势。代码可在以下网址获得：此 https URL。

Title: CLIP-SR: Collaborative Linguistic and Image Processing for Super-Resolution

Authors: Bingwen Hu, Heng Liu, Zhedong Zheng, Ping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11609
Pdf URL: https://arxiv.org/pdf/2412.11609
Copy Paste: [[2412.11609]] CLIP-SR: Collaborative Linguistic and Image Processing for Super-Resolution(https://arxiv.org/abs/2412.11609)
Keywords: super-resolution
Abstract: Convolutional Neural Networks (CNNs) have advanced Image Super-Resolution (SR), but most CNN-based methods rely solely on pixel-based transformations, often leading to artifacts and blurring, particularly with severe downsampling (e.g., 8x or 16x). Recent text-guided SR methods attempt to leverage textual information for enhanced detail, but they frequently struggle with effective alignment, resulting in inconsistent semantic coherence. To address these limitations, we introduce a multi-modal semantic enhancement approach that combines textual semantics with visual features, effectively tackling semantic mismatches and detail loss in highly degraded LR images. Our proposed multi-modal collaborative framework enables the production of realistic and high-quality SR images at significant up-scaling factors. The framework integrates text and image inputs, employing a prompt predictor, Text-Image Fusion Block (TIFBlock), and Iterative Refinement Module alongside CLIP (Contrastive Language-Image Pretraining) features to guide a progressive enhancement process with fine-grained alignment. This alignment produces high-resolution outputs with crisp details and semantic coherence, even at large scaling factors. Through extensive comparative experiments and ablation studies, we validate the effectiveness of our approach. Additionally, by incorporating textual semantic guidance, our technique enables a degree of super-resolution editability while maintaining semantic coherence.
摘要：卷积神经网络 (CNN) 具有先进的图像超分辨率 (SR)，但大多数基于 CNN 的方法仅依赖于基于像素的转换，这通常会导致伪影和模糊，尤其是在严重下采样（例如 8 倍或 16 倍）的情况下。最近的文本引导 SR 方法试图利用文本信息来增强细节，但它们经常难以有效对齐，导致语义连贯性不一致。为了解决这些限制，我们引入了一种多模态语义增强方法，将文本语义与视觉特征相结合，有效解决严重退化的 LR 图像中的语义不匹配和细节丢失问题。我们提出的多模态协作框架能够在显着的放大因子下生成逼真的高质量 SR 图像。该框架集成了文本和图像输入，采用即时预测器、文本图像融合块 (TIFBlock) 和迭代细化模块以及 CLIP（对比语言图像预训练）功能来指导具有细粒度对齐的渐进增强过程。这种对齐方式可产生高分辨率输出，具有清晰的细节和语义连贯性，即使在较大的缩放因子下也是如此。通过大量的比较实验和消融研究，我们验证了我们方法的有效性。此外，通过结合文本语义指导，我们的技术可以在保持语义连贯性的同时实现一定程度的超分辨率可编辑性。

Title: VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Authors: Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinhong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.11621
Pdf URL: https://arxiv.org/pdf/2412.11621
Copy Paste: [[2412.11621]] VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting(https://arxiv.org/abs/2412.11621)
Keywords: generation
Abstract: Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of the video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using Text-to-Video Bridge (T2V-B) and Video-to-Text Bridge (V2T-B). They allow LLMs to guide the generation of visually-grounded text plans and textual-grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.
摘要：基于大型语言模型 (LLM) 的代理在程序任务中表现出了良好的前景，但通过文本和视频增强的多模态指令在帮助用户方面的潜力仍未得到充分探索。为了解决这一差距，我们提出了基于视觉的文本-视频提示 (VG-TVP) 方法，这是一种新颖的 LLM 赋能的多模态程序规划 (MPP) 框架。它根据指定的高级目标生成有凝聚力的文本和视频程序计划。主要挑战是实现程序计划中的文本和视觉信息量、时间连贯性和准确性。VG-TVP 利用 LLM 的零样本推理能力、视频字幕模型的视频到文本生成能力以及扩散模型的文本到视频生成能力。VG-TVP 通过提出一种新颖的字幕融合 (FoC) 方法并使用文本到视频桥 (T2V-B) 和视频到文本桥 (V2T-B) 来改善模态之间的交互。它们允许 LLM 指导基于视觉的文本计划和基于文本的视频计划的生成。为了解决适用于 MPP 的数据集稀缺的问题，我们整理了一个名为“日常生活任务程序计划”（Daily-PP）的新数据集。我们进行了全面的实验和基准测试，以评估人类的偏好（关于文本和视觉信息量、时间连贯性和计划准确性）。我们的 VG-TVP 方法在 Daily-PP 数据集上的表现优于单峰基线。

Title: Predicting the Original Appearance of Damaged Historical Documents

Authors: Zhenhua Yang, Dezhi Peng, Yongxin Shi, Yuyi Zhang, Chongyu Liu, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11634
Pdf URL: https://arxiv.org/pdf/2412.11634
Copy Paste: [[2412.11634]] Predicting the Original Appearance of Damaged Historical Documents(https://arxiv.org/abs/2412.11634)
Keywords: generation
Abstract: Historical documents encompass a wealth of cultural treasures but suffer from severe damage over time, including missing characters, paper damage, and ink erosion. However, existing document processing methods primarily focus on binarization, enhancement, etc., neglecting the repair of this damage. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset HDR28K and a diffusion-based network DiffHDR for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained using HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code are available at this https URL.
摘要：历史文献蕴含着丰富的文化宝藏，但随着时间的推移，它们遭受了严重的损坏，包括字符丢失、纸张损坏和墨水腐蚀。然而，现有的文档处理方法主要侧重于二值化、增强等，而忽略了对这些损坏的修复。为此，我们提出了一项新任务，称为历史文档修复 (HDR)，旨在预测受损历史文献的原始外观。为了填补这一领域的空白，我们提出了一个大规模数据集 HDR28K 和一个基于扩散的网络 DiffHDR，用于历史文档修复。具体来说，HDR28K 包含 28,552 个受损修复图像对，具有字符级注释和多风格降级。此外，DiffHDR 通过语义和空间信息以及精心设计的字符感知损失增强了原始扩散框架，以实现上下文和视觉连贯性。实验结果表明，使用 HDR28K 训练的所提出的 DiffHDR 明显超越了现有方法，并且在处理真实受损文档方面表现出色。值得注意的是，DiffHDR 还可以扩展到文档编辑和文本块生成，展现了其高度的灵活性和泛化能力。我们相信这项研究可以开辟文档处理的新方向，并为传承宝贵的文化和文明做出贡献。数据集和代码可在此 https URL 上找到。

Title: IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Authors: Yiren Song, Pei Yang, Hai Ci, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11638
Pdf URL: https://arxiv.org/pdf/2412.11638
Copy Paste: [[2412.11638]] IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation(https://arxiv.org/abs/2412.11638)
Keywords: generation, generative
Abstract: Recently, zero-shot methods like InstantID have revolutionized identity-preserving generation. Unlike multi-image finetuning approaches such as DreamBooth, these zero-shot methods leverage powerful facial encoders to extract identity information from a single portrait photo, enabling efficient identity-preserving generation through a single inference pass. However, this convenience introduces new threats to facial identity protection. This paper aims to safeguard portrait photos from unauthorized encoder-based customization. We introduce IDProtector, an adversarial noise encoder that applies imperceptible adversarial noise to portrait photos in a single forward pass. Our approach offers universal protection for portraits against multiple state-of-the-art encoder-based methods, including InstantID, IP-Adapter, and PhotoMaker, while ensuring robustness to common image transformations such as JPEG compression, resizing, and affine transformations. Experiments across diverse portrait datasets and generative models reveal that IDProtector generalizes effectively to unseen data and even closed-source proprietary models.
摘要：最近，像 InstantID 这样的零样本方法彻底改变了身份保护生成。与 DreamBooth 等多图像微调方法不同，这些零样本方法利用强大的面部编码器从单张肖像照片中提取身份信息，通过一次推理过程实现高效的身份保护生成。然而，这种便利给面部身份保护带来了新的威胁。本文旨在保护肖像照片免受未经授权的基于编码器的定制。我们介绍了 IDProtector，这是一种对抗性噪声编码器，可在一次前向传递中将难以察觉的对抗性噪声应用于肖像照片。我们的方法为肖像提供了针对多种最先进的基于编码器的方法（包括 InstantID、IP-Adapter 和 PhotoMaker）的通用保护，同时确保对常见图像转换（如 JPEG 压缩、调整大小和仿射变换）的鲁棒性。在不同肖像数据集和生成模型上进行的实验表明，IDProtector 可以有效地推广到看不见的数据甚至闭源专有模型。

Title: EGP3D: Edge-guided Geometric Preserving 3D Point Cloud Super-resolution for RGB-D camera

Authors: Zheng Fang, Ke Ye, Yaofang Liu, Gongzhe Li, Xianhong Zhao, Jialong Li, Ruxin Wang, Yuchen Zhang, Xiangyang Ji, Qilin Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11680
Pdf URL: https://arxiv.org/pdf/2412.11680
Copy Paste: [[2412.11680]] EGP3D: Edge-guided Geometric Preserving 3D Point Cloud Super-resolution for RGB-D camera(https://arxiv.org/abs/2412.11680)
Keywords: super-resolution
Abstract: Point clouds or depth images captured by current RGB-D cameras often suffer from low resolution, rendering them insufficient for applications such as 3D reconstruction and robotics. Existing point cloud super-resolution (PCSR) methods are either constrained by geometric artifacts or lack attention to edge details. To address these issues, we propose an edge-guided geometric-preserving 3D point cloud super-resolution (EGP3D) method tailored for RGB-D cameras. Our approach innovatively optimizes the point cloud with an edge constraint on a projected 2D space, thereby ensuring high-quality edge preservation in the 3D PCSR task. To tackle geometric optimization challenges in super-resolution point clouds, particularly preserving edge shapes and smoothness, we introduce a multi-faceted loss function that simultaneously optimizes the Chamfer distance, Hausdorff distance, and gradient smoothness. Existing datasets used for point cloud upsampling are predominantly synthetic and inadequately represent real-world scenarios, neglecting noise and stray light effects. To address the scarcity of realistic RGB-D data for PCSR tasks, we built a dataset that captures real-world noise and stray-light effects, offering a more accurate representation of authentic environments. Validated through simulations and real-world experiments, the proposed method exhibited superior performance in preserving edge clarity and geometric details.
摘要：当前 RGB-D 相机捕获的点云或深度图像通常分辨率较低，不足以满足 3D 重建和机器人等应用的需求。现有的点云超分辨率 (PCSR) 方法要么受到几何伪影的限制，要么缺乏对边缘细节的关注。为了解决这些问题，我们提出了一种专为 RGB-D 相机量身定制的边缘引导几何保留 3D 点云超分辨率 (EGP3D) 方法。我们的方法创新地优化了投影 2D 空间上的边缘约束点云，从而确保了 3D PCSR 任务中的高质量边缘保留。为了解决超分辨率点云中的几何优化挑战，特别是保留边缘形状和平滑度，我们引入了一个多面损失函数，可同时优化倒角距离、豪斯多夫距离和梯度平滑度。用于点云上采样的现有数据集主要是合成的，不能充分代表真实世界场景，忽略了噪声和杂散光效应。为了解决 PCSR 任务中真实 RGB-D 数据稀缺的问题，我们构建了一个数据集，该数据集可以捕捉真实世界的噪声和杂散光效应，从而更准确地呈现真实环境。通过模拟和真实世界实验验证，所提出的方法在保留边缘清晰度和几何细节方面表现出色。
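
The EGP3D abstract names the Chamfer distance and the Hausdorff distance as terms of its loss. As a rough illustration of what those two quantities measure, here is a minimal pure-Python sketch. It is not the authors' loss: the function names, the toy point clouds, and the use of unsquared Euclidean distance are assumptions made for the example (some formulations use squared distances or different normalization).

```python
import math

def _nearest(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, summed over both directions."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour distance."""
    return max(max(_nearest(p, b) for p in a), max(_nearest(q, a) for q in b))

# Hypothetical toy clouds: the target has one point that the prediction misses.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

print(chamfer_distance(pred, target))
print(hausdorff_distance(pred, target))
```

The Chamfer term averages the mismatch over all points, while the Hausdorff term is driven by the single worst outlier, which is one plausible reason to combine them when edge points matter.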

Title: AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Authors: Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11706
Pdf URL: https://arxiv.org/pdf/2412.11706
Copy Paste: [[2412.11706]] AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration(https://arxiv.org/abs/2412.11706)
Keywords: restoration, generation
Abstract: Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens based on their redundancy to enhance both acceleration and generation quality. We further propose matching cache to facilitate faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising the quality.
摘要：视频扩散变换器 (DiT) 已显示出生成高保真视频的巨大潜力，但计算量很大。现有的加速方法包括蒸馏（需要昂贵的再训练）和特征缓存（对网络架构高度敏感）。最近的 token 减少方法是无需训练且与架构无关的，提供了更大的灵活性和更广泛的适用性。然而，它们在不同的组件上强制相同的序列长度，限制了它们的加速潜力。我们观察到视频 DiT 中的序列内冗余因特征、块和去噪时间步长而异。基于这一观察，我们提出了非对称减少和恢复 (AsymRnR)，这是一种无需训练的方法来加速视频 DiT。它提供了一种灵活且自适应的策略，可以根据冗余度减少 token 的数量，从而提高加速和生成质量。我们进一步提出了匹配缓存以促进更快的处理。AsymRnR 集成到最先进的视频 DiT 中，在不影响质量的情况下实现了卓越的加速。

Title: Transferable Adversarial Face Attack with Text Controlled Attribute

Authors: Wenyun Li, Zheng Zhang, Xiangyuan Lan, Dongmei Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11735
Pdf URL: https://arxiv.org/pdf/2412.11735
Copy Paste: [[2412.11735]] Transferable Adversarial Face Attack with Text Controlled Attribute(https://arxiv.org/abs/2412.11735)
Keywords: generative
Abstract: Traditional adversarial attacks typically produce adversarial examples under norm-constrained conditions, whereas unrestricted adversarial examples are free-form with semantically meaningful perturbations. Current unrestricted adversarial impersonation attacks exhibit limited control over adversarial face attributes and often suffer from low transferability. In this paper, we propose a novel Text Controlled Attribute Attack (TCA$^2$) to generate photorealistic adversarial impersonation faces guided by natural language. Specifically, the category-level personal softmax vector is employed to precisely guide the impersonation attacks. Additionally, we propose both data and model augmentation strategies to achieve transferable attacks on unknown target models. Finally, a generative model, i.e., Style-GAN, is utilized to synthesize impersonated faces with desired attributes. Extensive experiments on two high-resolution face recognition datasets validate that our TCA$^2$ method can generate natural text-guided adversarial impersonation faces with high transferability. We also evaluate our method on real-world face recognition systems, i.e., Face++ and Aliyun, further demonstrating the practical potential of our approach.
摘要：传统对抗攻击通常在规范约束条件下产生对抗性示例，而不受限制的对抗性示例则是具有语义上有意义的扰动的自由形式。当前不受限制的对抗性模仿攻击对对抗性人脸属性的控制有限，并且通常具有较低的可转移性。在本文中，我们提出了一种新颖的文本控制属性攻击 (TCA$^2$)，以生成由自然语言引导的照片般逼真的对抗性模仿面孔。具体而言，类别级个人 softmax 向量用于精确引导模仿攻击。此外，我们提出了数据和模型增强策略，以实现对未知目标模型的可转移攻击。最后，利用生成模型 \textit{i.e} Style-GAN 来合成具有所需属性的模仿面孔。在两个高分辨率人脸识别数据集上进行的大量实验验证了我们的 TCA$^2$ 方法可以生成具有高可转移性的自然文本引导的对抗性模仿面孔。我们还在现实世界的人脸识别系统、\textit{i.e}、Face++ 和 Aliyun 上评估了我们的方法，进一步证明了我们方法的实际潜力。

Title: Generative Inbetweening through Frame-wise Conditions-Driven Video Generation

Authors: Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11755
Pdf URL: https://arxiv.org/pdf/2412.11755
Copy Paste: [[2412.11755]] Generative Inbetweening through Frame-wise Conditions-Driven Video Generation(https://arxiv.org/abs/2412.11755)
Keywords: generation, generative
Abstract: Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining temporal stability due to the ambiguous interpolation path between two key frames. This issue becomes particularly severe when there is a large motion gap between input frames. In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Specifically, our FCVG provides an explicit condition for each frame, making it much easier to identify the interpolation path between two input frames and thus ensuring temporally stable production of visually plausible video frames. To achieve this, we suggest extracting matched lines from two input frames that can then be easily interpolated frame by frame, serving as frame-wise conditions seamlessly integrated into existing video generation models. In extensive evaluations covering diverse scenarios such as natural landscapes, complex human poses, camera movements and animations, existing methods often exhibit incoherent transitions across frames. In contrast, our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear interpolation curves. Our project page and code are available at this https URL.
摘要：生成式中间帧旨在利用两个关键帧作为输入来生成中间帧序列。尽管视频生成模型取得了显著进展，但由于两个关键帧之间的插值路径不明确，生成式中间帧在保持时间稳定性方面仍然面临挑战。当输入帧之间存在较大的运动间隙时，这个问题变得尤为严重。在本文中，我们提出了一种简单但高效的逐帧条件驱动视频生成 (FCVG) 方法，该方法显著提高了插值视频帧的时间稳定性。具体来说，我们的 FCVG 为每个帧提供了一个明确的条件，使得识别两个输入帧之间的插值路径变得更加容易，从而确保生成时间稳定、视觉上可信的视频帧。为了实现这一点，我们建议从两个输入帧中提取匹配的线，然后可以轻松地逐帧进行插值，作为逐帧条件无缝集成到现有的视频生成模型中。在涵盖自然景观、复杂人体姿势、相机运动和动画等各种场景的广泛评估中，现有方法通常表现出跨帧不连贯的过渡。相比之下，我们的 FCVG 展示了使用线性和非线性插值曲线生成时间稳定视频的能力。我们的项目页面和代码可在 \url{此 https URL} 上找到。
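
The FCVG abstract describes interpolating matched lines between two key frames to obtain one explicit condition per intermediate frame. The sketch below illustrates only the linear case, under assumptions of my own: lines are already matched one-to-one, each line is a pair of 2D endpoints, and `interpolate_lines` is a hypothetical helper rather than anything from the paper's code.

```python
def lerp(a, b, t):
    """Linear interpolation between scalars a and b at parameter t in [0, 1]."""
    return a + (b - a) * t

def interpolate_lines(lines_start, lines_end, num_inbetween):
    """Return one list of interpolated line segments per intermediate frame.

    lines_start and lines_end hold matched segments ((x1, y1), (x2, y2)),
    in the same order for both key frames (assumed, not checked).
    """
    frames = []
    for i in range(1, num_inbetween + 1):
        t = i / (num_inbetween + 1)  # evenly spaced, excluding both key frames
        frame = [
            tuple((lerp(p0[0], p1[0], t), lerp(p0[1], p1[1], t))
                  for p0, p1 in zip(seg0, seg1))
            for seg0, seg1 in zip(lines_start, lines_end)
        ]
        frames.append(frame)
    return frames

# Hypothetical example: one matched line that moves and tilts between key frames.
key_a = [((0.0, 0.0), (10.0, 0.0))]
key_b = [((0.0, 8.0), (10.0, 4.0))]
conditions = interpolate_lines(key_a, key_b, num_inbetween=3)
for frame in conditions:
    print(frame)
```

A non-linear interpolation curve, which the abstract also mentions, could be approximated by remapping `t` through an easing function before calling `lerp`.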

Title: IDEA-Bench: How Far are Generative Models from Professional Designing?

Authors: Chen Liang, Lianghua Huang, Jingwu Fang, Huanzhang Dou, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Junge Zhang, Xin Zhao, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11767
Pdf URL: https://arxiv.org/pdf/2412.11767
Copy Paste: [[2412.11767]] IDEA-Bench: How Far are Generative Models from Professional Designing?(https://arxiv.org/abs/2412.11767)
Keywords: generation, generative
Abstract: Real-world design tasks - such as picture book creation, film storyboard development using character sets, photo retouching, visual effects, and font transfer - are highly diverse and complex, requiring deep interpretation and extraction of various elements from instructions, descriptions, and reference images. The resulting images often implicitly capture key features from references or user inputs, making it challenging to develop models that can effectively address such varied tasks. While existing visual generative models can produce high-quality images based on prompts, they face significant limitations in professional design scenarios that involve varied forms and multiple inputs and outputs, even when enhanced with adapters like ControlNets and LoRAs. To address this, we introduce IDEA-Bench, a comprehensive benchmark encompassing 100 real-world design tasks, including rendering, visual effects, storyboarding, picture books, fonts, style-based, and identity-preserving generation, with 275 test cases to thoroughly evaluate a model's general-purpose generation capabilities. Notably, even the best-performing model only achieves 22.48 on IDEA-Bench, while the best general-purpose model only achieves 6.81. We provide a detailed analysis of these results, highlighting the inherent challenges and providing actionable directions for improvement. Additionally, we provide a subset of 18 representative tasks equipped with multimodal large language model (MLLM)-based auto-evaluation techniques to facilitate rapid model development and comparison. We release the benchmark data, evaluation toolkits, and an online leaderboard at this https URL, aiming to drive the advancement of generative models toward more versatile and applicable intelligent design systems.
摘要：现实世界的设计任务（例如图画书创作、使用角色集开发电影故事板、照片修饰、视觉效果和字体转换）高度多样化和复杂，需要从说明、描述和参考图像中深入解释和提取各种元素。生成的图像通常会隐式捕获参考或用户输入中的关键特征，这使得开发能够有效处理如此多样化任务的模型具有挑战性。虽然现有的视觉生成模型可以根据提示生成高质量的图像，但它们在涉及多种形式和多种输入和输出的专业设计场景中面临着很大的限制，即使使用 ControlNets 和 LoRA 等适配器进行增强也是如此。为了解决这个问题，我们推出了 IDEA-Bench，这是一个全面的基准，涵盖 100 个现实世界的设计任务，包括渲染、视觉效果、故事板、图画书、字体、基于样式和身份保留的生成，并有 275 个测试用例来全面评估模型的通用生成能力。值得注意的是，即使是表现最好的模型在 IDEA-Bench 上也只能达到 22.48，而最好的通用模型也只能达到 6.81。我们对这些结果进行了详细分析，强调了固有的挑战并提供了可行的改进方向。此外，我们还提供了 18 个代表性任务的子集，这些任务配备了基于多模态大型语言模型 (MLLM) 的自动评估技术，以促进快速的模型开发和比较。我们在此 https URL 上发布了基准数据、评估工具包和在线排行榜，旨在推动生成模型向更通用、更适用的智能设计系统发展。

Title: Fast and Slow Gradient Approximation for Binary Neural Network Optimization

Authors: Xinquan Chen, Junqi Gao, Biqing Qi, Dong Li, Yiang Luo, Fangyuan Li, Pengfei Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11777
Pdf URL: https://arxiv.org/pdf/2412.11777
Copy Paste: [[2412.11777]] Fast and Slow Gradient Approximation for Binary Neural Network Optimization(https://arxiv.org/abs/2412.11777)
Keywords: generation
Abstract: Binary Neural Networks (BNNs) have garnered significant attention due to their immense potential for deployment on edge devices. However, the non-differentiability of the quantization function poses a challenge for the optimization of BNNs, as its derivative cannot be backpropagated. To address this issue, hypernetwork based methods, which utilize neural networks to learn the gradients of non-differentiable quantization functions, have emerged as a promising approach due to their adaptive learning capabilities to reduce estimation errors. However, existing hypernetwork based methods typically rely solely on current gradient information, neglecting the influence of historical gradients. This oversight can lead to accumulated gradient errors when calculating gradient momentum during optimization. To incorporate historical gradient information, we design a Historical Gradient Storage (HGS) module, which models the historical gradient sequence to generate the first-order momentum required for optimization. To further enhance gradient generation in hypernetworks, we propose a Fast and Slow Gradient Generation (FSG) method. Additionally, to produce more precise gradients, we introduce Layer Recognition Embeddings (LRE) into the hypernetwork, facilitating the generation of layer-specific fine gradients. Extensive comparative experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate that our method achieves faster convergence and lower loss values, outperforming existing baselines. Code is available at this http URL.
摘要：二元神经网络 (BNN) 因其在边缘设备上部署的巨大潜力而备受关注。然而，量化函数的不可微性对 BNN 的优化提出了挑战，因为其导数无法反向传播。为了解决这个问题，基于超网络的方法利用神经网络来学习不可微量化函数的梯度，由于其自适应学习能力可以减少估计误差，因此已成为一种有前途的方法。然而，现有的基于超网络的方法通常仅依赖于当前梯度信息，而忽略了历史梯度的影响。这种疏忽可能导致在优化过程中计算梯度动量时出现累积梯度误差。为了整合历史梯度信息，我们设计了一个历史梯度存储 (HGS) 模块，它对历史梯度序列进行建模以生成优化所需的一阶动量。为了进一步增强超网络中的梯度生成，我们提出了一种快速和慢速梯度生成 (FSG) 方法。此外，为了产生更精确的梯度，我们在超网络中引入了层识别嵌入 (LRE)，从而有助于生成特定于层的精细梯度。在 CIFAR-10 和 CIFAR-100 数据集上进行的大量比较实验表明，我们的方法实现了更快的收敛速度和更低的损失值，优于现有的此 http URL 可在此 http URL 获得。
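
Two ingredients in the abstract above can be made concrete with a small sketch: a binarizing quantizer, whose derivative is zero almost everywhere, and the first-order momentum that an optimizer accumulates from past gradients. The code below assumes a plain sign quantizer and a standard exponential moving average; it does not implement the paper's HGS, FSG, or LRE components, and both function names are hypothetical.

```python
def sign_binarize(weights):
    """Map real-valued weights to {-1, +1}; the step at 0 is what makes this non-differentiable."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def ema_momentum(grad_history, beta=0.9):
    """Conventional first-order momentum: m_t = beta * m_{t-1} + (1 - beta) * g_t.

    grad_history is a list of gradient vectors, oldest first. This is the usual
    hand-written rule, shown only as a reference point for a learned alternative.
    """
    m = [0.0] * len(grad_history[0])
    for g in grad_history:
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
    return m

# Hypothetical values for illustration.
latent_weights = [-0.3, 0.0, 2.0]
print(sign_binarize(latent_weights))

history = [[1.0, -2.0], [1.0, 2.0]]
print(ema_momentum(history, beta=0.5))
```

Because the moving average mixes every past gradient, an error in any stored gradient persists in later momentum values, which is the accumulation problem the abstract refers to.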

Title: Impact of Face Alignment on Face Image Quality

Authors: Eren Onaran, Erdi Sarıtaş, Hazım Kemal Ekenel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11779
Pdf URL: https://arxiv.org/pdf/2412.11779
Copy Paste: [[2412.11779]] Impact of Face Alignment on Face Image Quality(https://arxiv.org/abs/2412.11779)
Keywords: quality assessment
Abstract: Face alignment is a crucial step in preparing face images for feature extraction in facial analysis tasks. For applications such as face recognition, facial expression recognition, and facial attribute classification, alignment is widely utilized during both training and inference to standardize the positions of key landmarks in the face. It is well known that the application and method of face alignment significantly affect the performance of facial analysis models. However, the impact of alignment on face image quality has not been thoroughly investigated. Current FIQA studies often assume alignment as a prerequisite but do not explicitly evaluate how alignment affects quality metrics, especially with the advent of modern deep learning-based detectors that integrate detection and landmark localization. To address this need, our study examines the impact of face alignment on face image quality scores. We conducted experiments on the LFW, IJB-B, and SCFace datasets, employing MTCNN and RetinaFace models for face detection and alignment. To evaluate face image quality, we utilized several assessment methods, including SER-FIQ, FaceQAN, DifFIQA, and SDD-FIQA. Our analysis included examining quality score distributions for the LFW and IJB-B datasets and analyzing average quality scores at varying distances in the SCFace dataset. Our findings reveal that face image quality assessment methods are sensitive to alignment. Moreover, this sensitivity increases under challenging real-life conditions, highlighting the importance of evaluating alignment's role in quality assessment.
摘要：面部对齐是准备面部图像以进行面部分析任务中的特征提取的关键步骤。对于面部识别、面部表情识别和面部属性分类等应用，对齐在训练和推理过程中被广泛使用，以标准化面部关键标志的位置。众所周知，面部对齐的应用和方法会显著影响面部分析模型的性能。然而，对齐对面部图像质量的影响尚未得到彻底研究。当前的 FIQA 研究通常假设对齐是先决条件，但没有明确评估对齐如何影响质量指标，尤其是随着集成检测和标志定位的现代基于深度学习的检测器的出现。为了满足这一需求，我们的研究检查了面部对齐对面部图像质量分数的影响。我们在 LFW、IJB-B 和 SCFace 数据集上进行了实验，采用 MTCNN 和 RetinaFace 模型进行面部检测和对齐。为了评估面部图像质量，我们使用了几种评估方法，包括 SER-FIQ、FaceQAN、DifFIQA 和 SDD-FIQA。我们的分析包括检查 LFW 和 IJB-B 数据集的质量分数分布，以及分析 SCFace 数据集中不同距离的平均质量分数。我们的研究结果表明，人脸图像质量评估方法对对齐非常敏感。此外，在具有挑战性的现实条件下，这种敏感性会增强，这凸显了评估对齐在质量评估中的作用的重要性。

Title: InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Authors: Rick Akkerman, Haiwen Feng, Michael J. Black, Dimitrios Tzionas, Victoria Fernández Abrevaya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11785
Pdf URL: https://arxiv.org/pdf/2412.11785
Copy Paste: [[2412.11785]] InterDyn: Controllable Interactive Dynamics with Video Diffusion Models(https://arxiv.org/abs/2412.11785)
Keywords: generation, generative
Abstract: Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state which neglects the continuous motion and subsequent dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines.
摘要：预测交互对象的动态对于人类和智能系统都至关重要。然而，现有的方法仅限于简化的玩具设置，缺乏对复杂的现实世界环境的普遍性。生成模型的最新进展使得能够根据干预来预测状态转换，但专注于生成单一的未来状态，而忽略了交互产生的连续运动和后续动态。为了解决这一差距，我们提出了 InterDyn，这是一个新颖的框架，它根据初始帧和编码驱动对象或参与者运动的控制信号生成交互式动态视频。我们的主要见解是，大型视频基础模型可以通过从大规模视频数据中学习交互式动态，同时充当神经渲染器和隐式物理模拟器。为了有效地利用这种能力，我们引入了一种交互式控制机制，该机制根据驱动实体的运动来调节视频生成过程。定性结果表明，InterDyn 可以生成复杂对象交互的可信、时间一致的视频，同时可以推广到看不见的对象。定量评估表明，InterDyn 的表现优于专注于静态状态转换的基线。这项工作凸显了利用视频生成模型作为隐式物理引擎的潜力。

Title: AMI-Net: Adaptive Mask Inpainting Network for Industrial Anomaly Detection and Localization

Authors: Wei Luo, Haiming Yao, Wenyong Yu, Zhengyong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11802
Pdf URL: https://arxiv.org/pdf/2412.11802
Copy Paste: [[2412.11802]] AMI-Net: Adaptive Mask Inpainting Network for Industrial Anomaly Detection and Localization(https://arxiv.org/abs/2412.11802)
Keywords: restoration
Abstract: Unsupervised visual anomaly detection is crucial for enhancing industrial production quality and efficiency. Among unsupervised methods, reconstruction approaches are popular due to their simplicity and effectiveness. The key aspect of reconstruction methods lies in the restoration of anomalous regions, which current methods have not satisfactorily achieved. To tackle this issue, we introduce a novel Adaptive Mask Inpainting Network (AMI-Net) from the perspective of adaptive mask-inpainting. In contrast to traditional reconstruction methods that treat non-semantic image pixels as targets, our method uses a pre-trained network to extract multi-scale semantic features as reconstruction targets. Given the multiscale nature of industrial defects, we incorporate a training strategy involving random positional and quantitative masking. Moreover, we propose an innovative adaptive mask generator capable of generating adaptive masks that effectively mask anomalous regions while preserving normal regions. In this manner, the model can leverage the visible normal global contextual information to restore the masked anomalous regions, thereby effectively suppressing the reconstruction of defects. Extensive experimental results on the MVTec AD and BTAD industrial datasets validate the effectiveness of the proposed method. Additionally, AMI-Net exhibits exceptional real-time performance, striking a favorable balance between detection accuracy and speed, rendering it highly suitable for industrial applications. Code is available at: this https URL
摘要：无监督视觉异常检测对于提高工业生产质量和效率至关重要。在无监督方法中，重建方法因其简单性和有效性而广受欢迎。重建方法的关键方面在于异常区域的恢复，而当前的方法尚未令人满意地实现这一目标。为了解决这个问题，我们从自适应掩模修复的角度引入了一种新颖的 \uline{A} 自适应 \uline{M}ask \uline{I} npainting \uline{Net}work (AMI-Net)。与将非语义图像像素视为目标的传统重建方法相比，我们的方法使用预先训练的网络来提取多尺度语义特征作为重建目标。鉴于工业缺陷的多尺度性质，我们采用了一种涉及随机位置和定量掩蔽的训练策略。此外，我们提出了一种创新的自适应掩模生成器，能够生成自适应掩模，有效掩盖异常区域，同时保留正常区域。通过这种方式，该模型可以利用可见的正常全局上下文信息来恢复被掩盖的异常区域，从而有效地抑制缺陷的重建。在MVTec AD和BTAD工业数据集上的大量实验结果验证了该方法的有效性。此外，AMI-Net表现出卓越的实时性能，在检测精度和速度之间取得了良好的平衡，非常适合工业应用。代码可从以下网址获取：此https URL

Title: ColorFlow: Retrieval-Augmented Image Sequence Colorization

Authors: Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11815
Pdf URL: https://arxiv.org/pdf/2412.11815
Copy Paste: [[2412.11815]] ColorFlow: Retrieval-Augmented Image Sequence Colorization(https://arxiv.org/abs/2412.11815)
Keywords: generative
Abstract: Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial application. To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: this https URL.
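Summary: Automatically colorizing black-and-white image sequences while preserving character and object identity (ID) is a complex task with substantial market demand, for example in cartoon or comic series colorization. Although large-scale generative models such as diffusion models have advanced visual colorization, challenges in controllability and identity consistency remain, making current solutions unsuitable for industrial applications. We propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel, robust, and generalizable Retrieval Augmented Colorization pipeline that colorizes images using relevant color references. The pipeline also has a dual-branch design: one branch for color identity extraction and the other for colorization, drawing on the strengths of diffusion models. We use the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our code and models on our project page: this https URL.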

Title: UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer

Authors: Dhruv Sharma, Chhavi Dhiman, Dinesh Kumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11836
Pdf URL: https://arxiv.org/pdf/2412.11836
Copy Paste: [[2412.11836]] UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer(https://arxiv.org/abs/2412.11836)
Keywords: generation
Abstract: Image captioning, the generation of natural language descriptions of images, has gained immense popularity in the recent past. Accordingly, different deep-learning techniques have been devised for the development of factual and stylized image captioning models. Previous models focused on generating factual and stylized captions separately, providing more than one caption for a single image. The descriptions generated by these models suffer from out-of-vocabulary and repetition issues. To the best of our knowledge, no existing work provides a description that integrates different captioning methods to describe the contents of an image with factual and stylized (romantic and humorous) elements. To overcome these limitations, this paper presents a novel Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT) based Captioning Framework. It utilizes both factual captions and stylized captions generated by the Modified Adaptive Attention-based factual image captioning model (MAA-FIC) and the Style Factored Bi-LSTM with attention (SF-Bi-ALSTM) driven stylized image captioning model, respectively. The SF-Bi-ALSTM-based stylized IC model generates two prominent styles of expression: romance and humor. The proposed summarizer UnMHA-ST combines both factual and stylized descriptions of an input image to generate style-rich, coherent summarized captions. The proposed UnMHA-ST transformer learns and summarizes different linguistic styles efficiently by incorporating the proposed word embedding, fastText with Attention Word Embedding (fTA-WE), and a pointer-generator network with a coverage mechanism to solve the out-of-vocabulary and repetition problems. Extensive experiments are conducted on Flickr8K and a subset of FlickrStyle10K, with supporting ablation studies, to demonstrate the efficiency and efficacy of the proposed framework.
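Summary: Image captioning, the process of describing images in natural language, has become increasingly popular in recent years. Different deep-learning techniques have been devised to build factual and stylized image captioning models. Previous models focused on generating factual and stylized captions separately, producing multiple captions for a single image, and the resulting descriptions suffer from out-of-vocabulary and repetition issues. To the best of our knowledge, no existing work integrates different captioning methods to describe the contents of an image with both factual and stylized (romantic and humorous) elements. To overcome these limitations, this paper presents a novel captioning framework based on the Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT). It uses factual captions generated by the Modified Adaptive Attention-based factual image captioning model (MAA-FIC) and stylized captions generated by the Style Factored Bi-LSTM with attention (SF-Bi-ALSTM) driven stylized image captioning model. The SF-Bi-ALSTM-based stylized IC model generates two prominent styles of expression: romance and humor. The proposed summarizer UnMHA-ST combines the factual and stylized descriptions of an input image to generate style-rich, coherent summarized captions. The UnMHA-ST transformer incorporates the proposed word embedding, fastText with Attention Word Embedding (fTA-WE), and a pointer-generator network with a coverage mechanism, learning and summarizing different linguistic styles efficiently while addressing the out-of-vocabulary and repetition problems. Extensive experiments on Flickr8K and a subset of FlickrStyle10K, supported by ablation studies, demonstrate the efficiency and efficacy of the proposed framework.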
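
The UnMA-CapSumT abstract above names a pointer-generator network with a coverage mechanism as its remedy for out-of-vocabulary words and repetition. The sketch below illustrates that general idea only, following the standard pointer-generator formulation rather than the authors' code: the output distribution mixes a vocabulary distribution with a copy distribution over source tokens, and a coverage penalty discourages attending to the same source position again. All names (`final_distribution`, `coverage_loss`, the toy vocabulary of size 4) are hypothetical and chosen for illustration.

```python
import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_gen, extended_size):
    """Mix generating from the vocabulary with copying from the source.

    p_vocab:    probabilities over the fixed vocabulary (sums to 1)
    attention:  attention weights over source positions (sums to 1)
    source_ids: id of each source token in the extended vocabulary;
                ids >= len(p_vocab) stand for out-of-vocabulary words
    p_gen:      probability of generating rather than copying
    """
    dist = np.zeros(extended_size)
    dist[: len(p_vocab)] = p_gen * p_vocab
    # np.add.at accumulates correctly when a word occurs several times in the source.
    np.add.at(dist, source_ids, (1.0 - p_gen) * attention)
    return dist

def coverage_loss(attention, coverage):
    """Penalty for re-attending to positions already covered in earlier steps."""
    return float(np.minimum(attention, coverage).sum())

# Toy example (hypothetical numbers): vocabulary of 4 words, source of 3 tokens.
p_vocab = np.array([0.1, 0.6, 0.2, 0.1])
attention = np.array([0.5, 0.3, 0.2])
source_ids = np.array([1, 4, 1])  # id 4 is an out-of-vocabulary source word

dist = final_distribution(p_vocab, attention, source_ids, p_gen=0.7, extended_size=5)
print("extended distribution:", dist)

coverage = np.zeros(3)  # nothing attended yet
print("first-step coverage loss:", coverage_loss(attention, coverage))
coverage += attention   # coverage is the running sum of past attention
print("loss if the same attention repeats:", coverage_loss(attention, coverage))
```

In this toy setup the out-of-vocabulary word receives probability mass only through the copy term, which is why copying helps with rare words, and the coverage loss rises from zero once the decoder attends to positions it has already used.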

Title: Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data

Authors: Onur Tasar, Clément Chadebec, Benjamin Aubin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11972
Pdf URL: https://arxiv.org/pdf/2412.11972
Copy Paste: [[2412.11972]] Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data(https://arxiv.org/abs/2412.11972)
Keywords: generation
Abstract: Realistic shadow generation is a critical component for high-quality image compositing and visual effects, yet existing methods suffer from certain limitations: physics-based approaches require 3D scene geometry, which is often unavailable, while learning-based techniques struggle with control and visual artifacts. We introduce a novel method for fast, controllable, and background-free shadow generation for 2D object images. We create a large synthetic dataset using a 3D rendering engine to train a diffusion model for controllable shadow generation, generating shadow maps for diverse light source parameters. Through extensive ablation studies, we find that the rectified flow objective achieves high-quality results with just a single sampling step, enabling real-time applications. Furthermore, our experiments demonstrate that the model generalizes well to real-world images. To facilitate further research in evaluating quality and controllability in shadow generation, we release a new public benchmark containing a diverse set of object images and shadow maps in various settings. The project page is available at this https URL
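Summary: Realistic shadow generation is a key component of high-quality image compositing and visual effects, but existing methods have limitations: physics-based approaches require 3D scene geometry, which is often unavailable, while learning-based techniques struggle with control and visual artifacts. We introduce a novel method for fast, controllable, and background-free shadow generation for 2D object images. Using a 3D rendering engine, we create a large synthetic dataset to train a diffusion model for controllable shadow generation, producing shadow maps for diverse light source parameters. Through extensive ablation studies, we find that the rectified flow objective achieves high-quality results with a single sampling step, enabling real-time applications. Our experiments also show that the model generalizes well to real-world images. To support further research on evaluating quality and controllability in shadow generation, we release a new public benchmark containing a diverse set of object images and shadow maps in various settings. The project page is available at this https URL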
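
The abstract above credits a rectified flow objective with enabling single-step sampling. As a rough illustration of why straight transport paths allow this (a toy sketch, not the paper's model), the code below builds the linear interpolation between a start point `x0` and an endpoint `x1`, uses the constant velocity `x1 - x0` as the regression target, and integrates with Euler steps. The function `ideal_velocity` is a hypothetical stand-in for a trained network that has learned a perfectly straight path toward a single known endpoint; the convention of noise at t = 0 and data at t = 1 is also an assumption.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target of the rectified flow objective: constant along the path."""
    return x1 - x0

def ideal_velocity(x, t, x1):
    """Hypothetical stand-in for a trained network on a perfectly straight path."""
    return (x1 - x) / (1.0 - t)

def euler_sample(x0, x1, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # the last evaluation is at t = 1 - dt, so no division by zero
        x = x + dt * ideal_velocity(x, t, x1)
    return x

x0 = np.array([0.0, 2.0])    # toy "noise" sample
x1 = np.array([1.0, -1.0])   # toy "data" sample

print("1 step  :", euler_sample(x0, x1, 1))
print("10 steps:", euler_sample(x0, x1, 10))
```

Because the velocity is constant along a straight path, one Euler step reaches the same endpoint as many small steps. A real model only approximates this, so the number of steps it needs depends on how straight the learned paths are.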

Title: Industrial-scale Prediction of Cement Clinker Phases using Machine Learning

Authors: Sheikh Junaid Fayaz, Nestor Montiel-Bohorquez, Shashank Bishnoi, Matteo Romano, Manuele Gatti, N. M. Anoop Krishnan
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2412.11981
Pdf URL: https://arxiv.org/pdf/2412.11981
Copy Paste: [[2412.11981]] Industrial-scale Prediction of Cement Clinker Phases using Machine Learning(https://arxiv.org/abs/2412.11981)
Keywords: quality assessment
Abstract: Cement production, exceeding 4.1 billion tonnes and contributing 2.4 tonnes of CO2 annually, faces critical challenges in quality control and process optimization. While traditional process models for cement manufacturing are confined to steady-state conditions with limited predictive capability for mineralogical phases, modern plants operate under dynamic conditions that demand real-time quality assessment. Here, exploiting a comprehensive two-year operational dataset from an industrial cement plant, we present a machine learning framework that accurately predicts clinker mineralogy from process data. Our model achieves unprecedented prediction accuracy for major clinker phases while requiring minimal input parameters, demonstrating robust performance under varying operating conditions. Through post-hoc explainable algorithms, we interpret the hierarchical relationships between clinker oxides and phase formation, providing insights into the functioning of an otherwise black-box model. This digital twin framework can potentially enable real-time optimization of cement production, thereby providing a route toward reducing material waste and ensuring quality while reducing the associated emissions under real plant conditions. Our approach represents a significant advancement in industrial process control, offering a scalable solution for sustainable cement manufacturing.
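Summary: Cement production, which exceeds 4.1 billion tonnes per year and emits 2.4 tonnes of CO2 (figures as stated in the source), faces critical challenges in quality control and process optimization. Traditional process models for cement manufacturing are confined to steady-state conditions and have limited predictive capability for mineralogical phases, whereas modern plants operate under dynamic conditions that demand real-time quality assessment. Using a comprehensive two-year operational dataset from an industrial cement plant, we present a machine learning framework that accurately predicts clinker mineralogy from process data. Our model achieves unprecedented prediction accuracy for the major clinker phases while requiring minimal input parameters, and it performs robustly under varying operating conditions. Through post-hoc explainability algorithms, we interpret the hierarchical relationships between clinker oxides and phase formation, offering insight into the workings of an otherwise black-box model. This digital twin framework could enable real-time optimization of cement production, providing a route to reducing material waste and ensuring quality while cutting the associated emissions under real plant conditions. Our approach represents a significant advance in industrial process control and offers a scalable solution for sustainable cement manufacturing.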

Title: A LoRA is Worth a Thousand Pictures

Authors: Chenxi Liu, Towaki Takikawa, Alec Jacobson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12048
Pdf URL: https://arxiv.org/pdf/2412.12048
Copy Paste: [[2412.12048]] A LoRA is Worth a Thousand Pictures(https://arxiv.org/abs/2412.12048)
Keywords: generation
Abstract: Recent advances in diffusion models and parameter-efficient fine-tuning (PEFT) have made text-to-image generation and customization widely accessible, with Low Rank Adaptation (LoRA) able to replicate an artist's style or subject using minimal data and computation. In this paper, we examine the relationship between LoRA weights and artistic styles, demonstrating that LoRA weights alone can serve as an effective descriptor of style, without the need for additional image generation or knowledge of the original training set. Our findings show that LoRA weights yield better performance in clustering of artistic styles compared to traditional pre-trained features, such as CLIP and DINO, with strong structural similarities between LoRA-based and conventional image-based embeddings observed both qualitatively and quantitatively. We identify various retrieval scenarios for the growing collection of customized models and show that our approach enables more accurate retrieval in real-world settings where knowledge of the training images is unavailable and additional generation is required. We conclude with a discussion on potential future applications, such as zero-shot LoRA fine-tuning and model attribution.
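Summary: Recent advances in diffusion models and parameter-efficient fine-tuning (PEFT) have made text-to-image generation and customization widely accessible, with Low Rank Adaptation (LoRA) able to replicate an artist's style or subject using minimal data and computation. In this paper, we study the relationship between LoRA weights and artistic styles, showing that LoRA weights alone can serve as an effective descriptor of style, without additional image generation or knowledge of the original training set. Our findings show that LoRA weights cluster artistic styles better than traditional pre-trained features such as CLIP and DINO, and that LoRA-based and conventional image-based embeddings share strong structural similarities, observed both qualitatively and quantitatively. We identify various retrieval scenarios for the growing collection of customized models and show that our approach enables more accurate retrieval in real-world settings where knowledge of the training images is unavailable and additional generation is required. We conclude with a discussion of potential future applications, such as zero-shot LoRA fine-tuning and model attribution.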
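
To make the claim above concrete, that LoRA weights alone can act as a style descriptor, the following toy sketch compares adapters by cosine similarity of their flattened low-rank updates. It is an assumption-laden illustration, not the paper's pipeline: the adapters are synthetic random matrices, "style" is simulated by perturbing a shared base adapter, and the helper names (`lora_descriptor`, `random_style`, `perturb`) are hypothetical. The descriptor uses the product B @ A rather than the raw factors, because different factor pairs can encode the same update.

```python
import numpy as np

def lora_descriptor(layers):
    """Unit-norm vector built from the low-rank updates B @ A of every layer."""
    flat = np.concatenate([(B @ A).ravel() for A, B in layers])
    return flat / np.linalg.norm(flat)

def random_style(rng, num_layers=2, dim=16, rank=4):
    """Synthetic base adapter standing in for one artistic style."""
    return [
        (rng.standard_normal((rank, dim)), rng.standard_normal((dim, rank)))
        for _ in range(num_layers)
    ]

def perturb(rng, layers, noise=0.05):
    """Simulate a separately trained adapter of the same style."""
    return [
        (A + noise * rng.standard_normal(A.shape),
         B + noise * rng.standard_normal(B.shape))
        for A, B in layers
    ]

rng = np.random.default_rng(0)
style_a, style_b = random_style(rng), random_style(rng)
a1, a2 = perturb(rng, style_a), perturb(rng, style_a)
b1 = perturb(rng, style_b)

same = float(lora_descriptor(a1) @ lora_descriptor(a2))
cross = float(lora_descriptor(a1) @ lora_descriptor(b1))
print(f"same-style similarity:  {same:.3f}")
print(f"cross-style similarity: {cross:.3f}")
```

Descriptors like these could then be passed to any standard clustering or nearest-neighbour routine. Real adapters are far larger, so a dimensionality-reduction step would likely be needed in practice.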

Title: Wonderland: Navigating 3D Scenes from a Single Image

Authors: Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12091
Pdf URL: https://arxiv.org/pdf/2412.12091
Copy Paste: [[2412.12091]] Wonderland: Navigating 3D Scenes from a Single Image(https://arxiv.org/abs/2412.12091)
Keywords: generation
Abstract: This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
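Summary: This paper addresses a challenging question: how can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting representations of scenes in a feed-forward manner. The video diffusion model is designed to create videos that precisely follow specified camera trajectories, so it can generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets show that our model significantly outperforms existing methods for single-view 3D scene generation, particularly on out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built on the latent space of a diffusion model to achieve efficient 3D scene generation.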

Title: CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Authors: Felix Taubner, Ruihang Zhang, Mathieu Tuli, David B. Lindell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12093
Pdf URL: https://arxiv.org/pdf/2412.12093
Copy Paste: [[2412.12093]] CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models(https://arxiv.org/abs/2412.12093)
Keywords: generative
Abstract: Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints: for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.
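Summary: Reconstructing photorealistic, dynamic portrait avatars from images is essential to many applications, including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints: for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may want to animate a single portrait image downloaded from the internet. As a result, the ecosystem of avatar reconstruction methods is large and heterogeneous. Techniques based on multi-view stereo or neural rendering achieve the highest quality but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind multi-view techniques. Here we present CAP4D, an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and to animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps toward closing the gap in visual fidelity between single-image and multi-view reconstruction techniques.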

Title: Causal Diffusion Transformers for Generative Modeling

Authors: Chaorui Deng, Deyao Zh, Kunchang Li, Shi Guan, Haoqi Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12095
Pdf URL: https://arxiv.org/pdf/2412.12095
Copy Paste: [[2412.12095]] Causal Diffusion Transformers for Generative Modeling(https://arxiv.org/abs/2412.12095)
Keywords: generation, generative
Abstract: We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enable a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.
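Summary: We introduce Causal Diffusion as the autoregressive (AR) counterpart of diffusion models. It is a next-token(s) forecasting framework that works with both discrete and continuous modalities and is compatible with existing next-token prediction models such as LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization into a diffusion model can substantially improve its performance and enable a smooth transition between AR and diffusion generation modes. We therefore propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, achieving state-of-the-art results on the ImageNet generation benchmark while retaining the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase its ability to perform zero-shot in-context image manipulation. We hope this work offers the community a fresh perspective on training multimodal models over discrete and continuous data.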