2024-11-28

Title: UVCG: Leveraging Temporal Consistency for Universal Video Protection

Authors: KaiZhou Li, Jindong Gu, Xinchun Yu, Junjie Cao, Yansong Tang, Xiao-Ping Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17746
Pdf URL: https://arxiv.org/pdf/2411.17746
Copy Paste: [[2411.17746]] UVCG: Leveraging Temporal Consistency for Universal Video Protection(https://arxiv.org/abs/2411.17746)
Keywords: generation
Abstract: The security risks of AI-driven video editing have garnered significant attention. Although recent studies indicate that adding perturbations to images can protect them from malicious edits, directly applying image-based methods to perturb each frame in a video becomes ineffective, as video editing techniques leverage the consistency of inter-frame information to restore individually perturbed content. To address this challenge, we leverage the temporal consistency of video content to propose a straightforward and efficient, yet highly effective and broadly applicable approach, Universal Video Consistency Guard (UVCG). UVCG embeds the content of another video (the target video) within a protected video by introducing continuous, imperceptible perturbations that force the encoder of editing models to map continuous inputs to misaligned continuous outputs, thereby inhibiting the generation of videos consistent with the intended textual prompts. Additionally, leveraging the similarity in perturbations between adjacent frames, we improve the computational efficiency of perturbation generation by employing a perturbation-reuse strategy. We applied UVCG across various versions of Latent Diffusion Models (LDM) and assessed its effectiveness and generalizability across multiple LDM-based editing pipelines. The results confirm the effectiveness, transferability, and efficiency of our approach in safeguarding video content from unauthorized modifications.
摘要：人工智能驱动的视频编辑的安全风险引起了广泛关注。尽管最近的研究表明，对图像进行扰动可以保护它们免受恶意编辑，但直接应用基于图像的方法来扰动视频中的每一帧是无效的，因为视频编辑技术利用帧间信息的一致性来恢复单独扰动的内容。为了应对这一挑战，我们利用视频内容的时间一致性提出了一种简单、高效、高效且广泛适用的方法，即通用视频一致性保护 (UVCG)。UVCG 通过引入连续、不可察觉的扰动将另一个视频（目标视频）的内容嵌入受保护的视频中，这种扰动能够迫使编辑模型的编码器将连续输入映射到未对齐的连续输出，从而抑制与预期文本提示一致的视频的生成。此外，利用相邻帧之间扰动的相似性，我们通过采用扰动重用策略来提高扰动生成的计算效率。我们将 UVCG 应用于各种版本的潜在扩散模型 (LDM)，并评估了其在多个基于 LDM 的编辑流程中的有效性和通用性。结果证实了我们的方法在保护视频内容免遭未经授权的修改方面的有效性、可转移性和效率。
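
The UVCG abstract describes two ideas: a bounded perturbation that pulls a frame's encoding toward a target frame's encoding, and reuse of the previous frame's perturbation to save computation. The toy sketch below illustrates both under loud assumptions: the encoder is a stand-in linear map, frames are short vectors, and the optimizer is plain projected sign-gradient descent, which the abstract does not name. All function names and parameters (`pgd`, `eps`, `step`, `iters`, `init`) are hypothetical.

```python
def sign(v):
    """Return -1.0, 0.0 or 1.0 depending on the sign of v."""
    return (v > 0) - (v < 0)

def encode(matrix, x):
    """Stand-in linear 'encoder': matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def latent_gap(matrix, x, delta, target):
    """Difference between the perturbed frame's encoding and the target's encoding."""
    z = encode(matrix, [a + d for a, d in zip(x, delta)])
    zt = encode(matrix, target)
    return [a - b for a, b in zip(z, zt)]

def loss(matrix, x, delta, target):
    """Squared distance between the two encodings."""
    return sum(g * g for g in latent_gap(matrix, x, delta, target))

def pgd(matrix, x, target, eps, step, iters, init=None):
    """Sign-gradient descent on the loss, keeping every entry of delta in [-eps, eps]."""
    delta = list(init) if init is not None else [0.0] * len(x)
    for _ in range(iters):
        gap = latent_gap(matrix, x, delta, target)
        # Gradient of the squared distance with respect to delta: 2 * M^T * gap.
        grad = [2.0 * sum(matrix[k][i] * gap[k] for k in range(len(gap)))
                for i in range(len(x))]
        delta = [max(-eps, min(eps, d - step * sign(g)))
                 for d, g in zip(delta, grad)]
    return delta

identity = [[1.0, 0.0], [0.0, 1.0]]      # toy encoder weights
frame_1, frame_2 = [0.5, 0.5], [0.52, 0.48]  # two adjacent, similar toy frames
target = [0.9, 0.1]                      # toy target frame

delta_1 = pgd(identity, frame_1, target, eps=0.1, step=0.02, iters=10)
# Perturbation reuse: start frame 2 from frame 1's perturbation, with fewer steps.
delta_2_reused = pgd(identity, frame_2, target, eps=0.1, step=0.02, iters=2, init=delta_1)
delta_2_scratch = pgd(identity, frame_2, target, eps=0.1, step=0.02, iters=2)
```

In this toy setting the reused start reaches a lower loss than a cold start with the same two steps, which is the intuition behind reusing perturbations across adjacent frames; it says nothing about the paper's actual savings.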

Title: MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Authors: Rongchang Xie, Chen Du, Ping Song, Chang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17762
Pdf URL: https://arxiv.org/pdf/2411.17762
Copy Paste: [[2411.17762]] MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding(https://arxiv.org/abs/2411.17762)
Keywords: generation
Abstract: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with texture semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
摘要：我们介绍了 MUSE-VL，一种通过语义离散编码实现多模态理解和生成的统一视觉语言模型。最近，研究界开始探索视觉生成和理解的统一模型。然而，现有的视觉标记器（例如 VQGAN）仅考虑低级信息，这使得它难以与纹理语义特征对齐。这导致训练复杂度高，并且需要大量训练数据才能实现最佳性能。此外，它们的性能与专用的理解模型还相差甚远。本文提出了语义离散编码 (SDE)，通过向视觉标记器添加语义约束，有效地对齐视觉标记和语言标记的信息。这大大降低了训练难度并提高了统一模型的性能。所提出的模型在各种视觉语言基准测试中显著超越了之前的最新技术，并且取得了比专用理解模型更好的性能。

Title: Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Authors: Xiang Li, Zixuan Huang, Anh Thai, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17763
Pdf URL: https://arxiv.org/pdf/2411.17763
Copy Paste: [[2411.17763]] Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation(https://arxiv.org/abs/2411.17763)
Keywords: generation, generative
Abstract: Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.
摘要：对称性是视觉世界中普遍存在的基本属性，是感知和结构解释的关键线索。本文研究了从单个 RGB 图像中检测 3D 反射对称性，并揭示了其对单幅图像 3D 生成的显著优势。我们引入了 Reflect3D，这是一种可扩展的零样本对称性检测器，能够稳健地推广到多样化的真实场景。受基础模型成功的启发，我们的方法使用基于变压器的架构扩展了对称性检测。我们还利用多视图扩散模型的生成先验来解决单视图对称性检测中固有的模糊性。对各种数据源的广泛评估表明，Reflect3D 在单幅图像对称性检测中建立了新的最先进技术。此外，我们展示了通过对称感知优化过程将检测到的对称性纳入单幅图像 3D 生成管道的实际好处。对称性的整合显著提高了重建的 3D 几何和纹理的结构准确性、凝聚力和视觉保真度，从而提高了 3D 内容创建的能力。

Title: DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams

Authors: Xinyu Zhang, Lingling Zhang, Yanrui Wu, Muye Huang, Wenjun Wu, Bo Li, Shaowei Wang, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17771
Pdf URL: https://arxiv.org/pdf/2411.17771
Copy Paste: [[2411.17771]] DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams(https://arxiv.org/abs/2411.17771)
Keywords: generation
Abstract: Visual Question Generation (VQG) has gained significant attention due to its potential in educational applications. However, VQG research mainly focuses on natural images, neglecting diagrams in educational materials used to assess students' conceptual understanding. To address this gap, we introduce DiagramQG, a dataset containing 8,372 diagrams and 19,475 questions across various subjects. DiagramQG introduces concept and target text constraints, guiding the model to generate concept-focused questions for educational purposes. Meanwhile, we present the Hierarchical Knowledge Integration framework for Diagram Question Generation (HKI-DQG) as a strong baseline. This framework obtains multi-scale patches of diagrams and acquires knowledge using a visual language model with frozen parameters. It then integrates knowledge, text constraints and patches to generate concept-focused questions. We evaluate the performance of existing VQG models, open-source and closed-source vision-language models, and HKI-DQG on the DiagramQG dataset. Our HKI-DQG outperforms existing methods, demonstrating that it serves as a strong baseline. Furthermore, to assess its generalizability, we apply HKI-DQG to two other VQG datasets of natural images, namely VQG-COCO and K-VQG, achieving state-of-the-art performance. The dataset and code are available at this https URL.
摘要：视觉问题生成 (VQG) 因其在教育应用中的潜力而备受关注。然而，VQG 研究主要集中在自然图像上，而忽略了用于评估学生概念理解的教育材料中的图表。为了解决这一差距，我们引入了 DiagramQG，这是一个包含 8,372 张图表和 19,475 个问题的数据集，涉及各个学科。DiagramQG 引入了概念和目标文本约束，引导模型生成以概念为中心的教育问题。同时，我们提出了用于图表问题生成的分层知识集成框架 (HKI-DQG) 作为强大的基线。该框架获得多尺度图表块并使用具有冻结参数的视觉语言模型获取知识。然后，它整合知识、文本约束和块以生成以概念为中心的问题。我们在 DiagramQG 数据集上评估了现有 VQG 模型、开源和闭源视觉语言模型以及 HKI-DQG 的性能。我们的 HKI-DQG 优于现有方法，表明它可以作为强大的基线。此外，为了评估其普遍性，我们将 HKI-DQG 应用于另外两个自然图像 VQG 数据集，即 VQG-COCO 和 K-VQG，实现了最先进的性能。数据集和代码可在此 https URL 上获得。

Title: MVBoost: Boost 3D Reconstruction with Multi-View Refinement

Authors: Xiangyu Liu, Xiaomei Zhang, Zhiyuan Ma, Xiangyu Zhu, Zhen Lei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17772
Pdf URL: https://arxiv.org/pdf/2411.17772
Copy Paste: [[2411.17772]] MVBoost: Boost 3D Reconstruction with Multi-View Refinement(https://arxiv.org/abs/2411.17772)
Keywords: generation
Abstract: Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.
摘要：3D 物体重建领域近年来取得了显著进展，但当前的大多数 3D 模型都严重依赖于现有的 3D 数据集。多样化 3D 数据集的稀缺导致 3D 重建模型的泛化能力有限。在本文中，我们提出了一种通过生成伪 GT 数据来促进多视图细化 3D 重建的新框架 (MVBoost)。MVBoost 的关键是结合多视图生成模型的高精度和 3D 重建模型的一致性优势来创建可靠的数据源。具体而言，给定一个单视图输入图像，我们采用多视图扩散模型生成多个视图，然后使用大型 3D 重建模型来生成一致的 3D 数据。然后，MVBoost 自适应地细化这些从一致的 3D 数据渲染的多视图图像，以构建一个用于训练前馈 3D 重建模型的大规模多视图数据集。此外，输入视图优化旨在根据用户的输入图像优化相应的视点，确保最重要的视点能够准确地满足用户的需求。大量评估表明，与之前的研究相比，我们的方法实现了更优的重建结果和稳健的泛化能力。

Title: Diffusion Autoencoders for Few-shot Image Generation in Hyperbolic Space

Authors: Lingxiao Li, Kaixuan Fan, Boqing Gong, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17784
Pdf URL: https://arxiv.org/pdf/2411.17784
Copy Paste: [[2411.17784]] Diffusion Autoencoders for Few-shot Image Generation in Hyperbolic Space(https://arxiv.org/abs/2411.17784)
Keywords: generation
Abstract: Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images and texts from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying semantic codes or guided by textual instructions. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a superior balance between quality and diversity with limited data and offers a highly controllable and interpretable generation process.
摘要：少量样本图像生成旨在仅根据该类别中的少数示例为未知类别生成多样化且高质量的图像。然而，现有方法通常会在图像质量和多样性之间做出权衡，同时对新生成图像的属性的控制有限。在这项工作中，我们提出了双曲扩散自动编码器 (HypDAE)，这是一种在双曲空间中运行的新方法，用于捕获已知类别中图像和文本之间的层次关系。通过利用预先训练的基础模型，HypDAE 通过改变语义代码或由文本指令引导，为未知类别生成具有卓越质量的多样化新图像。最重要的是，双曲表示通过调整双曲盘内的半径引入了对语义多样性的额外控制程度。大量实验和可视化表明，HypDAE 通过在有限的数据下实现质量和多样性之间的卓越平衡，显著优于以前的方法，并提供了高度可控和可解释的生成过程。
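
The HypDAE abstract attributes its extra control over semantic diversity to adjusting radii within the hyperbolic disk. A minimal sketch of what "adjusting the radius" of a latent code can mean, assuming the Poincaré disk model with curvature -1 (the abstract does not say which model or curvature is used). The function names and the 2-D code are hypothetical, and which direction of the radius corresponds to more diversity is not stated in the abstract.

```python
import math

def hyperbolic_radius(point):
    """Hyperbolic distance from the origin for a point inside the unit Poincare disk."""
    r = math.hypot(*point)
    if r >= 1.0:
        raise ValueError("point must lie strictly inside the unit disk")
    return 2.0 * math.atanh(r)

def set_hyperbolic_radius(point, target):
    """Slide a point along its ray from the origin to the given hyperbolic radius."""
    r = math.hypot(*point)
    if r == 0.0:
        raise ValueError("direction is undefined at the origin")
    scale = math.tanh(target / 2.0) / r
    return tuple(c * scale for c in point)

code = (0.3, 0.4)                                  # hypothetical latent code, Euclidean norm 0.5
near_boundary = set_hyperbolic_radius(code, 4.0)   # same direction, large hyperbolic radius
near_origin = set_hyperbolic_radius(code, 0.5)     # same direction, small hyperbolic radius
```

The direction of the code is preserved while its distance from the origin changes, so the radius acts as a single scalar knob that is independent of the direction.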

Title: DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Authors: Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, Enrico Magli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17786
Pdf URL: https://arxiv.org/pdf/2411.17786
Copy Paste: [[2411.17786]] DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching(https://arxiv.org/abs/2411.17786)
Keywords: generation, generative
Abstract: Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.
摘要：个性化图像生成需要文本到图像生成模型，该模型可以捕获参考主题的核心特征，从而允许在不同上下文中进行受控生成。现有方法面临着挑战，因为训练要求复杂、推理成本高、灵活性有限或这些问题的组合。在本文中，我们介绍了 DreamCache，这是一种可扩展的方法，用于高效、高质量的个性化图像生成。通过从预训练扩散去噪器的层子集和单个时间步长中缓存少量参考图像特征，DreamCache 可以通过轻量级、经过训练的条件适配器动态调节生成的图像特征。DreamCache 实现了最先进的图像和文本对齐，使用的额外参数少了一个数量级，并且比现有模型具有更高的计算效率和通用性。

Title: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17787
Pdf URL: https://arxiv.org/pdf/2411.17787
Copy Paste: [[2411.17787]] Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient(https://arxiv.org/abs/2411.17787)
Keywords: generation
Abstract: In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at this https URL
摘要：在快速发展的图像生成领域，视觉自回归 (VAR) 建模因其创新的下一代预测方法而备受关注。该范式在效率、可扩展性和零样本泛化方面提供了显着的改进。然而，VAR 固有的由粗到细的性质引入了延长的标记序列，导致内存消耗过高和计算冗余。为了解决这些瓶颈，我们提出了协作解码 (CoDe)，这是一种为 VAR 框架量身定制的新型高效解码策略。CoDe 利用了两个关键观察结果：较大尺度上参数需求大幅减少以及不同尺度上独有的生成模式。基于这些见解，我们将多尺度推理过程划分为大型模型和小型模型之间的无缝协作。大型模型充当“起草者”，专门用于生成较小尺度的低频内容，而小型模型充当“细化者”，仅专注于预测较大尺度上的高频细节。此次合作带来了显著的效率，同时对质量的影响却微乎其微：CoDe 实现了 1.7 倍的加速，内存使用量减少了约 50%，并且保持了图像质量，FID 仅从 1.95 增加到 1.98，几乎可以忽略不计。当绘图步骤进一步减少时，CoDe 可以实现令人印象深刻的 2.9 倍加速比，在单个 NVIDIA 4090 GPU 上以 256x256 分辨率达到 41 张图像/秒，同时保持了值得称赞的 2.27 FID。代码可从此 https URL 获取
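
The CoDe abstract splits coarse-to-fine decoding between a large "drafter" at the smaller scales and a small "refiner" at the larger scales. The sketch below only illustrates why such a split can pay off: with a coarse-to-fine schedule, most tokens sit at the larger scales. The schedule of token-map side lengths and the choice of six drafting steps are illustrative assumptions, not values taken from the abstract, and the helper names are hypothetical.

```python
def split_scales(scales, draft_steps):
    """Assign the first `draft_steps` scales to the large model, the rest to the small one."""
    if not 0 <= draft_steps <= len(scales):
        raise ValueError("draft_steps out of range")
    return scales[:draft_steps], scales[draft_steps:]

def token_count(scales):
    """Number of tokens if each scale s contributes an s x s token map."""
    return sum(s * s for s in scales)

# Hypothetical coarse-to-fine schedule of token-map side lengths.
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
drafter_scales, refiner_scales = split_scales(scales, draft_steps=6)

drafter_tokens = token_count(drafter_scales)
refiner_tokens = token_count(refiner_scales)
refiner_share = refiner_tokens / token_count(scales)
```

Under these assumed numbers the refiner handles the large majority of tokens, so running those scales on a smaller model is where savings would come from.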

Title: Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

Authors: Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17790
Pdf URL: https://arxiv.org/pdf/2411.17790
Copy Paste: [[2411.17790]] Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors(https://arxiv.org/abs/2411.17790)
Keywords: generative
Abstract: Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.
摘要：内窥镜中的精确 3D 映射能够定量、整体地表征胃肠道 (GI) 内的病变，这需要可靠的深度和姿势估计。然而，内窥镜系统是单目系统，现有的依赖于合成数据集或复杂模型的方法在具有挑战性的内窥镜条件下往往缺乏通用性。我们提出了一个强大的自监督单目深度和姿势估计框架，该框架结合了生成潜在库和变分自动编码器 (VAE)。生成潜在库利用来自自然图像的大量深度场景来调节深度网络，通过潜在特征先验增强深度预测的真实性和稳健性。对于姿势估计，我们在 VAE 框架内对其进行了重新表述，将姿势转换视为潜在变量以规范比例、稳定 z 轴突出度并提高 x-y 灵敏度。这种双重细化管道能够实现准确的深度和姿势预测，有效解决胃肠道复杂的纹理和光照问题。对 SimCol 和 EndoSLAM 数据集的广泛评估证实了我们的框架在内窥镜深度和姿势估计方面比已发布的自监督方法具有更优异的性能。

Title: Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17799
Pdf URL: https://arxiv.org/pdf/2411.17799
Copy Paste: [[2411.17799]] Signs as Tokens: An Autoregressive Multilingual Sign Language Generator(https://arxiv.org/abs/2411.17799)
Keywords: generation
Abstract: Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at this https URL.
摘要：手语是一种视觉语言，涵盖了自然语言的所有语言特征，是聋人和听力障碍者的主要交流方式。虽然许多研究已经成功地将预训练语言模型 (LM) 用于手语翻译 (手语到文本)，并从其语言特征中汲取灵感，但手语生成 (SLG，文本到手语) 的逆向任务仍未得到充分探索。大多数现有方法将 SLG 视为视觉内容生成任务，采用扩散模型等技术根据文本输入制作手势视频、2D 关键点或 3D 头像，而忽略了手语的语言属性。在这项工作中，我们引入了一个多语言手语模型 Signs as Tokens (SOKE)，它可以使用预训练的 LM 从文本输入自回归地生成 3D 手势头像。为了将手语与 LM 对齐，我们开发了一个解耦标记器，将连续手势离散化为代表各种身体部位的标记序列。这些手势标记被集成到 LM 的原始文本词汇表中，从而允许对手语数据集进行监督微调。为了促进多语言 SLG 研究，我们进一步整理了一个大规模中国手语数据集 CSL-Daily，其中包含高质量的 3D 姿势注释。广泛的定性和定量评估证明了 SOKE 的有效性。项目页面可在此 https URL 上找到。

Title: From memorization to generalization: a theoretical framework for diffusion-based generative models

Authors: Indranil Halder
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.17807
Pdf URL: https://arxiv.org/pdf/2411.17807
Copy Paste: [[2411.17807]] From memorization to generalization: a theoretical framework for diffusion-based generative models(https://arxiv.org/abs/2411.17807)
Keywords: generative
Abstract: Diffusion-based generative models demonstrate a transition from memorizing the training dataset to a non-memorization regime as the size of the training set increases. Here, we begin by introducing a mathematically precise definition of this transition in terms of a relative distance: the model is said to be in the non-memorization/'generalization' regime if the generated distribution is almost surely far from the probability distribution associated with a Gaussian kernel approximation to the training dataset, relative to the sampling distribution. Then, we develop an analytically tractable diffusion model and establish a lower bound on the Kullback-Leibler divergence between the generated and sampling distributions. The model also features the transition, according to our definition in terms of the relative distance, when the training data is sampled from an isotropic Gaussian distribution. Further, our study reveals that this transition occurs when the individual distance between the generated and underlying sampling distribution begins to decrease with the addition of more training samples. This is to be contrasted with an alternative scenario, where the model's memorization performance degrades, but generalization performance doesn't improve. We also provide empirical evidence indicating that realistic diffusion models exhibit the same alignment of scales.
摘要：基于扩散的生成模型展示了随着训练集规模的增加，从记忆训练数据集到非记忆模式的转变。在这里，我们首先从相对距离的角度引入这种转变的数学精确定义：如果生成的分布相对于采样分布几乎肯定远离与训练数据集的高斯核近似相关的概率分布，则该模型被称为处于非记忆/“泛化”模式。然后，我们开发了一个可分析处理的扩散模型，并建立了生成分布和采样分布之间 Kullback-Leibler 散度的下限。根据我们对相对距离的定义，当训练数据从各向同性的高斯分布中采样时，该模型还具有转变。此外，我们的研究表明，当生成分布和底层采样分布之间的个体距离随着更多训练样本的增加而开始减小时，就会发生这种转变。这与另一种情况形成了对比，在另一种情况下，模型的记忆性能会下降，但泛化性能不会提高。我们还提供了经验证据表明，现实的扩散模型表现出相同的尺度一致性。
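
The definition in this abstract refers to a Gaussian kernel approximation to the training dataset. As a rough illustration only, the sketch below evaluates such an approximation in one dimension as an equal-weight mixture of Gaussians centered on the training points. The 1-D setting, the `bandwidth` parameter, and the function name are assumptions for illustration; the paper's relative-distance criterion itself is not reproduced here.

```python
import math

def gaussian_kernel_density(x, train_points, bandwidth):
    """Density at x of an equal-weight mixture of 1-D Gaussians centered on the training points."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in train_points)
    return norm * total / len(train_points)

train = [-1.0, 0.0, 2.0]   # toy 1-D training set
at_sample = gaussian_kernel_density(0.0, train, bandwidth=0.1)   # on a training point
between = gaussian_kernel_density(1.0, train, bandwidth=0.1)     # away from all training points
```

With a narrow kernel, the density is concentrated on the training points themselves, which is the kind of distribution a purely memorizing model would stay close to.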

Title: Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation

Authors: Sudarshan Rajagopalan, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17814
Pdf URL: https://arxiv.org/pdf/2411.17814
Copy Paste: [[2411.17814]] Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation(https://arxiv.org/abs/2411.17814)
Keywords: restoration
Abstract: All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. As a result, they often struggle with out-of-distribution (OoD) samples or unseen degradations, which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation.
摘要：全天候图像恢复 (AWIR) 对于在恶劣天气条件下实现可靠的自主导航至关重要。AWIR 模型经过训练，可以应对特定的天气条件，例如雾、雨和雪。但这导致它们经常难以处理分布外 (OoD) 样本或看不见的退化，从而限制了它们在现实世界自主导航中的有效性。为了克服这个问题，现有模型必须重新训练或微调，这两种方法都效率低下且不切实际，重新训练需要访问大型数据集，而微调涉及许多参数。在本文中，我们建议使用低秩自适应 (LoRA) 来有效地将预训练的全天候模型适应新的天气恢复任务。此外，我们观察到 LoRA 降低了适应模型在预训练恢复任务上的性能。为了解决这个问题，我们引入了一种基于 LoRA 的微调方法，称为 LoRA-Align (LoRA-A)，该方法旨在使用奇异值分解 (SVD) 对齐微调和预训练权重矩阵的奇异向量。这种对齐有助于保留模型对其原始任务的了解，同时使其适应未知任务。我们表明，使用 LoRA 和 LoRA-A 恢复的图像可有效用于自主导航中的计算机视觉任务，例如语义分割和深度估计。
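
The abstract contrasts full fine-tuning, which involves many parameters, with Low-Rank Adaptation. A minimal sketch of the standard LoRA update, W + (alpha / r) * B A, and of the resulting trainable-parameter count follows. The tiny matrices, the layer size of 1024, and the rank of 8 are illustrative assumptions; the singular-vector alignment step of LoRA-Align is not shown because the abstract does not give its details.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, with W frozen and only B, A trainable."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, rank):
    """Trainable parameter counts for full fine-tuning versus a rank-`rank` LoRA update."""
    return d_out * d_in, rank * (d_out + d_in)

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen 2x3 weight (toy values)
b = [[1.0], [2.0]]                       # 2x1 low-rank factor
a = [[0.5, 0.0, -0.5]]                   # 1x3 low-rank factor
adapted = lora_weight(w, b, a, alpha=2.0, rank=1)

full, lora = trainable_params(1024, 1024, 8)   # hypothetical layer size and rank
```

For the assumed 1024 x 1024 layer, the rank-8 update trains 64 times fewer parameters than full fine-tuning of that layer.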

Title: SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation

Authors: Ximing Xing, Qian Yu, Chuang Wang, Haitao Zhou, Jing Zhang, Dong Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17832
Pdf URL: https://arxiv.org/pdf/2411.17832
Copy Paste: [[2411.17832]] SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation(https://arxiv.org/abs/2411.17832)
Keywords: generation
Abstract: Recently, text-guided scalable vector graphics (SVG) synthesis has demonstrated significant potential in domains such as iconography and sketching. However, SVGs generated from existing Text-to-SVG methods often lack editability and exhibit deficiencies in visual quality and diversity. In this paper, we propose a novel text-guided vector graphics synthesis method to address these limitations. To improve the diversity of output SVGs, we present a Vectorized Particle-based Score Distillation (VPSD) approach. VPSD addresses over-saturation issues in existing methods and enhances sample diversity. A pre-trained reward model is incorporated to re-weight vector particles, improving aesthetic appeal and enabling faster convergence. Additionally, we design a novel adaptive vector primitives control strategy, which allows for the dynamic adjustment of the number of primitives, thereby enhancing the presentation of graphic details. Extensive experiments validate the effectiveness of the proposed method, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. We also show that our new method supports up to six distinct vector styles, capable of generating high-quality vector assets suitable for stylized vector design and poster design.
摘要：最近，文本引导的可缩放矢量图形 (SVG) 合成在图像和素描等领域表现出巨大的潜力。然而，现有的文本到 SVG 方法生成的 SVG 通常缺乏可编辑性，并且在视觉质量和多样性方面表现出不足。在本文中，我们提出了一种新颖的文本引导矢量图形合成方法来解决这些限制。为了提高输出 SVG 的多样性，我们提出了一种基于矢量化粒子的分数蒸馏 (VPSD) 方法。VPSD 解决了现有方法中的过饱和问题并增强了样本多样性。结合预先训练的奖励模型来重新加权矢量粒子，提高美感并实现更快的收敛。此外，我们设计了一种新颖的自适应矢量基元控制策略，允许动态调整基元数量，从而增强图形细节的呈现。大量实验验证了所提方法的有效性，证明了其在可编辑性、视觉质量和多样性方面优于基线方法。我们还表明，我们的新方法支持多达六种不同的矢量样式，能够生成适合风格化矢量设计和海报设计的高质量矢量资产。

Title: Generative Image Layer Decomposition with Visual Effects

Authors: Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17864
Pdf URL: https://arxiv.org/pdf/2411.17864
Copy Paste: [[2411.17864]] Generative Image Layer Decomposition with Visual Effects(https://arxiv.org/abs/2411.17864)
Keywords: generative
Abstract: Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose LayerDecomp, a generative framework for image layer decomposition that outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss that forces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing. The project page is this https URL.
摘要：大型生成模型（尤其是基于扩散的方法）的最新进展显著增强了图像编辑的能力。然而，实现对图像合成任务的精确控制仍然是一个挑战。分层表示允许独立编辑图像组件，对于用户驱动的内容创建至关重要，但现有方法通常难以将图像分解为具有准确保留的透明视觉效果（如阴影和反射）的合理层。我们提出了 LayerDecomp，这是一个用于图像层分解的生成框架，可输出具有忠实保留的视觉效果的逼真干净背景和高质量透明前景。为了实现有效的训练，我们首先引入了一个数据集准备管道，该管道可自动扩展具有合成视觉效果的模拟多层数据。为了进一步增强现实世界的适用性，我们用包含自然视觉效果的相机捕获图像补充了这个模拟数据集。此外，我们提出了一种一致性损失，当没有真实注释时，它强制模型学习透明前景层的准确表示。我们的方法在层分解方面取得了卓越的质量，在多个基准和多个用户研究中，在对象移除和空间编辑任务中的表现优于现有方法，为逐层图像编辑解锁了各种创意可能性。项目页面是这个 https URL。

Title: Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2411.17911
Pdf URL: https://arxiv.org/pdf/2411.17911
Copy Paste: [[2411.17911]] Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey(https://arxiv.org/abs/2411.17911)
Keywords: generation, generative
Abstract: In recent years, deepfakes (DFs) have been utilized for malicious purposes, such as individual impersonation, misinformation spreading, and artists' style imitation, raising questions about ethical and security concerns. However, existing surveys have focused on the accuracy of passive DF detection approaches for single modalities, such as image, video or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to include generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmark, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.
摘要：近年来，深度伪造 (DF) 已被用于恶意目的，例如个人冒充、传播错误信息和模仿艺术家的风格，这引发了有关道德和安全问题的质疑。然而，现有的调查主要集中于被动 DF 检测方法对单一模态（例如图像、视频或音频）的准确性性能。这项全面的调查探讨了跨多种模态（包括图像、视频、音频和多模态域）的被动方法，并将我们的讨论扩展到检测准确性之外，包括泛化、稳健性、归因和可解释性。此外，我们还讨论了被动方法的威胁模型，包括潜在的对抗策略和不同级别的对手知识和能力。我们还强调了 DF 检测中当前的挑战，包括不同生成模型之间缺乏泛化、需要全面的可信度评估以及现有多模态方法的局限性。最后，我们提出了未来的研究方向，以解决被动 DF 检测领域中这些尚未探索和正在出现的问题，例如自适应学习、动态基准、整体可信度评估和用于说话脸部视频生成的多模态检测器。

Title: ROICtrl: Boosting Instance Control for Visual Generation

Authors: Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17949
Pdf URL: https://arxiv.org/pdf/2411.17949
Copy Paste: [[2411.17949]] ROICtrl: Boosting Instance Control for Visual Generation(https://arxiv.org/abs/2411.17949)
Keywords: generation
Abstract: Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
摘要：自然语言通常难以将位置和属性信息与多个实例准确地关联起来，这限制了当前基于文本的视觉生成模型只能包含少数主要实例的简单组合。为了解决这一限制，这项工作通过引入区域实例控制来增强扩散模型，其中每个实例由与自由形式标题配对的边界框控制。该领域的先前方法通常依靠隐式位置编码或显式注意力掩码来分离感兴趣区域 (ROI)，从而导致坐标注入不准确或计算开销较大。受对象检测中 ROI-Align 的启发，我们引入了一个称为 ROI-Unpool 的互补操作。ROI-Align 和 ROI-Unpool 相结合，可在高分辨率特征图上实现明确、高效和准确的 ROI 操作，以用于视觉生成。在 ROI-Unpool 的基础上，我们提出了 ROICtrl，这是一个预训练扩散模型的适配器，可实现精确的区域实例控制。 ROICtrl 与社区微调扩散模型兼容，也与现有的基于空间的附加组件（例如 ControlNet、T2I-Adapter）和基于嵌入的附加组件（例如 IP-Adapter、ED-LoRA）兼容，将其应用扩展到多实例生成。实验表明，ROICtrl 在区域实例控制方面取得了优异的性能，同时显著降低了计算成本。

Title: Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery

Authors: Zhenyu Yu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17973
Pdf URL: https://arxiv.org/pdf/2411.17973
Copy Paste: [[2411.17973]] Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery(https://arxiv.org/abs/2411.17973)
Keywords: generative
Abstract: The forest serves as the most significant terrestrial carbon stock mechanism, effectively reducing atmospheric CO$_2$ concentrations and mitigating climate change. Remote sensing provides high data accuracy and enables large-scale observations. Optical images facilitate long-term monitoring, which is crucial for future carbon stock estimation studies. This study focuses on Huize County, Qujing City, Yunnan Province, China, utilizing GF-1 WFV satellite imagery. The KD-VGG and KD-UNet modules were introduced for initial feature extraction, and the improved implicit diffusion model (IIDM) was proposed. The results showed: (1) The VGG module improved initial feature extraction, improving accuracy, and reducing inference time with optimized model parameters. (2) The Cross-attention + MLPs module enabled effective feature fusion, establishing critical relationships between global and local features, achieving high-accuracy estimation. (3) The IIDM model, a novel contribution, demonstrated the highest estimation accuracy with an RMSE of 12.17%, significantly improving by 41.69% to 42.33% compared to the regression model. In carbon stock estimation, the generative model excelled in extracting deeper features, significantly outperforming other models, demonstrating the feasibility of AI-generated content in quantitative remote sensing. The 16-meter resolution estimates provide a robust basis for tailoring forest carbon sink regulations, enhancing regional carbon stock management.
摘要：森林是陆地最重要的碳储量机制，有效降低大气中的二氧化碳浓度，缓解气候变化。遥感数据精度高，可进行大规模观测，光学图像有利于长期监测，这对未来的碳储量估算研究至关重要。本研究以中国云南省曲靖市会泽县为研究对象，利用GF-1 WFV卫星影像。引入KD-VGG和KD-UNet模块进行初始特征提取，提出改进的隐式扩散模型（IIDM）。结果表明：（1）VGG模块改进了初始特征提取，提高了准确率，并通过优化模型参数减少了推理时间。（2）Cross-attention + MLPs模块实现了有效的特征融合，建立了全局特征和局部特征之间的关键关系，实现了高精度估计。（3）本文提出的IIDM模型具有最高的估算精度，RMSE为12.17%，与回归模型相比，显著提高了41.69%至42.33%。在碳储量估算中，生成模型在提取更深层次的特征方面表现出色，显著优于其他模型，证明了人工智能生成内容在定量遥感中的可行性。16米分辨率的估算为制定森林碳汇法规、加强区域碳储量管理提供了坚实的基础。

Title: Vision Mamba Distillation for Low-resolution Fine-grained Image Classification

Authors: Yao Chen, Jiabao Wang, Peichao Wang, Rui Zhang, Yang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17980
Pdf URL: https://arxiv.org/pdf/2411.17980
Copy Paste: [[2411.17980]] Vision Mamba Distillation for Low-resolution Fine-grained Image Classification(https://arxiv.org/abs/2411.17980)
Keywords: super-resolution
Abstract: Low-resolution fine-grained image classification has recently made significant progress, largely thanks to the super-resolution techniques and knowledge distillation methods. However, these approaches lead to an exponential increase in the number of parameters and computational complexity of models. In order to solve this problem, in this letter, we propose a Vision Mamba Distillation (ViMD) approach to enhance the effectiveness and efficiency of low-resolution fine-grained image classification. Concretely, a lightweight super-resolution vision Mamba classification network (SRVM-Net) is proposed to improve its capability for extracting visual features by redesigning the classification sub-network with Mamba modeling. Moreover, we design a novel multi-level Mamba knowledge distillation loss boosting the performance, which can transfer prior knowledge obtained from a High-resolution Vision Mamba classification Network (HRVM-Net) as a teacher into the proposed SRVM-Net as a student. Extensive experiments on seven public fine-grained classification datasets related to benchmarks confirm our ViMD achieves a new state-of-the-art performance. While having higher accuracy, ViMD outperforms similar methods with fewer parameters and FLOPs, which is more suitable for embedded device applications. Code is available at this https URL.
摘要：低分辨率细粒度图像分类最近取得了重大进展，这主要归功于超分辨率技术和知识蒸馏方法。然而，这些方法导致模型的参数数量和计算复杂度呈指数级增长。为了解决这个问题，在本文中，我们提出了一种视觉曼巴蒸馏（ViMD）方法来提高低分辨率细粒度图像分类的有效性和效率。具体来说，提出了一种轻量级超分辨率视觉曼巴分类网络（SRVM-Net），通过使用 Mamba 建模重新设计分类子网络来提高其提取视觉特征的能力。此外，我们设计了一种新颖的多级 Mamba 知识蒸馏损失来提高性能，它可以将从高分辨率视觉曼巴分类网络（HRVM-Net）获得的先验知识作为老师转移到所提出的 SRVM-Net 作为学生。在与基准相关的七个公共细粒度分类数据集上进行的大量实验证实了我们的 ViMD 实现了新的最先进的性能。 ViMD 在具有更高准确度的同时，还以更少的参数和 FLOP 超越了类似方法，更适合嵌入式设备应用。代码可在此 https URL 上获取。
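
Note: the teacher-to-student transfer mentioned in the abstract above can be illustrated with a generic knowledge-distillation loss. The sketch below is not the paper's multi-level Mamba distillation loss; it assumes the common temperature-softened formulation (KL divergence between teacher and student distributions, scaled by T^2). The function names and the default temperature are hypothetical choices for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic formulation assumed for illustration; not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(p_teacher, p_student)
        if p > 0.0  # terms with zero teacher probability contribute nothing
    )
    return (temperature ** 2) * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits
    student = [1.0, 1.0, 0.0]   # hypothetical student logits
    print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge; in practice it is usually combined with the ordinary classification loss on ground-truth labels.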

Title: Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Authors: Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, Yujun Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18000
Pdf URL: https://arxiv.org/pdf/2411.18000
Copy Paste: [[2411.18000]] Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models(https://arxiv.org/abs/2411.18000)
Keywords: generation
Abstract: Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming existing methods by margins of 34.37% and 12.77% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs' safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.
摘要：尽管从底层语言模型继承了安全措施，视觉语言模型 (VLM) 仍然可能容易受到安全对齐问题的影响。通过实证分析，我们发现了两个关键发现：场景匹配的图像可以显著放大有害输出，并且与基于梯度的攻击中的常见假设相反，最小损失值并不能保证最佳攻击效果。基于这些见解，我们引入了 MLAI（多损失对抗图像），这是一种新颖的越狱框架，它利用场景感知图像生成进行语义对齐，利用平坦极小值理论进行稳健的对抗图像选择，并采用多图像协作攻击来增强有效性。大量实验证明了 MLAI 的显著影响，在 MiniGPT-4 上的攻击成功率达到 77.75%，在 LLaVA-2 上的攻击成功率达到 82.80%，分别比现有方法高出 34.37% 和 12.77%。此外，MLAI 显示出对商业黑盒 VLM 的相当大的可转移性，成功率高达 60.11%。我们的工作揭示了当前 VLM 安全机制中的基本视觉漏洞，并强调了加强防御的必要性。警告：本文包含可能有害的示例文本。

Title: HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Authors: Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Khoa Luu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18042
Pdf URL: https://arxiv.org/pdf/2411.18042
Copy Paste: [[2411.18042]] HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation(https://arxiv.org/abs/2411.18042)
Keywords: generation
Abstract: Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
摘要：多模态 LLM 具有先进的视觉语言任务，但仍难以理解视频场景。为了弥补这一差距，出现了视频场景图生成 (VidSGG) 来捕获视频帧之间的多对象关系。然而，之前的方法依赖于成对连接，限制了它们处理复杂的多对象交互和推理的能力。为此，我们提出了场景超图上的多模态 LLM (HyperGLM)，以促进对多向交互和高阶关系的推理。我们的方法独特地将捕获对象之间空间关系的实体场景图与模拟其因果转换的程序图相结合，形成统一的超图。值得注意的是，HyperGLM 通过将这个统一的超图注入 LLM 来实现推理。此外，我们引入了一个新的视频场景图推理 (VSGR) 数据集，其中包含来自第三人称、自我中心和无人机视图的 190 万帧，并支持五项任务：场景图生成、场景图预测、视频问答、视频字幕和关系推理。从经验上看，HyperGLM 在五项任务中的表现始终优于最先进的方法，能够有效地建模和推理不同视频场景中的复杂关系。
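
Note: the difference between pairwise connections and multi-way interactions mentioned in the abstract above can be illustrated with a small hypergraph container, in which a single hyperedge may link any number of entities. This is a minimal sketch and not HyperGLM's actual data structure; the class name, method names, and the example entities are all hypothetical.

```python
from collections import defaultdict


class HyperGraph:
    """Toy hypergraph: each named hyperedge connects an arbitrary set of nodes."""

    def __init__(self):
        self.edges = {}                    # edge name -> frozenset of member nodes
        self.incidence = defaultdict(set)  # node -> names of edges containing it

    def add_edge(self, name, nodes):
        """Register a hyperedge over any number of nodes (two or more in practice)."""
        members = frozenset(nodes)
        self.edges[name] = members
        for node in members:
            self.incidence[node].add(name)

    def neighbors(self, node):
        """All nodes that share at least one hyperedge with `node`."""
        result = set()
        for name in self.incidence.get(node, ()):
            result |= self.edges[name]
        result.discard(node)
        return result


if __name__ == "__main__":
    g = HyperGraph()
    # One three-way interaction; a pairwise graph would need three separate edges
    # and would lose the fact that they belong to the same event.
    g.add_edge("passing", ["person", "ball", "teammate"])
    g.add_edge("standing_on", ["person", "field"])
    print(sorted(g.neighbors("person")))
```

Storing the interaction as one hyperedge keeps the participants of an event grouped together, which is the property that pairwise scene graphs lack.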

Title: PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion

Authors: Gwanghyun Kim, Suh Yoon Jeon, Seunggyu Lee, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18068
Pdf URL: https://arxiv.org/pdf/2411.18068
Copy Paste: [[2411.18068]] PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion(https://arxiv.org/abs/2411.18068)
Keywords: generation
Abstract: Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis. Project page: this https URL
摘要：个性化图像生成已经取得了显著的进步，可以创建高度逼真和定制的图像。然而，现有的方法往往由于遮挡而难以生成多人图像，并且无法准确地个性化全身形状。在本文中，我们提出了 PersonaCraft，这是一种将扩散模型与 3D 人体建模相结合的新方法，以解决这些限制。我们的方法通过将 3D 感知姿势调节与 SMPLx-ControlNet 相结合来有效地管理遮挡，并通过 SMPLx 拟合准确地个性化人体全身形状。此外，PersonaCraft 支持用户定义的体形调整，为个人身体定制增加了灵活性。实验结果表明，PersonaCraft 在生成多个人的高质量逼真图像的同时解决了遮挡问题，从而为多人个性化图像合成建立了新的标准。项目页面：此 https URL

Title: Training Data Synthesis with Difficulty Controlled Diffusion Model

Authors: Zerun Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18109
Pdf URL: https://arxiv.org/pdf/2411.18109
Copy Paste: [[2411.18109]] Training Data Synthesis with Difficulty Controlled Diffusion Model(https://arxiv.org/abs/2411.18109)
Keywords: generative
Abstract: Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources with low costs. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. Therefore, it is becoming inevitable to include existing synthetic images in the unlabeled data for SSL. How this kind of contamination will affect SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images for SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and found that they struggled to benefit from unlabeled synthetic images and were sometimes even negatively affected. To this end, we propose RSMatch, a novel SSL method specifically designed to handle the challenges of RS-SSL. RSMatch effectively identifies unlabeled synthetic data and further utilizes them for improvement. Extensive experimental results show that RSMatch can turn synthetic unlabeled data from 'obstacles' into 'resources.' The effectiveness is further verified through ablation studies and visualization.
摘要：半监督学习 (SSL) 可以利用未标记图像来提高模型性能，这些图像可以从公共图像源以低成本收集。近年来，由于生成模型的快速发展，合成图像在公共图像源中变得越来越普遍。因此，将现有的合成图像纳入 SSL 的未标记数据中已变得不可避免。这种污染将如何影响 SSL 仍未得到探索。在本文中，我们引入了一项新任务，即真实合成混合 SSL (RS-SSL)，以研究合成图像污染的未标记数据对 SSL 的影响。首先，我们建立了一个新的 RS-SSL 基准来评估当前的 SSL 方法，发现它们很难通过未标记的合成图像进行改进，有时甚至受到负面影响。为此，我们提出了 RSMatch，一种专门用于处理 RS-SSL 挑战的新型 SSL 方法。RSMatch 有效地识别未标记的合成数据并进一步利用它们进行改进。大量实验结果表明，RSMatch 可以将合成的未标记数据从“障碍”转变为“资源”。通过消融研究和可视化进一步验证了其有效性。

Title: When Large Vision-Language Models Meet Person Re-Identification

Authors: Qizao Wang, Bin Li, Xiangyang Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18111
Pdf URL: https://arxiv.org/pdf/2411.18111
Copy Paste: [[2411.18111]] When Large Vision-Language Models Meet Person Re-Identification(https://arxiv.org/abs/2411.18111)
Keywords: generation, generative
Abstract: Large Vision-Language Models (LVLMs) that incorporate visual models and Large Language Models (LLMs) have achieved impressive results across various cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one pedestrian semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the pedestrian identity representation. Our framework integrates the semantic understanding and generation capabilities of LVLMs into end-to-end ReID training, allowing LVLMs to capture rich semantic cues from pedestrian images during both training and inference. Our method achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID and offering a promising direction for future research.
摘要：结合视觉模型和大型语言模型 (LLM) 的大型视觉语言模型 (LVLM) 在各种跨模态理解和推理任务中取得了令人瞩目的成果。近年来，行人重新识别 (ReID) 也开始探索跨模态语义，以提高身份识别的准确性。然而，有效地利用 LVLM 进行 ReID 仍然是一个悬而未决的挑战。虽然 LVLM 在生成范式下通过预测下一个输出词来运行，但 ReID 需要提取判别性身份特征来匹配跨摄像头的行人。在本文中，我们提出了 LVLM-ReID，这是一个利用 LVLM 的优势来促进 ReID 的新框架。具体来说，我们使用指令来指导 LVLM 生成一个行人语义标记，该标记封装了行人图像中的关键外观语义。该标记通过我们的语义引导交互 (SGI) 模块进一步细化，从而在语义标记和视觉标记之间建立了相互交互。最终，强化语义标记可作为行人身份表示。我们的框架将 LVLM 的语义理解和生成功能集成到端到端 ReID 训练中，使 LVLM 能够在训练和推理过程中从行人图像中捕获丰富的语义线索。我们的方法在多个基准测试中取得了有竞争力的结果，而无需额外的图像文本注释，展示了 LVLM 生成的语义在推进行人 ReID 方面的潜力，并为未来的研究提供了一个有希望的方向。

Title: ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

Authors: Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18135
Pdf URL: https://arxiv.org/pdf/2411.18135
Copy Paste: [[2411.18135]] ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts(https://arxiv.org/abs/2411.18135)
Keywords: generation
Abstract: Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments demonstrate that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, as demonstrated through both qualitative and quantitative evaluations on the T3Bench benchmark suite.
摘要：现有的基于分数蒸馏采样 (SDS) 的方法推动了文本到 3D 生成的重大进展。然而，基于 SDS 的方法生成的 3D 模型往往表现出过度平滑和低质量的输出。这些问题源于当前方法的模式寻求行为，其中用于更新模型的分数在多个模式之间振荡，导致优化不稳定和输出质量下降。为了解决这个问题，我们引入了一种名为 ISD 的新型图像提示分数蒸馏损失，它使用参考图像将文本到 3D 优化引导到特定模式。我们的 ISD 损失可以通过使用 IP-Adapter（一种用于将图像提示功能集成到文本到图像扩散模型的轻量级适配器）作为模式选择模块来实现。当不受参考图像提示时，此适配器的变体可以作为有效的控制变量来减少分数估计的方差，从而提高输出质量和优化稳定性。我们的实验表明，与之前的文本转 3D 方法相比，ISD 损失始终能够实现视觉连贯、高质量的输出，并且提高了优化速度，这在 T3Bench 基准套件上的定性和定量评估中均得到了证明。

Title: Type-R: Automatically Retouching Typos for Text-to-Image Generation

Authors: Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seichi Uchida, Kota Yamaguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18159
Pdf URL: https://arxiv.org/pdf/2411.18159
Copy Paste: [[2411.18159]] Type-R: Automatically Retouching Typos for Text-to-Image Generation(https://arxiv.org/abs/2411.18159)
Keywords: generation
Abstract: While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
摘要：虽然最近的文本转图像模型可以根据反映详细说明的文本提示生成照片级逼真的图像，但它们在准确渲染图像中的文字方面仍然面临重大挑战。在本文中，我们建议在后处理流程中修饰错误的文本渲染。我们的方法称为 Type-R，它可以识别生成的图像中的印刷错误，删除错误文本，重新生成缺失单词的文本框，最后纠正渲染单词中的拼写错误。通过大量实验，我们表明 Type-R 与最新的文本转图像模型（如 Stable Diffusion 或 Flux）相结合，在保持图像质量的同时实现了最高的文本渲染精度，并且在平衡文本精度和图像质量方面也优于以文本为中心的生成基线。

Title: DistinctAD: Distinctive Audio Description Generation in Contexts

Authors: Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18180
Pdf URL: https://arxiv.org/pdf/2411.18180
Copy Paste: [[2411.18180]] DistinctAD: Distinctive Audio Description Generation in Contexts(https://arxiv.org/abs/2411.18180)
Keywords: generation
Abstract: Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose DistinctAD, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.
摘要：音频描述 (AD) 旨在以文本形式提供电影旁白，描述与对话无关的叙述，例如角色、动作或场景设置。自动生成 AD 仍然具有挑战性，因为：i) 电影 AD 数据与用于训练视觉语言模型的现有数据之间存在领域差距，以及 ii) 长电影中高度相似的相邻视觉片段导致上下文冗余的问题。在这项工作中，我们提出了 DistinctAD，这是一个新颖的两阶段框架，用于生成强调独特性以产生更好叙述的 AD。为了解决领域差距，我们引入了一种 CLIP-AD 自适应策略，该策略不需要额外的 AD 语料库，从而能够在全局和细粒度级别更有效地对齐电影和 AD 模态。在第二阶段，DistinctAD 融合了两项关键创新：（i）上下文期望最大化注意 (EMA) 模块，通过从连续视频片段中提取共同基础来减少冗余；（ii）显式独特词预测损失，过滤掉上下文中的重复词，确保预测特定于当前 AD 的独特术语。对 MAD-Eval、CMD-AD 和 TV-AD 基准的综合评估证明了 DistinctAD 的优越性，该模型的表现始终优于基线，尤其是在 Recall@k/N 方面，凸显了其在生成高质量、独特 AD 方面的有效性。

Title: Semantic Edge Computing and Semantic Communications in 6G Networks: A Unifying Survey and Research Challenges

Authors: Milin Zhang, Mohammad Abdi, Venkat R. Dasari, Francesco Restuccia
Subjects: cs.LG, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2411.18199
Pdf URL: https://arxiv.org/pdf/2411.18199
Copy Paste: [[2411.18199]] Semantic Edge Computing and Semantic Communications in 6G Networks: A Unifying Survey and Research Challenges(https://arxiv.org/abs/2411.18199)
Keywords: generation
Abstract: Semantic Edge Computing (SEC) and Semantic Communications (SemComs) have been proposed as viable approaches to achieve real-time edge-enabled intelligence in sixth-generation (6G) wireless networks. On one hand, SemCom leverages the strength of Deep Neural Networks (DNNs) to encode and communicate the semantic information only, while making it robust to channel distortions by compensating for wireless effects. Ultimately, this leads to an improvement in the communication efficiency. On the other hand, SEC has leveraged distributed DNNs to divide the computation of a DNN across different devices based on their computational and networking constraints. Although significant progress has been made in both fields, the literature lacks a systematic view to connect both fields. In this work, we fill the current gap by unifying the SEC and SemCom fields. We summarize the research problems in these two fields and provide a comprehensive review of the state of the art with a focus on their technical strengths and challenges.
摘要：语义边缘计算 (SEC) 和语义通信 (SemComs) 已被提议作为在第六代 (6G) 无线网络中实现实时边缘智能的可行方法。一方面，SemCom 利用深度神经网络 (DNN) 的优势来仅对语义信息进行编码和通信，同时通过补偿无线效应使其对信道失真具有鲁棒性。最终，这将提高通信效率。另一方面，SEC 利用分布式 DNN 根据计算和网络约束将 DNN 的计算划分到不同的设备之间。尽管这两个领域都取得了重大进展，但文献缺乏将这两个领域联系起来的系统观点。在这项工作中，我们通过统一 SEC 和 SemCom 领域来填补当前的空白。我们总结了这两个领域的研究问题，并全面回顾了最新技术，重点介绍了它们的技术优势和挑战。

Title: SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

Authors: Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, Rang Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18229
Pdf URL: https://arxiv.org/pdf/2411.18229
Copy Paste: [[2411.18229]] SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation(https://arxiv.org/abs/2411.18229)
Keywords: generative
Abstract: We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth, generating depth maps with sharp boundaries yet only providing relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth's effectiveness, showing its ability to achieve both high depth accuracy and detailed representation, making it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.
摘要：我们提出了 SharpDepth，这是一种新颖的单目度量深度估计方法，它将判别深度估计方法（例如 Metric3D、UniDepth）的度量精度与通常由生成方法（例如 Marigold、Lotus）实现的细粒度边界清晰度相结合。使用具有稀疏地面真实深度的真实世界数据训练的传统判别模型可以准确预测度量深度，但通常会产生过度平滑或低细节的深度图。相比之下，生成模型是在具有密集地面真实值的合成数据上训练的，可以生成具有清晰边界的深度图，但仅提供相对深度且准确度较低。我们的方法通过将度量精度与详细的边界保留相结合来弥补这些限制，从而产生既度量精确又视觉清晰的深度预测。我们在标准深度估计基准上进行的大量零样本评估证实了 SharpDepth 的有效性，表明它能够同时实现高深度精度和详细表示，使其非常适合需要在多样化的真实环境中实现高质量深度感知的应用程序。

Title: Dynamic Retail Pricing via Q-Learning -- A Reinforcement Learning Framework for Enhanced Revenue Management

Authors: Mohit Apte, Ketan Kale, Pranav Datar, Pratiksha Deshmukh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18261
Pdf URL: https://arxiv.org/pdf/2411.18261
Copy Paste: [[2411.18261]] Dynamic Retail Pricing via Q-Learning -- A Reinforcement Learning Framework for Enhanced Revenue Management(https://arxiv.org/abs/2411.18261)
Keywords: generation
Abstract: This paper explores the application of a reinforcement learning (RL) framework using the Q-Learning algorithm to enhance dynamic pricing strategies in the retail sector. Unlike traditional pricing methods, which often rely on static demand models, our RL approach continuously adapts to evolving market dynamics, offering a more flexible and responsive pricing strategy. By creating a simulated retail environment, we demonstrate how RL effectively addresses real-time changes in consumer behavior and market conditions, leading to improved revenue outcomes. Our results illustrate that the RL model not only surpasses traditional methods in terms of revenue generation but also provides insights into the complex interplay of price elasticity and consumer demand. This research underlines the significant potential of applying artificial intelligence in economic decision-making, paving the way for more sophisticated, data-driven pricing models in various commercial domains.
摘要：本文探讨了使用 Q-Learning 算法的强化学习 (RL) 框架在零售领域增强动态定价策略的应用。与通常依赖静态需求模型的传统定价方法不同，我们的 RL 方法不断适应不断变化的市场动态，提供更灵活、响应更快的定价策略。通过创建模拟零售环境，我们展示了 RL 如何有效应对消费者行为和市场条件的实时变化，从而提高收入结果。我们的结果表明，RL 模型不仅在创收方面超越了传统方法，而且还提供了对价格弹性和消费者需求之间复杂相互作用的洞察。这项研究强调了人工智能在经济决策中应用的巨大潜力，为各个商业领域更复杂的数据驱动定价模型铺平了道路。
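
Note: the Q-Learning approach to pricing described in the abstract above can be sketched with a tabular agent in a toy simulated market. Everything specific here is an assumption made for illustration and is not taken from the paper: the price points, the two demand states, the linear demand model in `simulate_step`, and the hyperparameter defaults. Only the update rule itself is the standard Q-learning rule.

```python
import random

PRICES = [8.0, 10.0, 12.0]              # hypothetical price points (the actions)
STATES = ["low_demand", "high_demand"]  # hypothetical market states


def simulate_step(state, price, rng):
    """Toy environment (illustrative only): returns (revenue, next_state)."""
    base = 20.0 if state == "high_demand" else 10.0
    units = max(0.0, base - price + rng.uniform(-1.0, 1.0))  # noisy linear demand
    next_state = rng.choice(STATES)  # demand state changes at random
    return price * units, next_state


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Standard Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in range(len(PRICES)))
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def train(steps=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning in the toy market and return the Q-table."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in range(len(PRICES))}
    state = rng.choice(STATES)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(PRICES))  # explore a random price
        else:
            action = max(range(len(PRICES)), key=lambda a: q[(state, a)])  # exploit
        reward, next_state = simulate_step(state, PRICES[action], rng)
        q_update(q, state, action, reward, next_state, alpha, gamma)
        state = next_state
    return q


if __name__ == "__main__":
    table = train()
    for s in STATES:
        best = max(range(len(PRICES)), key=lambda a: table[(s, a)])
        print(s, "-> greedy price", PRICES[best])
```

The greedy price per state is read off the learned table by taking the action with the largest Q-value; a realistic setup would replace the toy demand model with historical or simulated sales data.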

Title: TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

Authors: Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18263
Pdf URL: https://arxiv.org/pdf/2411.18263
Copy Paste: [[2411.18263]] TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution(https://arxiv.org/abs/2411.18263)
Keywords: restoration, super-resolution
Abstract: Pre-trained text-to-image diffusion models are increasingly applied to real-world image super-resolution (Real-ISR) tasks. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration or detail recovery is not satisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce the Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Secondly, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that our TSD-SR has superior restoration results (most of the metrics perform the best) and the fastest inference speed (e.g. 40 times faster than SeeSR) compared to the past Real-ISR approaches based on pre-trained diffusion priors.
摘要：预训练的文本到图像扩散模型越来越多地应用于真实世界图像超分辨率（Real-ISR）任务。鉴于扩散模型的迭代细化特性，大多数现有方法的计算成本很高。虽然出现了诸如 SinSR 和 OSEDiff 之类的方法通过蒸馏来压缩推理步骤，但它们在图像恢复或细节恢复方面的表现并不令人满意。为了解决这个问题，我们提出了 TSD-SR，一种专为真实世界图像超分辨率设计的新型蒸馏框架，旨在构建一个高效的一步模型。我们首先引入目标分数蒸馏，它利用扩散模型的先验和真实图像参考来实现更真实的图像恢复。其次，我们提出了一个分布感知采样模块，使面向细节的梯度更容易获得，从而解决了恢复精细细节的挑战。大量实验表明，与过去基于预训练扩散先验的 Real-ISR 方法相比，我们的 TSD-SR 具有更优异的恢复结果（大多数指标表现最佳）和最快的推理速度（例如比 SeeSR 快 40 倍）。

Title: MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

Authors: Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, He Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18281
Pdf URL: https://arxiv.org/pdf/2411.18281
Copy Paste: [[2411.18281]] MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation(https://arxiv.org/abs/2411.18281)
Keywords: generation
Abstract: Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in ID-preserving, high-quality video generation.
摘要：个性化文本转视频 (T2V) 生成的最新进展凸显了整合特定角色身份和动作的重要性。然而，之前的 T2V 模型在身份一致性和可控运动动态方面存在困难，这主要是由于有限的细粒度面部和基于动作的文本提示，以及忽略关键人类属性和动作的数据集。为了应对这些挑战，我们提出了 MotionCharacter，这是一个高效且高保真的人类视频生成框架，专为身份保存和细粒度运动控制而设计。我们引入了一个 ID 保存模块来保持身份保真度，同时允许灵活的属性修改，并进一步集成 ID 一致性和区域感知损失机制，显著增强身份一致性和细节保真度。此外，我们的方法结合了一个运动控制模块，该模块优先考虑与动作相关的文本，同时保持主题一致性，以及一个数据集 Human-Motion，它利用大型语言模型来生成详细的运动描述。为了简化推理过程中的用户控制，我们通过单个系数参数化运动强度，以便于调整。大量实验凸显了 MotionCharacter 的有效性，证明了在 ID 保留和高质量视频生成方面有显著的改进。

Title: HiFiVFS: High Fidelity Video Face Swapping

Authors: Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18293
Pdf URL: https://arxiv.org/pdf/2411.18293
Copy Paste: [[2411.18293]] HiFiVFS: High Fidelity Video Face Swapping(https://arxiv.org/abs/2411.18293)
Keywords: generative
Abstract: Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance in video face swapping, both qualitatively and quantitatively.
摘要：换脸旨在生成将源身份与目标属性相结合的结果。现有方法主要侧重于基于图像的换脸。在处理视频时，每帧都是独立处理的，因此很难确保时间稳定性。从模型的角度来看，换脸正逐渐从生成对抗网络 (GAN) 转向扩散模型 (DM)，因为 DM 已被证明具有更强的生成能力。当前基于扩散的方法通常采用修复技术，这些技术难以保留照明和化妆等细粒度属性。为了应对这些挑战，我们提出了一个高保真视频换脸 (HiFiVFS) 框架，该框架利用稳定视频扩散 (SVD) 的强大生成能力和时间先验。我们构建了一个细粒度属性模块，通过身份脱敏和对抗学习来提取身份解缠和细粒度属性特征。此外，我们引入了详细的身份注入以进一步增强身份相似性。大量实验表明，我们的方法在视频换脸方面无论从质量还是数量上都达到了最新水平（SOTA）。

Title: Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

Authors: Tianyi Wei, Dongdong Chen, Yifan Zhou, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18301
Pdf URL: https://arxiv.org/pdf/2411.18301
Copy Paste: [[2411.18301]] Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation(https://arxiv.org/abs/2411.18301)
Keywords: generation
Abstract: Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at this https URL.
摘要：最新的多模态扩散变换器 (MMDiT) 代表了文本到图像模型的前沿技术，它在很大程度上缓解了以前模型中存在的许多生成问题。然而，我们发现，当输入文本提示包含多个具有相似语义或外观的主题时，它仍然会受到主题忽略或混合的影响。我们在 MMDiT 架构中确定了导致此问题的三种可能的歧义：块间歧义、文本编码器歧义和语义歧义。为了解决这些问题，我们建议在早期去噪步骤中通过测试时优化来动态修复模糊的潜在问题。具体来说，我们设计了三个损失函数：块对齐损失、文本编码器对齐损失和重叠损失，每个损失函数都经过量身定制以减轻这些歧义。尽管有显着的改进，但我们观察到在生成多个相似主题时语义歧义仍然存在，因为重叠损失提供的指导不够明确。因此，我们进一步提出了重叠在线检测和返回开始采样策略来缓解该问题。在新构建的类似主题的具有挑战性的数据集上的实验结果验证了我们方法的有效性，显示出比现有方法更高的生成质量和更高的成功率。我们的代码将在此 https URL 上提供。

Title: InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Authors: Wenjie Zhuo, Fan Ma, Hehe Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18303
Pdf URL: https://arxiv.org/pdf/2411.18303
Copy Paste: [[2411.18303]] InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation(https://arxiv.org/abs/2411.18303)
Keywords: generation
Abstract: We present InfiniDreamer, a novel framework for arbitrarily long human motion generation. InfiniDreamer addresses the limitations of current motion generation methods, which are typically restricted to short sequences due to the lack of long motion training data. To achieve this, we first generate sub-motions corresponding to each textual description and then assemble them into a coarse, extended sequence using randomly initialized transition segments. We then introduce an optimization-based method called Segment Score Distillation (SSD) to refine the entire long motion sequence. SSD is designed to utilize an existing motion prior, which is trained only on short clips, in a training-free manner. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
摘要：我们提出了 InfiniDreamer，一种用于生成任意长度人体运动的新型框架。InfiniDreamer 解决了当前运动生成方法的局限性，由于缺乏长运动训练数据，这些方法通常仅限于短序列。为了实现这一点，我们首先生成与每个文本描述相对应的子运动，然后使用随机初始化的过渡段将它们组装成一个粗略的扩展序列。然后，我们引入一种基于优化的方法来细化整个长运动序列，称为片段分数蒸馏 (SSD)。SSD 旨在以无需训练的方式利用现有的运动先验，该先验仅在短片段上进行训练。具体而言，SSD 迭代细化从粗略扩展的长运动序列中采样的重叠短片段，逐步将它们与预先训练的运动扩散先验对齐。此过程确保每个片段内的局部一致性，而片段之间的细化过渡则在整个序列中保持全局一致性。大量定性和定量实验验证了我们框架的优越性，展示了其生成任意长度的连贯、上下文感知的运动序列的能力。
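
The InfiniDreamer abstract describes refining overlapping short segments taken from a long motion sequence. A minimal sketch of how such overlapping windows could be laid out is given below. The function name `overlapping_segments` and the fixed `window`/`stride` scheme are illustrative assumptions; the abstract says segments are sampled and does not specify this layout.

```python
def overlapping_segments(total_len, window, stride):
    """Return (start, end) frame ranges that cover a sequence of total_len frames.

    Hypothetical helper: consecutive windows overlap whenever stride < window.
    A final window is appended if the regular grid leaves a tail uncovered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if total_len <= window:
        return [(0, total_len)]
    starts = list(range(0, total_len - window + 1, stride))
    if starts[-1] + window < total_len:
        # Cover the remaining frames with one extra, right-aligned window.
        starts.append(total_len - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for segment in overlapping_segments(11, 4, 3):
        print(segment)
```

Each window is short enough for a prior trained on short clips, and the shared frames between neighbouring windows are where transitions would be reconciled.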

Title: MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement

Authors: Xiwei Deng, Xianchun He, Yudan Zhou, Shuhui Cai, Congbo Cai, Zhong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18309
Pdf URL: https://arxiv.org/pdf/2411.18309
Copy Paste: [[2411.18309]] MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement(https://arxiv.org/abs/2411.18309)
Keywords: generation
Abstract: CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians' workload and improving patient care. Despite their clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack the clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propose a novel Multi-view perception Knowledge-enhanced Transformer (MvKeTR) to mimic the diagnostic workflow of clinicians. Just as radiologists first examine CT scans from multiple planes, a Multi-View Perception Aggregator (MVPA) with view-aware attention effectively synthesizes diagnostic information from multiple anatomical views. Then, inspired by how radiologists further refer to relevant clinical records to guide diagnostic decision-making, a Cross-Modal Knowledge Enhancer (CMKE) retrieves the most similar reports based on the query volume to incorporate domain knowledge into the diagnosis procedure. Furthermore, instead of traditional MLPs, we employ Kolmogorov-Arnold Networks (KANs) with learnable nonlinear activation functions as the fundamental building blocks of both modules to better capture intricate diagnostic patterns in CT interpretation. Extensive experiments on the public CTRG-Chest-548K dataset demonstrate that our method outperforms prior state-of-the-art models across all metrics.
摘要：CT 报告生成 (CTRG) 旨在自动生成 3D 体积的诊断报告，减轻临床医生的工作量并改善患者护理。尽管具有临床价值，但现有研究未能有效结合来自多个解剖视图的诊断信息，并且缺乏准确可靠诊断所必需的相关临床专业知识。为了解决这些限制，我们提出了一种新颖的多视图感知知识增强转换器 (MvKeTR) 来模拟临床医生的诊断工作流程。就像放射科医生首先从多个平面检查 CT 扫描一样，具有视图感知注意的多视图感知聚合器 (MVPA) 可以有效地综合来自多个解剖视图的诊断信息。然后，受放射科医生如何进一步参考相关临床记录来指导诊断决策的启发，跨模态知识增强器 (CMKE) 根据查询量检索最相似的报告，以将领域知识纳入诊断程序。此外，我们采用具有可学习非线性激活函数的 Kolmogorov-Arnold 网络 (KAN) 作为两个模块的基本构建块，而不是传统的 MLP，以便更好地捕捉 CT 解释中的复杂诊断模式。在公共 CTRG-Chest-548K 数据集上进行的大量实验表明，我们的方法在所有指标上都超越了之前最先进的模型。

Title: TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Authors: Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18350
Pdf URL: https://arxiv.org/pdf/2411.18350
Copy Paste: [[2411.18350]] TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models(https://arxiv.org/abs/2411.18350)
Keywords: generation, generative
Abstract: This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baseline methods based on pose transfer and virtual try-on with fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction. Demo, code, and models are available at: this https URL
摘要：本文介绍了虚拟试穿 (VTOFF)，这是一项新颖的任务，重点是从穿衣人的单张照片生成标准化的服装图像。与传统的虚拟试穿 (VTON)（以数字方式为模特穿衣）不同，VTOFF 旨在提取规范的服装图像，这在捕捉服装形状、纹理和复杂图案方面提出了独特的挑战。这个明确的目标使 VTOFF 在评估生成模型中的重建保真度方面特别有效。我们提出了 TryOffDiff 模型，该模型采用基于 SigLIP 的视觉调节来适应稳定扩散，以确保高保真度和细节保留。在修改后的 VITON-HD 数据集上进行的实验表明，我们的方法优于基于姿势转移和虚拟试穿的基线方法，并且预处理和后处理步骤更少。我们的分析表明，传统的图像生成指标不足以评估重建质量，这促使我们依靠 DISTS 进行更准确的评估。我们的结果突出了 VTOFF 在电子商务应用中增强产品图像、推进生成模型评估和启发未来高保真重建工作的潜力。演示、代码和模型可在以下位置获得：此 https URL

Title: Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

Authors: Yiming Wu, Huan Wang, Zhenghao Chen, Dong Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.18375
Pdf URL: https://arxiv.org/pdf/2411.18375
Copy Paste: [[2411.18375]] Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models(https://arxiv.org/abs/2411.18375)
Keywords: generation
Abstract: High computational cost and slow inference are major obstacles to deploying video diffusion models (VDMs) in practical applications. To overcome this, we introduce a new video diffusion model compression approach that uses individual content and motion dynamics preserved pruning together with a consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics (e.g., coherence of the entire video), while shallower layers focus more on individual content (e.g., individual frames). Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss so that VDMini (the student) attains generation performance comparable to that of the larger VDM (the teacher). In particular, we first use the Individual Content Distillation (ICD) Loss to ensure consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we achieve average speedups of 2.5× and 1.4× for the I2V method SF-V and the T2V method T2V-Turbo-v2, respectively, while maintaining the quality of the generated videos on two benchmarks, UCF101 and VBench.
摘要：高计算成本和慢推理时间是视频扩散模型 (VDM) 在实际应用中部署的主要障碍。为了克服这个问题，我们引入了一种新的视频扩散模型压缩方法，使用保留个体内容和运动动态的修剪和一致性损失。首先，我们通过经验观察到更深的 VDM 层对于保持 \textbf{运动动态} 的质量（例如整个视频的连贯性）至关重要，而较浅的层则更侧重于 \textbf{个体内容}（例如个体帧）。因此，我们从较浅的层中修剪冗余块，同时保留更多的较深的层，从而得到一个称为 VDMini 的轻量级 VDM 变体。此外，我们提出了一种 \textbf{个体内容和运动动态 (ICMD)} 一致性损失，以获得与更大的 VDM（即老师与 VDMini（即学生））相当的生成性能。具体来说，我们首先使用个体内容蒸馏 (ICD) 损失来确保教师模型和学生模型之间每个生成帧的特征一致性。接下来，我们引入多帧内容对抗 (MCA) 损失来增强整个生成视频的运动动态。此方法显著加快了推理时间，同时保持了高质量的视频生成。大量实验证明了我们的 VDMini 在两个重要的视频生成任务——文本转视频 (T2V) 和图像转视频 (I2V) 上的有效性，其中我们分别实现了 I2V 方法 SF-V 和 T2V 方法 T2V-Turbo-v2 平均 2.5 $\times$ 和 1.4 $\times$ 的速度提升，同时在两个基准测试（即 UCF101 和 VBench）上保持了生成的视频的质量。

Title: Adaptive Blind All-in-One Image Restoration

Authors: David Serrano-Lozano, Luis Herranz, Shaolin Su, Javier Vazquez-Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18412
Pdf URL: https://arxiv.org/pdf/2411.18412
Copy Paste: [[2411.18412]] Adaptive Blind All-in-One Image Restoration(https://arxiv.org/abs/2411.18412)
Keywords: restoration
Abstract: Blind all-in-one image restoration models aim to recover a high-quality image from an input degraded with unknown distortions. However, these models require all the possible degradation types to be defined during the training stage while showing limited generalization to unseen degradations, which limits their practical application in complex cases. In this paper, we propose a simple but effective adaptive blind all-in-one restoration (ABAIR) model, which addresses multiple degradations, generalizes well to unseen degradations, and efficiently incorporates new degradations by training a small fraction of parameters. First, we train our baseline model on a large dataset of natural images with multiple synthetic degradations, augmented with a segmentation head to estimate per-pixel degradation types, resulting in a powerful backbone able to generalize to a wide range of degradations. Second, we adapt our baseline model to varying image restoration tasks using independent low-rank adapters. Third, we learn to adaptively combine adapters for diverse images via a flexible and lightweight degradation estimator. Our model is both powerful in handling specific distortions and flexible in adapting to complex tasks. It not only outperforms the state-of-the-art by a large margin on five- and three-task IR setups, but also shows improved generalization to unseen degradations and composite distortions.
摘要：盲目一体化图像恢复模型旨在从具有未知失真的退化输入中恢复高质量图像。然而，这些模型要求在训练阶段定义所有可能的退化类型，同时对看不见的退化表现出有限的泛化能力，这限制了它们在复杂情况下的实际应用。在本文中，我们提出了一种简单但有效的自适应盲目一体化恢复 (ABAIR) 模型，该模型可以解决多种退化问题，很好地泛化到看不见的退化，并通过训练一小部分参数有效地整合新的退化。首先，我们在具有多种合成退化的大型自然图像数据集上训练我们的基线模型，并使用分割头来估计每个像素的退化类型，从而产生一个能够泛化到各种退化的强大主干。其次，我们使用独立的低秩适配器将我们的基线模型调整到不同的图像恢复任务。第三，我们学习通过灵活且轻量级的退化估计器自适应地将适配器组合到多功能图像中。我们的模型在处理特定的扭曲方面非常强大，在适应复杂任务方面也很灵活，它不仅在五任务和三任务 IR 设置上大大超越了最先进的技术，而且还显示出对看不见的退化和复合扭曲的改进的泛化。
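
The ABAIR abstract mentions independent low-rank adapters that are combined adaptively by a degradation estimator. The sketch below shows one common way such a combination can be written: a base weight matrix plus a weighted sum of low-rank updates, W' = W + Σ_i w_i · B_i A_i. The weighted-sum form, the function names, and the toy matrices are assumptions for illustration; the abstract does not state the exact combination rule.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def combine_adapters(base, adapters, weights):
    """Return base + sum_i weights[i] * (B_i @ A_i).

    base:     d_out x d_in matrix (frozen backbone weight).
    adapters: list of (B, A) pairs, with B of shape d_out x r and A of shape r x d_in.
    weights:  one mixing weight per adapter (hypothetically produced by an estimator).
    """
    combined = [row[:] for row in base]
    for (b, a), w in zip(adapters, weights):
        update = matmul(b, a)
        for i, row in enumerate(update):
            for j, value in enumerate(row):
                combined[i][j] += w * value
    return combined


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    adapter = ([[1.0], [0.0]], [[0.0, 2.0]])  # rank-1 update
    print(combine_adapters(identity, [adapter], [1.0]))
```

Because each update has rank r much smaller than the layer width, adding a new degradation only requires training the small B and A matrices, which matches the abstract's claim of training a small fraction of parameters.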

Title: Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

Authors: Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas
Subjects: cs.LG, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.18447
Pdf URL: https://arxiv.org/pdf/2411.18447
Copy Paste: [[2411.18447]] Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation(https://arxiv.org/abs/2411.18447)
Keywords: generation, generative
Abstract: Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.
摘要：自回归模型通常应用于离散标记序列，但最近的研究表明，以自回归方式生成连续嵌入序列也是可行的。但是，由于推理过程中的错误积累，这种连续自回归模型 (CAM) 可能会在扩展序列上遭受生成质量下降的困扰。我们引入了一种新方法来解决此问题，即在训练期间向输入嵌入中注入随机噪声。此过程使模型在推理时能够抵抗不同程度的错误。我们通过引入低级噪声的推理过程进一步减少错误积累。音乐音频生成实验表明，CAM 的表现大大优于现有的自回归和非自回归方法，同时保持了扩展序列的音频质量。这项工作为在纯自回归环境中生成连续嵌入铺平了道路，为实时和交互式生成应用开辟了新的可能性。
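
The abstract above states that random noise is injected into the input embeddings during training so the model tolerates its own errors at inference. A minimal sketch of that idea follows. The linear blend, the uniformly sampled noise level, and the name `noise_augment` are assumptions; the abstract does not give the actual noise schedule.

```python
import random


def noise_augment(embeddings, max_noise=0.5, rng=None):
    """Blend each input embedding with Gaussian noise at a random level.

    Hypothetical training-time augmentation: one noise level is drawn per call
    and every coordinate is mixed as (1 - level) * x + level * noise.
    Returns the noisy embeddings and the level that was used.
    """
    rng = rng or random.Random()
    level = rng.uniform(0.0, max_noise)
    noisy = [
        [(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in vector]
        for vector in embeddings
    ]
    return noisy, level


if __name__ == "__main__":
    sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    noisy, level = noise_augment(sequence, max_noise=0.5, rng=random.Random(0))
    print(level, noisy)
```

During training the model would see `noisy` as context while still being asked to predict the clean next embedding, so small prediction errors at inference look like inputs it has already learned to handle.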

Title: Advancements in Myocardial Infarction Detection and Classification Using Wearable Devices: A Comprehensive Review

Authors: Abhijith S, Arjun Rajesh, Mansi Manoj, Sandra Davis Kollannur, Sujitta R V, Jerrin Thomas Panachakel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.18451
Pdf URL: https://arxiv.org/pdf/2411.18451
Copy Paste: [[2411.18451]] Advancements in Myocardial Infarction Detection and Classification Using Wearable Devices: A Comprehensive Review(https://arxiv.org/abs/2411.18451)
Keywords: generation
Abstract: Myocardial infarction (MI), commonly known as a heart attack, is a critical health condition caused by restricted blood flow to the heart. Early-stage detection through continuous ECG monitoring is essential to minimize irreversible damage. This review explores advancements in MI classification methodologies for wearable devices, emphasizing their potential in real-time monitoring and early diagnosis. It critically examines traditional approaches, such as morphological filtering and wavelet decomposition, alongside cutting-edge techniques, including Convolutional Neural Networks (CNNs) and VLSI-based methods. By synthesizing findings on machine learning, deep learning, and hardware innovations, this paper highlights their strengths, limitations, and future prospects. The integration of these techniques into wearable devices offers promising avenues for efficient, accurate, and energy-aware MI detection, paving the way for next-generation wearable healthcare solutions.
摘要：心肌梗塞 (MI)，俗称心脏病发作，是一种因心脏血流受限而导致的严重健康状况。通过持续心电图监测进行早期检测对于最大限度地减少不可逆转的损害至关重要。本综述探讨了可穿戴设备 MI 分类方法的进展，强调了它们在实时监测和早期诊断方面的潜力。它批判性地审查了形态滤波和小波分解等传统方法以及卷积神经网络 (CNN) 和基于 VLSI 的方法等尖端技术。通过综合机器学习、深度学习和硬件创新方面的发现，本文强调了它们的优势、局限性和未来前景。将这些技术集成到可穿戴设备中为高效、准确和节能的 MI 检测提供了有希望的途径，为下一代可穿戴医疗解决方案铺平了道路。

Title: Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification

Authors: José Fernando Núñez, Jamie Arjona, Javier Béjar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18456
Pdf URL: https://arxiv.org/pdf/2411.18456
Copy Paste: [[2411.18456]] Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification(https://arxiv.org/abs/2411.18456)
Keywords: generation, generative
Abstract: Deep learning models need a sufficient amount of data to find the hidden patterns in it. The purpose of generative modeling is to learn the data distribution, which allows us to sample more data and augment the original dataset. For physiological data, and electrocardiogram (ECG) data in particular, which is sensitive and expensive to collect, we can use generative models to enlarge existing datasets and improve downstream tasks, in our case the classification of heart rhythm. In this work, we explore the usefulness of synthetic data generated with different deep generative models, namely Diffweave, Time-Diffusion and Time-VQVAE, to obtain better classification results on two open source multivariate ECG datasets. We also investigate the effects of transfer learning by fine-tuning a synthetically pre-trained model and then progressively adding increasing proportions of real data. We conclude that, although the synthetic samples resemble the real ones, the classification improvement from simply augmenting the real dataset is barely noticeable on individual datasets; when both datasets are merged, however, using synthetic samples as augmented data increases all classifier metrics. In the fine-tuning results, the Time-VQVAE generative model proved superior to the others, but not powerful enough to achieve results close to those of a classifier trained only with real data. In addition, methods and metrics for measuring the closeness between synthetic and real data were explored as a by-product of the main research questions of this study.
摘要：深度学习模型需要足够数量的数据才能找到其中隐藏的模式。生成模型的目的是学习数据分布，从而使我们能够采样更多数据并扩充原始数据集。在生理数据（更具体地说是心电图 (ECG) 数据）的背景下，考虑到其敏感性和数据收集成本高昂，我们可以利用生成模型的优势来扩大现有数据集并改进下游任务，在我们的例子中是心律分类。在这项工作中，我们探索了使用来自深度学习的不同生成模型（即 Diffweave、Time-Diffusion 和 Time-VQVAE）生成的合成数据的实用性，以便为两个开源多变量 ECG 数据集获得更好的分类结果。此外，我们还通过微调合成预训练模型然后逐步添加越来越多的真实数据来研究迁移学习的效果。我们得出的结论是，尽管合成样本与真实样本相似，但仅增强真实数据集时，分类改进在单个数据集上几乎不可察觉，但当两个数据集合并时，结果显示，当使用合成样本作为增强数据时，分类器的所有指标都有所增加。从微调结果来看，Time-VQVAE 生成模型已证明优于其他模型，但不足以实现接近仅使用真实数据训练的分类器的结果。此外，作为本研究主要研究问题的附带结果，还探索了用于衡量合成数据与真实数据之间接近度的方法和指标。

Title: Complexity Experts are Task-Discriminative Learners for Any Image Restoration

Authors: Eduard Zamfir, Zongwei Wu, Nancy Mehta, Yuedong Tan, Danda Pani Paudel, Yulun Zhang, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18466
Pdf URL: https://arxiv.org/pdf/2411.18466
Copy Paste: [[2411.18466]] Complexity Experts are Task-Discriminative Learners for Any Image Restoration(https://arxiv.org/abs/2411.18466)
Keywords: restoration
Abstract: Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This makes it hard to exploit the computational benefits of MoEs by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts": flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source will be made publicly available at this https URL.
摘要：一体化图像恢复模型的最新进展彻底改变了通过统一框架解决各种退化问题的能力。然而，与特定任务相关的参数通常对其他任务无效，因此混合专家 (MoE) 架构成为一种自然延伸。尽管如此，MoE 经常表现出不一致的行为，一些专家意外地跨任务进行推广，而其他专家则在其预期范围内挣扎。这阻碍了 MoE 在推理过程中绕过不相关的专家来利用其计算优势。我们将这种不良行为归因于传统 MoE 的统一和僵化的架构。为了解决这个问题，我们引入了“复杂性专家”——具有不同计算复杂性和接受域的灵活专家块。一个关键的挑战是将任务分配给每个专家，因为退化复杂性是事先未知的。因此，我们执行任务时会简单偏向于较低的复杂性。令我们惊讶的是，这种偏好有效地推动了特定任务的分配，将任务分配给具有适当复杂性的专家。大量实验验证了我们的方法，证明了在推理过程中绕过不相关的专家同时保持卓越性能的能力。所提出的 MoCE-IR 模型优于最先进的方法，证实了其效率和实用性。源代码将在 \href{this https URL}{\texttt{this http URL}} 公开提供
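
The MoCE-IR abstract describes routing "with a simple bias toward lower complexity". One way to picture such a bias is to subtract a complexity penalty from each expert's routing score before the softmax, as sketched below. The penalty form, `bias_weight`, and the function name `route` are illustrative assumptions and are not taken from the paper.

```python
import math


def route(logits, complexities, bias_weight=0.1):
    """Pick one expert, favouring cheaper experts when scores are close.

    logits:       raw router scores, one per expert.
    complexities: relative compute cost of each expert (assumed known).
    Returns the index of the selected expert and the routing probabilities.
    """
    adjusted = [score - bias_weight * cost for score, cost in zip(logits, complexities)]
    peak = max(adjusted)
    exps = [math.exp(value - peak) for value in adjusted]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    # Equal scores: the cheapest expert wins.
    print(route([1.0, 1.0, 1.0], [1, 2, 4]))
    # A clearly better score still overrides the complexity penalty.
    print(route([0.0, 0.0, 2.0], [1, 2, 4]))
```

With only the selected expert executed, experts that are never chosen for a given degradation can be skipped at inference, which is the computational benefit the abstract refers to.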

Title: GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Authors: Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18499
Pdf URL: https://arxiv.org/pdf/2411.18499
Copy Paste: [[2411.18499]] GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation(https://arxiv.org/abs/2411.18499)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. OpenING is open-sourced at this https URL.
摘要：多模态大型语言模型 (MLLM) 在视觉理解和生成任务方面取得了重大进展。然而，生成交错的图像文本内容仍然是一项挑战，这需要综合的多模态理解和生成能力。虽然统一模型的进展提供了新的解决方案，但由于数据大小和多样性限制，现有的基准不足以评估这些方法。为了弥补这一差距，我们引入了 GATE OpenING (OpenING)，这是一个全面的基准，包含 56 个真实世界任务中的 5,400 个高质量人工注释实例。OpenING 涵盖了旅游指南、设计和头脑风暴等各种日常场景，为具有挑战性的交错生成方法提供了一个强大的平台。此外，我们提出了 IntJudge，这是一个用于评估开放式多模态生成方法的判断模型。使用新颖的数据管道进行训练后，我们的 IntJudge 与人类判断的一致率达到 82.42%，比基于 GPT 的评估者高出 11.34%。对 OpenING 的大量实验表明，当前的交错生成方法仍有很大改进空间。进一步介绍了交错图像文本生成的关键发现，以指导下一代模型的开发。OpenING 已在此 https URL 上开源。

Title: Enhancing weed detection performance by means of GenAI-based image augmentation

Authors: Sourav Modak, Anthony Stein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18513
Pdf URL: https://arxiv.org/pdf/2411.18513
Copy Paste: [[2411.18513]] Enhancing weed detection performance by means of GenAI-based image augmentation(https://arxiv.org/abs/2411.18513)
Keywords: generative
Abstract: Precise weed management is essential for sustaining crop productivity and ecological balance. Traditional herbicide applications face economic and environmental challenges, emphasizing the need for intelligent weed control systems powered by deep learning. These systems require vast amounts of high-quality training data. The reality of scarcity of well-annotated training data, however, is often addressed through generating more data using data augmentation. Nevertheless, conventional augmentation techniques such as random flipping, color changes, and blurring lack sufficient fidelity and diversity. This paper investigates a generative AI-based augmentation technique that uses the Stable Diffusion model to produce diverse synthetic images that improve the quantity and quality of training datasets for weed detection models. Moreover, this paper explores the impact of these synthetic images on the performance of real-time detection systems, thus focusing on compact CNN-based models such as YOLO nano for edge devices. The experimental results show substantial improvements in mean Average Precision (mAP50 and mAP50-95) scores for YOLO models trained with generative AI-augmented datasets, demonstrating the promising potential of synthetic data to enhance model robustness and accuracy.
摘要：精准杂草管理对于维持作物产量和生态平衡至关重要。传统的除草剂应用面临经济和环境挑战，这凸显了对由深度学习驱动的智能杂草控制系统的需求。这些系统需要大量高质量的训练数据。然而，缺乏注释良好的训练数据的现实问题通常可以通过使用数据增强来生成更多数据来解决。然而，传统的增强技术（例如随机翻转、颜色变化和模糊）缺乏足够的保真度和多样性。本文研究了一种基于生成式 AI 的增强技术，该技术使用稳定扩散模型来生成多样化的合成图像，从而提高杂草检测模型的训练数据集的数量和质量。此外，本文探讨了这些合成图像对实时检测系统性能的影响，因此重点关注基于 CNN 的紧凑型模型，例如用于边缘设备的 YOLO nano。实验结果表明，使用生成式 AI 增强数据集训练的 YOLO 模型的平均精度（mAP50 和 mAP50-95）得分有显著提高，证明了合成数据在增强模型鲁棒性和准确性方面具有巨大的潜力。
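
The abstract reports mAP50 and mAP50-95 scores. Both metrics rest on the intersection-over-union (IoU) between a predicted and a ground-truth box: mAP50 counts a detection as correct at IoU ≥ 0.5, while mAP50-95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. A minimal IoU sketch is shown below. The `(x1, y1, x2, y2)` corner format and the function name are assumptions for illustration, not the paper's evaluation code.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    prediction = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    iou = box_iou(prediction, ground_truth)
    print(iou, "counts as a match at the 0.5 threshold:", iou >= 0.5)
```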

Title: PhyCAGE: Physically Plausible Compositional 3D Asset Generation from a Single Image

Authors: Han Yan, Mingrui Zhang, Yang Li, Chao Ma, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18548
Pdf URL: https://arxiv.org/pdf/2411.18548
Copy Paste: [[2411.18548]] PhyCAGE: Physically Plausible Compositional 3D Asset Generation from a Single Image(https://arxiv.org/abs/2411.18548)
Keywords: generation
Abstract: We present PhyCAGE, the first approach for physically plausible compositional 3D asset generation from a single image. Given an input image, we first generate consistent multi-view images for components of the assets. These images are then fitted with 3D Gaussian Splatting representations. To ensure that the Gaussians representing objects are physically compatible with each other, we introduce a Physical Simulation-Enhanced Score Distillation Sampling (PSE-SDS) technique to further optimize the positions of the Gaussians. This is achieved by setting the gradient of the SDS loss as the initial velocity of the physical simulation, allowing the simulator to act as a physics-guided optimizer that progressively corrects the Gaussians' positions to a physically compatible state. Experimental results demonstrate that the proposed method can generate physically plausible compositional 3D assets given a single image.
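The idea of using a loss gradient as the initial velocity of a simulator can be illustrated with a toy example. The sketch below is not the authors' PSE-SDS implementation: it assumes two 1D "objects" with fixed radii, a hand-supplied gradient standing in for the SDS loss, a velocity equal to the negative gradient (the abstract does not state the sign), and a simple inelastic contact rule. All names and parameters are hypothetical.

```python
def simulate_from_gradient(positions, radii, grad, dt=0.1, substeps=10):
    """Move two 1D objects along -grad, stopping them when they touch.

    positions, radii, grad: two-element lists (one entry per object).
    Returns the positions after `substeps` explicit integration steps.
    """
    # Assumption: the (stand-in) loss gradient seeds the initial velocity.
    vel = [-g for g in grad]
    pos = list(positions)
    for _ in range(substeps):
        pos = [p + v * dt for p, v in zip(pos, vel)]
        # Contact handling: if the objects interpenetrate, push them
        # apart symmetrically and zero the velocities (inelastic contact).
        gap = abs(pos[1] - pos[0]) - (radii[0] + radii[1])
        if gap < 0:
            push = -gap / 2.0
            sign = 1.0 if pos[1] >= pos[0] else -1.0
            pos[0] -= sign * push
            pos[1] += sign * push
            vel = [0.0, 0.0]
    return pos
```

With positions [0.0, 3.0], unit radii, and a gradient that pulls the objects together, plain gradient steps would leave them overlapping, whereas the simulated update stops them at the point of contact.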

Title: FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Authors: Haosen Yang, Adrian Bulat, Isma Hadji, Hai X. Pham, Xiatian Zhu, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18552
Pdf URL: https://arxiv.org/pdf/2411.18552
Copy Paste: [[2411.18552]] FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion(https://arxiv.org/abs/2411.18552)
Keywords: generation
Abstract: Diffusion models are proficient at generating high-quality images. They are, however, effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous methods suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM Diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Our method also avoids redundant inference tricks used to improve consistency, such as patch-based or progressive generation, leading to negligible latency overheads.
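The abstract says only that the FM module "leverages the Fourier domain" to keep global structure consistent; it does not give the formulation. One common way to do this kind of thing is to take low frequencies (global structure) from one signal and high frequencies (detail) from another. The sketch below shows that generic low-pass blend on a 1D signal with a naive DFT and a hard cutoff. It is an assumed illustration, not the paper's module, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]


def blend_frequencies(structure, detail, cutoff):
    """Keep frequencies <= cutoff from `structure`, the rest from `detail`.

    Both inputs must have the same length. The mask is symmetric in
    frequency, so real inputs give a real output.
    """
    n = len(structure)
    s_hat, d_hat = dft(structure), dft(detail)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # distance from the zero-frequency bin
        mixed.append(s_hat[k] if freq <= cutoff else d_hat[k])
    return idft(mixed)
```

For example, blending a constant signal of 5 (structure) with the alternating signal [1, -1, 1, -1] (detail) at cutoff 0 keeps the mean of the first and the oscillation of the second.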

Title: Hierarchical Information Flow for Generalized Efficient Image Restoration

Authors: Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18588
Pdf URL: https://arxiv.org/pdf/2411.18588
Copy Paste: [[2411.18588]] Hierarchical Information Flow for Generalized Efficient Image Restoration(https://arxiv.org/abs/2411.18588)
Keywords: restoration
Abstract: While vision transformers show promise in numerous image restoration (IR) tasks, efficiently generalizing and scaling up a model for multiple IR tasks remains a challenge. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates different types of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention and improves computational efficiency and memory utilization, thus preparing the model for effective scaling. Based on that, we explore model scaling to improve our method's capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.

Title: CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Authors: Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, Aleksander Holynski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18613
Pdf URL: https://arxiv.org/pdf/2411.18613
Copy Paste: [[2411.18613]] CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models(https://arxiv.org/abs/2411.18613)
Keywords: generation
Abstract: We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos. See our project page for results and interactive demos: \url{this http URL}.

Title: Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Authors: Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18616
Pdf URL: https://arxiv.org/pdf/2411.18616
Copy Paste: [[2411.18616]] Diffusion Self-Distillation for Zero-Shot Customized Image Generation(https://arxiv.org/abs/2411.18616)
Keywords: generation, generative
Abstract: Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.