2024-12-24

Title: A Decade of Deep Learning: A Survey on The Magnificent Seven

Authors: Dilshod Azizov, Muhammad Arslan Manzoor, Velibor Bojkovic, Yingxu Wang, Zixiao Wang, Zangir Iklassov, Kailong Zhao, Liang Li, Siwei Liu, Yu Zhong, Wei Liu, Shangsong Liang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16188
Pdf URL: https://arxiv.org/pdf/2412.16188
Copy Paste: [[2412.16188]] A Decade of Deep Learning: A Survey on The Magnificent Seven(https://arxiv.org/abs/2412.16188)
Keywords: generative
Abstract: Deep learning has fundamentally reshaped the landscape of artificial intelligence over the past decade, enabling remarkable achievements across diverse domains. At the heart of these developments lie multi-layered neural network architectures that excel at automatic feature extraction, leading to significant improvements in machine learning tasks. To demystify these advances and offer accessible guidance, we present a comprehensive overview of the most influential deep learning algorithms selected through a broad-based survey of the field. Our discussion centers on pivotal architectures, including Residual Networks, Transformers, Generative Adversarial Networks, Variational Autoencoders, Graph Neural Networks, Contrastive Language-Image Pre-training, and Diffusion models. We detail their historical context, highlight their mathematical foundations and algorithmic principles, and examine subsequent variants, extensions, and practical considerations such as training methodologies, normalization techniques, and learning rate schedules. Beyond historical and technical insights, we also address their applications, challenges, and potential research directions. This survey aims to serve as a practical manual for both newcomers seeking an entry point into cutting-edge deep learning methods and experienced researchers transitioning into this rapidly evolving domain.

Title: AgroXAI: Explainable AI-Driven Crop Recommendation System for Agriculture 4.0

Authors: Ozlem Turgut, Ibrahim Kok, Suat Ozdemir
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.16196
Pdf URL: https://arxiv.org/pdf/2412.16196
Copy Paste: [[2412.16196]] AgroXAI: Explainable AI-Driven Crop Recommendation System for Agriculture 4.0(https://arxiv.org/abs/2412.16196)
Keywords: generation
Abstract: Today, crop diversification in agriculture is a critical issue to meet the increasing demand for food and improve food safety and quality. This issue is considered to be the most important challenge for the next generation of agriculture due to the diminishing natural resources, the limited arable land, and unpredictable climatic conditions caused by climate change. In this paper, we employ emerging technologies such as the Internet of Things (IoT), machine learning (ML), and explainable artificial intelligence (XAI) to improve operational efficiency and productivity in the agricultural sector. Specifically, we propose an edge computing-based explainable crop recommendation system, AgroXAI, which suggests suitable crops for a region based on weather and soil conditions. In this system, we provide local and global explanations of ML model decisions with methods such as ELI5, LIME, and SHAP, which we integrate into ML models. More importantly, we provide regional alternative crop recommendations with the counterfactual explainability method. In this way, we envision that our proposed AgroXAI system will be a platform that provides regional crop diversity in next-generation agriculture.

Title: Synthetic Time Series Data Generation for Healthcare Applications: A PCG Case Study

Authors: Ainaz Jamshidi, Muhammad Arif, Sabir Ali Kalhoro, Alexander Gelbukh
Subjects: cs.LG, cs.CE, eess.SP
Abstract URL: https://arxiv.org/abs/2412.16207
Pdf URL: https://arxiv.org/pdf/2412.16207
Copy Paste: [[2412.16207]] Synthetic Time Series Data Generation for Healthcare Applications: A PCG Case Study(https://arxiv.org/abs/2412.16207)
Keywords: generation, generative
Abstract: The generation of high-quality medical time series data is essential for advancing healthcare diagnostics and safeguarding patient privacy. Specifically, synthesizing realistic phonocardiogram (PCG) signals offers significant potential as a cost-effective and efficient tool for cardiac disease pre-screening. Despite its potential, the synthesis of PCG signals for this specific application has received limited attention in research. In this study, we employ and compare three state-of-the-art generative models from different categories - WaveNet, DoppelGANger, and DiffWave - to generate high-quality PCG data. We use data from the George B. Moody PhysioNet Challenge 2022. Our methods are evaluated using various metrics widely used in the previous literature in the domain of time series data generation, such as mean absolute error and maximum mean discrepancy. Our results demonstrate that the generated PCG data closely resembles the original datasets, indicating the effectiveness of our generative models in producing realistic synthetic PCG data. In our future work, we plan to incorporate this method into a data augmentation pipeline to synthesize abnormal PCG signals with heart murmurs, in order to address the current scarcity of abnormal data. We hope to improve the robustness and accuracy of diagnostic tools in cardiology, enhancing their effectiveness in detecting heart murmurs.
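Editor's sketch: the abstract above names mean absolute error and maximum mean discrepancy (MMD) as evaluation metrics. The snippet below shows one common way to compute both with NumPy. The function names, the RBF kernel, the bandwidth `sigma`, and the biased estimator are assumptions made for illustration; the listing does not say which variant the paper uses.

```python
import numpy as np

def mean_absolute_error(real, synth):
    """Average absolute difference between two equally shaped arrays."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    return float(np.mean(np.abs(real - synth)))

def rbf_kernel(x, y, sigma):
    """Pairwise RBF kernel values between the rows of x and the rows of y."""
    d2 = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of squared MMD between sample sets x and y (rows are samples)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )
```

A value of `mmd_squared` near zero suggests the two sample sets are hard to tell apart under the chosen kernel; the result depends strongly on `sigma`.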

Title: Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Authors: Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Subjects: cs.CV, cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2412.16211
Pdf URL: https://arxiv.org/pdf/2412.16211
Copy Paste: [[2412.16211]] Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation(https://arxiv.org/abs/2412.16211)
Keywords: generation, generative
Abstract: The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals how challenging the benchmark is, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
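Editor's sketch: the abstract above describes verifying each event with several vision-language models and applying unanimous voting. The snippet below illustrates that idea. The data layout (one list of boolean verdicts per event) and the definition of the completion rate as the fraction of events passing the vote are assumptions for illustration, not the benchmark's actual scoring code.

```python
from typing import List

def unanimous_vote(votes: List[bool]) -> bool:
    """An event counts as completed only if every verifier says it was."""
    return len(votes) > 0 and all(votes)

def story_completion_rate(event_votes: List[List[bool]]) -> float:
    """Fraction of a story's events that pass the unanimous vote."""
    if not event_votes:
        return 0.0
    completed = sum(unanimous_vote(v) for v in event_votes)
    return completed / len(event_votes)

# Hypothetical verdicts from two verifiers for a three-event story.
votes = [[True, True], [True, False], [True, True]]
rate = story_completion_rate(votes)
```

Requiring agreement from every verifier trades recall for precision: a single dissenting verdict marks the event as not completed.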

Title: ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping

Authors: Youxin Pang, Ruizhi Shao, Jiajun Zhang, Hanzhang Tu, Yun Liu, Boyao Zhou, Hongwen Zhang, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16212
Pdf URL: https://arxiv.org/pdf/2412.16212
Copy Paste: [[2412.16212]] ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping(https://arxiv.org/abs/2412.16212)
Keywords: generation
Abstract: In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. The core idea of ManiVideo is the construction of a multi-layer occlusion (MLO) representation that learns 3D occlusion relationships from occlusion-free normal maps and occlusion confidence maps. By embedding the MLO structure into the UNet in two forms, the model enhances the 3D consistency of dexterous hand-object manipulation. To further achieve the generalizable grasping of objects, we integrate Objaverse, a large-scale 3D object dataset, to address the scarcity of video data, thereby facilitating the learning of extensive object consistency. Additionally, we propose an innovative training strategy that effectively integrates multiple datasets, supporting downstream tasks such as human-centric hand-object manipulation video generation. Through extensive experiments, we demonstrate that our approach not only achieves video generation with plausible hand-object interaction and generalizable objects, but also outperforms existing SOTA methods.

Title: AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models

Authors: Tommy Nguyen, Mehmet Ergezer, Christian Green
Subjects: cs.CV, cs.AI, cs.CY, cs.GR, eess.IV
Abstract URL: https://arxiv.org/abs/2412.16213
Pdf URL: https://arxiv.org/pdf/2412.16213
Copy Paste: [[2412.16213]] AdvIRL: Reinforcement Learning-Based Adversarial Attacks on 3D NeRF Models(https://arxiv.org/abs/2412.16213)
Keywords: generative
Abstract: The increasing deployment of AI models in critical applications has exposed them to significant risks from adversarial attacks. While adversarial vulnerabilities in 2D vision models have been extensively studied, the threat landscape for 3D generative models, such as Neural Radiance Fields (NeRF), remains underexplored. This work introduces AdvIRL, a novel framework for crafting adversarial NeRF models using Instant Neural Graphics Primitives (Instant-NGP) and Reinforcement Learning. Unlike prior methods, AdvIRL generates adversarial noise that remains robust under diverse 3D transformations, including rotations and scaling, enabling effective black-box attacks in real-world scenarios. Our approach is validated across a wide range of scenes, from small objects (e.g., bananas) to large environments (e.g., lighthouses). Notably, targeted attacks achieved high-confidence misclassifications, such as labeling a banana as a slug and a truck as a cannon, demonstrating the practical risks posed by adversarial NeRFs. Beyond attacking, AdvIRL-generated adversarial models can serve as adversarial training data to enhance the robustness of vision systems. The implementation of AdvIRL is publicly available at this https URL, ensuring reproducibility and facilitating future research.

Title: GALOT: Generative Active Learning via Optimizable Zero-shot Text-to-image Generation

Authors: Hanbin Hong, Shenao Yan, Shuya Feng, Yan Yan, Yuan Hong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.16227
Pdf URL: https://arxiv.org/pdf/2412.16227
Copy Paste: [[2412.16227]] GALOT: Generative Active Learning via Optimizable Zero-shot Text-to-image Generation(https://arxiv.org/abs/2412.16227)
Keywords: generation, generative
Abstract: Active Learning (AL) represents a crucial methodology within machine learning, emphasizing the identification and utilization of the most informative samples for efficient model training. However, a significant challenge of AL is its dependence on the limited labeled data samples and data distribution, resulting in limited performance. To address this limitation, this paper integrates the zero-shot text-to-image (T2I) synthesis and active learning by designing a novel framework that can efficiently train a machine learning (ML) model solely using the text description. Specifically, we leverage the AL criteria to optimize the text inputs for generating more informative and diverse data samples, which are annotated by the pseudo-label crafted from text and then serve as a synthetic dataset for active learning. This approach reduces the cost of data collection and annotation while increasing the efficiency of model training by providing informative training samples, enabling a novel end-to-end ML task from text description to vision models. Through comprehensive evaluations, our framework demonstrates consistent and significant improvements over traditional AL methods.

Title: Training-free Heterogeneous Graph Condensation via Data Selection

Authors: Yuxuan Liang, Wentao Zhang, Xinyi Gao, Ling Yang, Chong Chen, Hongzhi Yin, Yunhai Tong, Bin Cui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.16250
Pdf URL: https://arxiv.org/pdf/2412.16250
Copy Paste: [[2412.16250]] Training-free Heterogeneous Graph Condensation via Data Selection(https://arxiv.org/abs/2412.16250)
Keywords: generation
Abstract: Efficient training of large-scale heterogeneous graphs is of paramount importance in real-world applications. However, existing approaches typically explore simplified models to mitigate resource and time overhead, neglecting the crucial aspect of simplifying large-scale heterogeneous graphs from a data-centric perspective. Addressing this gap, HGCond introduces graph condensation (GC) in heterogeneous graphs and generates a small condensed graph for efficient model training. Despite its efficacy in graph generation, HGCond encounters two significant limitations. The first is low effectiveness: HGCond excessively relies on the simplest relay model for the condensation procedure, which restricts the ability to exert powerful Heterogeneous Graph Neural Networks (HGNNs) with flexible condensation ratio and limits the generalization ability. The second is low efficiency: HGCond follows the existing GC methods designed for homogeneous graphs and leverages the sophisticated optimization paradigm, resulting in a time-consuming condensing procedure. In light of these challenges, we present the first Training-Free Heterogeneous Graph Condensation method, termed FreeHGC, facilitating both efficient and high-quality generation of heterogeneous condensed graphs. Specifically, we reformulate the heterogeneous graph condensation problem as a data selection issue, offering a new perspective for assessing and condensing representative nodes and edges in the heterogeneous graphs. By leveraging rich meta-paths, we introduce a new, high-quality heterogeneous data selection criterion to select target-type nodes. Furthermore, two training-free condensation strategies for heterogeneous graphs are designed to condense and synthesize nodes of other types effectively.

Title: Interactive Scene Authoring with Specialized Generative Primitives

Authors: Clément Jambon (1), Changwoon Choi (2), Dongsu Zhang (2), Olga Sorkine-Hornung (1), Young Min Kim (2) ((1) ETH Zurich, (2) Seoul National University)
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.16253
Pdf URL: https://arxiv.org/pdf/2412.16253
Copy Paste: [[2412.16253]] Interactive Scene Authoring with Specialized Generative Primitives(https://arxiv.org/abs/2412.16253)
Keywords: generation, generative
Abstract: Generating high-quality 3D digital assets often requires expert knowledge of complex design tools. We introduce Specialized Generative Primitives, a generative framework that allows non-expert users to author high-quality 3D scenes in a seamless, lightweight, and controllable manner. Each primitive is an efficient generative model that captures the distribution of a single exemplar from the real world. With our framework, users capture a video of an environment, which we turn into a high-quality and explicit appearance model thanks to 3D Gaussian Splatting. Users then select regions of interest guided by semantically-aware features. To create a generative primitive, we adapt Generative Cellular Automata to single-exemplar training and controllable generation. We decouple the generative task from the appearance model by operating on sparse voxels and we recover a high-quality output with a subsequent sparse patch consistency step. Each primitive can be trained within 10 minutes and used to author new scenes interactively in a fully compositional manner. We showcase interactive sessions where various primitives are extracted from real-world scenes and controlled to create 3D assets and scenes in a few minutes. We also demonstrate additional capabilities of our primitives: handling various 3D representations to control generation, transferring appearances, and editing geometries.

Title: PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models

Authors: Zhuomeng Zhang, Fangqi Li, Chong Di, Shilin Wang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.16257
Pdf URL: https://arxiv.org/pdf/2412.16257
Copy Paste: [[2412.16257]] PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models(https://arxiv.org/abs/2412.16257)
Keywords: generative
Abstract: Current text-to-image (T2I) diffusion models can produce high-quality images, and malicious users who are authorized to use the model only for benign purposes might modify their models to generate images that result in harmful social impacts. Therefore, it is essential to verify the integrity of T2I diffusion models, especially when they are deployed as black-box services. To this end, considering the randomness within the outputs of generative models and the high costs in interacting with them, we capture modifications to the model through the differences in the distributions of the features of generated images. We propose a novel prompt selection algorithm based on a learning automaton for efficient and accurate integrity verification of T2I diffusion models. Extensive experiments demonstrate the effectiveness, stability, accuracy, and generalization of our algorithm against existing integrity violations compared with baselines. To the best of our knowledge, this paper is the first work addressing the integrity verification of T2I diffusion models, which paves the way to copyright discussions and protections for artificial intelligence applications in practice.

Title: HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases

Authors: Meng-Chieh Lee, Qi Zhu, Costas Mavromatis, Zhen Han, Soji Adeshina, Vassilis N. Ioannidis, Huzefa Rangwala, Christos Faloutsos
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.16311
Pdf URL: https://arxiv.org/pdf/2412.16311
Copy Paste: [[2412.16311]] HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases(https://arxiv.org/abs/2412.16311)
Keywords: generation
Abstract: Given a semi-structured knowledge base (SKB), where text documents are interconnected by relations, how can we effectively retrieve relevant information to answer user questions? Retrieval-Augmented Generation (RAG) retrieves documents to assist large language models (LLMs) in question answering, while Graph RAG (GRAG) uses structured knowledge bases as its knowledge source. However, many questions require both textual and relational information from SKB - referred to as "hybrid" questions - which complicates the retrieval process and underscores the need for a hybrid retrieval method that leverages both kinds of information. In this paper, through our empirical analysis, we identify key insights that show why existing methods may struggle with hybrid question answering (HQA) over SKB. Based on these insights, we propose HybGRAG for HQA consisting of a retriever bank and a critic module, with the following advantages: (1) Agentic, it automatically refines the output by incorporating feedback from the critic module, (2) Adaptive, it solves hybrid questions requiring both textual and relational information with the retriever bank, (3) Interpretable, it justifies decision making with an intuitive refinement path, and (4) Effective, it surpasses all baselines on HQA benchmarks. In experiments on the STaRK benchmark, HybGRAG achieves significant performance gains, with an average relative improvement in Hit@1 of 51%.
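Editor's sketch: the abstract above reports a relative improvement in Hit@1. The snippet below shows how Hit@k and a relative improvement are commonly computed. The function names, document ids, and rankings are hypothetical toy data chosen for illustration; they are not results from the paper or the STaRK benchmark.

```python
def hit_at_k(ranked_ids, relevant_ids, k=1):
    """1.0 if any of the top-k retrieved ids is relevant, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def mean_hit_at_k(all_ranked, all_relevant, k=1):
    """Average Hit@k over a set of queries."""
    hits = [hit_at_k(r, rel, k) for r, rel in zip(all_ranked, all_relevant)]
    return sum(hits) / len(hits)

def relative_improvement(new, baseline):
    """Gain of `new` over `baseline`, expressed as a fraction of `baseline`."""
    return (new - baseline) / baseline

# Hypothetical toy data: four queries, one relevant document each.
relevant = [{"d1"}, {"d7"}, {"d3"}, {"d9"}]
baseline = [["d1", "d2"], ["d5", "d7"], ["d3", "d4"], ["d2", "d9"]]
improved = [["d1", "d2"], ["d7", "d5"], ["d3", "d4"], ["d2", "d9"]]

base_score = mean_hit_at_k(baseline, relevant, k=1)
new_score = mean_hit_at_k(improved, relevant, k=1)
gain = relative_improvement(new_score, base_score)
```

Note that a relative improvement is measured against the baseline score, so the same absolute gain looks larger when the baseline is low.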

Title: When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization

Authors: Vivek Ramanujan, Kushal Tirumala, Armen Aghajanyan, Luke Zettlemoyer, Ali Farhadi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.16326
Pdf URL: https://arxiv.org/pdf/2412.16326
Copy Paste: [[2412.16326]] When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization(https://arxiv.org/abs/2412.16326)
Keywords: generation, generative
Abstract: Current image generation methods, such as latent diffusion and discrete token-based generation, depend on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. Most work focuses on maximizing stage 1 performance independent of stage 2, assuming better reconstruction always leads to better generation. However, we show this is not strictly true. Smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, showing a fundamental trade-off between compression and generation modeling capacity. To better optimize this trade-off, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization makes stage 1 reconstruction performance worse, but makes stage 2 generation performance better by making the tokens easier to model: we are able to improve compute efficiency 2-3× over baseline and match state-of-the-art discrete autoregressive ImageNet generation (2.18 FID) with less than half the tokens per image (256 vs. 576) and a fourth of the total model parameters (775M vs. 3.1B) of the previous SOTA (LlamaGen).
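Editor's sketch: the "less than half the tokens" and "a fourth of the parameters" comparisons in the abstract above can be checked with simple arithmetic on the figures it quotes. The variable names are made up for illustration.

```python
# Figures quoted in the abstract: CRT vs. the previous SOTA (LlamaGen).
tokens_crt, tokens_prev = 256, 576
params_crt, params_prev = 775e6, 3.1e9

token_ratio = tokens_crt / tokens_prev   # about 0.444, i.e. less than half
param_ratio = params_crt / params_prev   # 0.25, i.e. one fourth

print(f"tokens per image: {token_ratio:.3f} of the previous SOTA")
print(f"model parameters: {param_ratio:.2f} of the previous SOTA")
```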

Title: Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study

Authors: Daniel Smolyak, Arshana Welivita, Margrét V. Bjarnadóttir, Ritu Agarwal
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2412.16335
Pdf URL: https://arxiv.org/pdf/2412.16335
Copy Paste: [[2412.16335]] Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study(https://arxiv.org/abs/2412.16335)
Keywords: generation
Abstract: Objective. Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution to this problem is to generate synthetic data to mitigate potential adverse effects of non-representative data sets. Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline where the synthetic data is generated separately for each demographic group. We conduct our study using MIMIC-IV and Framingham "Offspring and OMNI-1 Cohorts" datasets. We prompt GPT4-Turbo to create group-specific data, providing training examples and the dataset context. An exploratory analysis is conducted to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmentation of a training dataset in a downstream machine learning task, focusing specifically on model performance metrics across groups. Results. The performance of GPT4-Turbo augmentation is generally, but not always, superior. In the majority of experiments our method outperforms standard modeling baselines; however, prompting GPT4-Turbo to produce data specific to a group provides little to no additional benefit over a prompt that does not specify the group. Conclusion. We developed a method for using LLMs out-of-the-box to synthesize group-specific data to address imbalances in demographic representation in medical datasets. As another "tool in the toolbox", this method can improve model fairness and thus health equity. More research is needed to understand the conditions under which LLM-generated synthetic data is useful for non-representative medical data sets.

Title: A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation

Authors: Shijie Zhou, Ruiyi Zhang, Yufan Zhou, Changyou Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.16364
Pdf URL: https://arxiv.org/pdf/2412.16364
Copy Paste: [[2412.16364]] A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation(https://arxiv.org/abs/2412.16364)
Keywords: generation
Abstract: Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way for generating instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2, to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, it involves detailed image captions from human annotators, followed by the use of these annotations in tailored text prompts for GPT-4o to curate a dataset. It also implements several mechanisms to filter out low-quality data, and the resulting dataset comprises 424k high-quality pairs of instructions. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct data.

Title: Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints

Authors: Charles Luo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16443
Pdf URL: https://arxiv.org/pdf/2412.16443
Copy Paste: [[2412.16443]] Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints(https://arxiv.org/abs/2412.16443)
Keywords: generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their scalability raises a critical question: Have we reached the scaling ceiling? This paper addresses this pivotal question by developing a unified theoretical framework that integrates mathematical and statistical insights to explain the scaling dynamics of LLMs. We present: 1. Central Limit Theorem (CLT) for Hidden Representations: We show that noise in hidden representations scales inversely with context size, explaining stabilization effects and the limits of context length improvements. 2. Bias-Variance Decomposition: We decompose next-token prediction loss into irreducible entropy, capacity-driven bias, and finite sample variance, revealing trade-offs where scaling yields diminishing returns. 3. Emergent SNR Thresholds: By defining signal-to-noise ratio (SNR), we quantify how capabilities emerge abruptly once SNR surpasses a threshold, offering insights into when scaling becomes less effective. Through this framework, we conclude that while LLMs have not reached an absolute scaling ceiling, practical constraints are increasingly prominent: diminishing returns, resource inefficiencies, and data limitations. Future progress will require a shift from brute-force scaling to innovations in architecture, data quality, and training paradigms. This work provides a roadmap for guiding the efficient development of next-generation LLMs and advancing the field beyond traditional scaling strategies. Keywords: Large Language Models; Scaling Ceiling; Central Limit Theorem; Bias-Variance Trade-Off; Signal-to-Noise Ratio; Emergent Capabilities
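The first claim in the abstract above (noise in hidden representations shrinking as context size grows) can be illustrated with a toy simulation. This is only a sketch under a strong assumption that is not taken from the paper: the "hidden representation" is modeled as the plain average of independent, identically distributed per-token noise terms, so the Central Limit Theorem predicts a noise variance of roughly sigma^2 / n for context size n. The function name `mean_noise_variance` and all parameters are hypothetical.

```python
import random
import statistics


def mean_noise_variance(context_size, trials=2000, sigma=1.0, seed=0):
    """Estimate the variance of an average of `context_size` i.i.d. noise terms.

    Toy stand-in (an assumption, not the paper's model): each token adds
    independent Gaussian noise, and the representation is their mean.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        samples = [rng.gauss(0.0, sigma) for _ in range(context_size)]
        means.append(statistics.fmean(samples))
    return statistics.pvariance(means)


# Variance should fall roughly as 1 / context_size, so variance * n stays near sigma^2.
variances = {n: mean_noise_variance(n) for n in (16, 64, 256)}
for n, v in variances.items():
    print(f"context={n:4d}  variance={v:.5f}  variance*n={v * n:.3f}")
```

Under this assumption the product `variance * n` stays close to 1 across context sizes, which is the inverse scaling the abstract describes; real hidden states are neither independent nor identically distributed, so this should be read as intuition rather than a reproduction of the paper's result.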

Title: Rethinking Model Redundancy for Low-light Image Enhancement

Authors: Tong Li, Lizhi Wang, Hansen Feng, Lin Zhu, Wanxuan Lu, Hua Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16459
Pdf URL: https://arxiv.org/pdf/2412.16459
Copy Paste: [[2412.16459]] Rethinking Model Redundancy for Low-light Image Enhancement(https://arxiv.org/abs/2412.16459)
Keywords: generation
Abstract: Low-light image enhancement (LLIE) is a fundamental task in computational photography, aiming to improve illumination, reduce noise, and enhance the image quality of low-light images. While recent advancements primarily focus on customizing complex neural network models, we have observed significant redundancy in these models, limiting further performance improvement. In this paper, we investigate and rethink the model redundancy for LLIE, identifying parameter harmfulness and parameter uselessness. Inspired by the rethinking, we propose two innovative techniques to mitigate model redundancy while improving the LLIE performance: Attention Dynamic Reallocation (ADR) and Parameter Orthogonal Generation (POG). ADR dynamically reallocates appropriate attention based on original attention, thereby mitigating parameter harmfulness. POG learns orthogonal basis embeddings of parameters and prevents degradation to static parameters, thereby mitigating parameter uselessness. Experiments validate the effectiveness of our techniques. We will release the code to the public.

Title: Enhancing Nighttime Vehicle Detection with Day-to-Night Style Transfer and Labeling-Free Augmentation

Authors: Yunxiang Yang, Hao Zhen, Yongcan Huang, Jidong J. Yang
Subjects: cs.CV, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.16478
Pdf URL: https://arxiv.org/pdf/2412.16478
Copy Paste: [[2412.16478]] Enhancing Nighttime Vehicle Detection with Day-to-Night Style Transfer and Labeling-Free Augmentation(https://arxiv.org/abs/2412.16478)
Keywords: generative
Abstract: Existing deep learning-based object detection models perform well under daytime conditions but face significant challenges at night, primarily because they are predominantly trained on daytime images. Additionally, training with nighttime images presents another challenge: even human annotators struggle to accurately label objects in low-light conditions. This issue is particularly pronounced in transportation applications, such as detecting vehicles and other objects of interest on rural roads at night, where street lighting is often absent, and headlights may introduce undesirable glare. This study addresses these challenges by introducing a novel framework for labeling-free data augmentation, leveraging CARLA-generated synthetic data for day-to-night image style transfer. Specifically, the framework incorporates the Efficient Attention Generative Adversarial Network for realistic day-to-night style transfer and uses CARLA-generated synthetic nighttime images to help the model learn vehicle headlight effects. To evaluate the efficacy of the proposed framework, we fine-tuned the YOLO11 model with an augmented dataset specifically curated for rural nighttime environments, achieving significant improvements in nighttime vehicle detection. This novel approach is simple yet effective, offering a scalable solution to enhance AI-based detection systems in low-visibility environments and extend the applicability of object detection models to broader real-world contexts.

Title: Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance

Authors: Beiyuan Zhang, Yue Ma, Chunlei Fu, Xinyang Song, Zhenan Sun, Ziqiang Li
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.16495
Pdf URL: https://arxiv.org/pdf/2412.16495
Copy Paste: [[2412.16495]] Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance(https://arxiv.org/abs/2412.16495)
Keywords: generation
Abstract: Text-editable and pose-controllable character video generation is a challenging but prevailing topic with practical applications. However, existing approaches mainly focus on single-object video generation with pose guidance, ignoring the realistic situation in which multiple characters appear concurrently in a scene. To tackle this, we propose a novel tuning-free multi-character video generation framework based on separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character, and then obtain a separate prompt for each character with LLMs for precise text guidance. Moreover, spatial-aligned cross attention and a multi-branch control module are proposed to generate fine-grained, controllable multi-character video. The visualized results of the generated videos demonstrate the precise controllability of our method for multi-character generation. We also verify the generality of our method by applying it to various personalized T2I models. Moreover, the quantitative results show that our approach achieves superior performance compared with previous works.

Title: Autonomous Crack Detection using Deep Learning on Synthetic Thermogram Datasets

Authors: Chinmay Makarand Pimpalkhare, D. N. Pawaskar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16499
Pdf URL: https://arxiv.org/pdf/2412.16499
Copy Paste: [[2412.16499]] Autonomous Crack Detection using Deep Learning on Synthetic Thermogram Datasets(https://arxiv.org/abs/2412.16499)
Keywords: generation
Abstract: In many scientific problems, data must be generated by running an extensive number of experiments. Further, some tasks require constant human intervention. We consider the problem of crack detection in steel plates. This is generally done by humans looking at an image of the thermogram generated by heating the plate and classifying whether it is cracked or not. There has been a rise in the use of Artificial Intelligence (AI) based methods which try to remove the requirement of a human from this loop by using algorithms such as Convolutional Neural Networks (CNNs) as a proxy for the detection process. The issue is that CNNs and other vision models are generally very data-hungry and require huge amounts of data before they can start performing well. This data generation process is not easy and requires innovation in the mechanical and electronic design of the experimental setup. It further requires a massive amount of time and energy, which is difficult in resource-constrained scenarios. We try to solve exactly this problem by creating a synthetic data generation pipeline based on Finite Element Simulations. We employ data augmentation techniques on this data to further increase the volume and diversity of the data generated. The working of this concept is shown by performing inference on fine-tuned vision models, and we have also validated the results by checking whether our approach translates to realistic experimental data. We show the conditions under which this translation is successful and how it can be achieved.

Title: TrojFlow: Flow Models are Natural Targets for Trojan Attacks

Authors: Zhengyang Qi, Xiaohua Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16512
Pdf URL: https://arxiv.org/pdf/2412.16512
Copy Paste: [[2412.16512]] TrojFlow: Flow Models are Natural Targets for Trojan Attacks(https://arxiv.org/abs/2412.16512)
Keywords: generative
Abstract: Flow-based generative models (FMs) have rapidly advanced as a method for mapping noise to data; their efficient training and sampling process makes them widely applicable in various fields. FMs can be viewed as a variant of diffusion models (DMs). At the same time, previous studies have shown that DMs are vulnerable to Trojan/Backdoor attacks, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. We found that Trojan attacks on generative models are essentially equivalent to image transfer tasks from the backdoor distribution to the target distribution. The unique ability of FMs to fit any two arbitrary distributions significantly simplifies the training and sampling setups for attacking FMs, making them inherently natural targets for backdoor attacks. In this paper, we propose TrojFlow, exploring the vulnerabilities of FMs through Trojan attacks. In particular, we consider various attack settings and their combinations and thoroughly explore whether existing defense methods for DMs can effectively defend against our proposed attack scenarios. We evaluate TrojFlow on the CIFAR-10 and CelebA datasets; our experiments show that our method can compromise FMs with high utility and specificity, and can easily break through existing defense mechanisms.

Title: Diffusion Prior Interpolation for Flexibility Real-World Face Super-Resolution

Authors: Jiarui Yang, Tao Dai, Yufei Zhu, Naiqi Li, Jinmin Li, Shutao Xia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16552
Pdf URL: https://arxiv.org/pdf/2412.16552
Copy Paste: [[2412.16552]] Diffusion Prior Interpolation for Flexibility Real-World Face Super-Resolution(https://arxiv.org/abs/2412.16552)
Keywords: super-resolution, generative
Abstract: Diffusion models represent the state-of-the-art in generative modeling. Due to their high training costs, many works leverage pre-trained diffusion models' powerful representations for downstream tasks, such as face super-resolution (FSR), through fine-tuning or prior-based methods. However, relying solely on priors without supervised training makes it challenging to meet the pixel-level accuracy requirements of discriminative tasks. Although prior-based methods can achieve high fidelity and high-quality results, ensuring consistency remains a significant challenge. In this paper, we propose a masking strategy with strong and weak constraints and iterative refinement for real-world FSR, termed Diffusion Prior Interpolation (DPI). We introduce conditions and constraints on consistency by masking different sampling stages based on the structural characteristics of the face. Furthermore, we propose a condition Corrector (CRT) to establish a reciprocal posterior sampling process, enhancing FSR performance by mutual refinement of conditions and samples. DPI can balance consistency and diversity and can be seamlessly integrated into pre-trained models. In extensive experiments conducted on synthetic and real datasets, along with consistency validation in face recognition, DPI demonstrates superiority over SOTA FSR methods. The code is available at this https URL.

Title: SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

Authors: Xiangyue Zhang, Jiangfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, Zhigang Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16563
Pdf URL: https://arxiv.org/pdf/2412.16563
Copy Paste: [[2412.16563]] SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis(https://arxiv.org/abs/2412.16563)
Keywords: generation
Abstract: Good co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn general motions and sparse motions, and then adaptively fuse them. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.

Title: Learning for Cross-Layer Resource Allocation in MEC-Aided Cell-Free Networks

Authors: Chong Zheng, Shiwen He, Yongming Huang, Tony Q. S. Quek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16565
Pdf URL: https://arxiv.org/pdf/2412.16565
Copy Paste: [[2412.16565]] Learning for Cross-Layer Resource Allocation in MEC-Aided Cell-Free Networks(https://arxiv.org/abs/2412.16565)
Keywords: generation
Abstract: Cross-layer resource allocation over mobile edge computing (MEC)-aided cell-free networks can sufficiently exploit the transmitting and computing resources to promote the data rate. However, the technical bottlenecks of traditional methods pose significant challenges to cross-layer optimization. In this paper, joint subcarrier allocation and beamforming optimization are investigated for the MEC-aided cell-free network from the perspective of deep learning to maximize the weighted sum rate. Specifically, we convert the underlying problem into a joint multi-task optimization problem and then propose a centralized multi-task self-supervised learning algorithm to solve the problem so as to avoid costly manual labeling. Therein, two novel and general loss functions, i.e., negative fraction linear loss and exponential linear loss, whose advantages in robustness and target domain have been proved and discussed, are designed to enable self-supervised learning. Moreover, we further design a MEC-enabled distributed multi-task self-supervised learning (DMTSSL) algorithm, with low complexity and high scalability, to address the challenge of dimensional disaster. Finally, we develop the distance-aware transfer learning algorithm based on the DMTSSL algorithm to handle the dynamic scenario with negligible computation cost. Simulation results under the 3rd Generation Partnership Project (3GPP) 38.901 urban-macrocell scenario demonstrate the superiority of the proposed algorithms over the baseline algorithms.

Title: REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation

Authors: Xizhe Xue, Guoting Wei, Hao Chen, Haokui Zhang, Feng Lin, Chunhua Shen, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16583
Pdf URL: https://arxiv.org/pdf/2412.16583
Copy Paste: [[2412.16583]] REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation(https://arxiv.org/abs/2412.16583)
Keywords: generation, generative
Abstract: The rapid evolution of Vision Language Models (VLMs) has catalyzed significant advancements in artificial intelligence, expanding research across various disciplines, including Earth Observation (EO). While VLMs have enhanced image understanding and data processing within EO, their applications have predominantly focused on image content description. This limited focus overlooks their potential in geographic and scientific regression tasks, which are essential for diverse EO applications. To bridge this gap, this paper introduces a novel benchmark dataset, called REO-Instruct, to unify regression and generation tasks specifically for the EO domain. Comprising 1.6 million multimodal EO imagery and language pairs, this dataset is designed to support both biomass regression and image content interpretation tasks. Leveraging this dataset, we develop REO-VLM, a groundbreaking model that seamlessly integrates regression capabilities with traditional generative functions. By utilizing language-driven reasoning to incorporate scientific domain knowledge, REO-VLM goes beyond solely relying on EO imagery, enabling comprehensive interpretation of complex scientific attributes from EO data. This approach establishes new performance benchmarks and significantly enhances the capabilities of environmental monitoring and resource management.

Title: Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances

Authors: Javier Montalvo, Roberto Alcover-Couso, Pablo Carballeira, Álvaro García-Martín, Juan C. SanMiguel, Marcos Escudero-Viñolo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16592
Pdf URL: https://arxiv.org/pdf/2412.16592
Copy Paste: [[2412.16592]] Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances(https://arxiv.org/abs/2412.16592)
Keywords: generation
Abstract: This paper introduces a novel synthetic dataset that captures urban scenes under a variety of weather conditions, providing pixel-perfect, ground-truth-aligned images to facilitate effective feature alignment across domains. Additionally, we propose a method for domain adaptation and generalization that takes advantage of the multiple versions of each scene, enforcing feature consistency across different weather scenarios. Our experimental results demonstrate the impact of our dataset in improving performance across several alignment metrics, addressing key challenges in domain adaptation and generalization for segmentation tasks. This research also explores critical aspects of synthetic data generation, such as optimizing the balance between the volume and variability of generated images to enhance segmentation performance. Ultimately, this work sets forth a new paradigm for synthetic data generation and domain adaptation.

Title: OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities

Authors: Suyoung Lee, Jaeyoung Chung, Kihoon Kim, Jaeyoo Huh, Gunhee Lee, Minsoo Lee, Kyoung Mu Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16604
Pdf URL: https://arxiv.org/pdf/2412.16604
Copy Paste: [[2412.16604]] OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities(https://arxiv.org/abs/2412.16604)
Keywords: generation
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have gained significant popularity due to their ability to generate scenes immediately without needing per-scene optimization. Although omnidirectional images are getting more popular since they reduce the computation for image stitching to composite a holistic scene, existing feed-forward models are only designed for perspective images. The unique optical properties of omnidirectional images make it difficult for feature encoders to correctly understand the context of the image and make the Gaussian non-uniform in space, which hinders the image quality synthesized from novel views. We propose OmniSplat, a pioneering work for fast feed-forward 3DGS generation from a few omnidirectional images. We introduce Yin-Yang grid and decompose images based on it to reduce the domain gap between omnidirectional and perspective images. The Yin-Yang grid can use the existing CNN structure as it is, but its quasi-uniform characteristic allows the decomposed image to be similar to a perspective image, so it can exploit the strong prior knowledge of the learned feed-forward network. OmniSplat demonstrates higher reconstruction accuracy than existing feed-forward networks trained on perspective images. Furthermore, we enhance the segmentation consistency between omnidirectional images by leveraging attention from the encoder of OmniSplat, providing fast and clean 3DGS editing results.

Title: Generalizable Articulated Object Perception with Superpoints

Authors: Qiaojun Yu, Ce Hao, Xibin Yuan, Li Zhang, Liu Liu, Yukang Huo, Rohit Agarwal, Cewu Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16656
Pdf URL: https://arxiv.org/pdf/2412.16656
Copy Paste: [[2412.16656]] Generalizable Articulated Object Perception with Superpoints(https://arxiv.org/abs/2412.16656)
Keywords: generation
Abstract: Manipulating articulated objects with robotic arms is challenging due to the complex kinematic structure, which requires precise part segmentation for efficient manipulation. In this work, we introduce a novel superpoint-based perception method designed to improve part segmentation in 3D point clouds of articulated objects. We propose a learnable, part-aware superpoint generation technique that efficiently groups points based on their geometric and semantic similarities, resulting in clearer part boundaries. Furthermore, by leveraging the segmentation capabilities of the 2D foundation model SAM, we identify the centers of pixel regions and select corresponding superpoints as candidate query points. Integrating a query-based transformer decoder further enhances our method's ability to achieve precise part segmentation. Experimental results on the GAPartNet dataset show that our method outperforms existing state-of-the-art approaches in cross-category part segmentation, achieving AP50 scores of 77.9% for seen categories (4.4% improvement) and 39.3% for unseen categories (11.6% improvement), with superior results in 5 out of 9 part categories for seen objects and outperforming all previous methods across all part categories for unseen objects.

Title: Adversarial Attack Against Images Classification based on Generative Adversarial Networks

Authors: Yahe Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16662
Pdf URL: https://arxiv.org/pdf/2412.16662
Copy Paste: [[2412.16662]] Adversarial Attack Against Images Classification based on Generative Adversarial Networks(https://arxiv.org/abs/2412.16662)
Keywords: generation, generative
Abstract: Adversarial attacks on image classification systems have always been an important problem in the field of machine learning, and generative adversarial networks (GANs), as popular models in the field of image generation, have been widely used in various novel scenarios due to their powerful generative capabilities. However, with the popularity of generative adversarial networks, the misuse of fake image technology has raised a series of security problems, such as malicious tampering with other people's photos and videos, and invasion of personal privacy. Inspired by the generative adversarial networks, this work proposes a novel adversarial attack method, aiming to gain insight into the weaknesses of the image classification system and improve its anti-attack ability. Specifically, the generative adversarial networks are used to generate adversarial samples with small perturbations but enough to affect the decision-making of the classifier, and the adversarial samples are generated through the adversarial learning of the training generator and the classifier. From extensive experiment analysis, we evaluate the effectiveness of the method on a classical image classification dataset, and the results show that our model successfully deceives a variety of advanced classifiers while maintaining the naturalness of adversarial samples.

Title: Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer

Authors: Boyuan Li, Xihua Wang, Ruihua Song, Wenbing Huang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.16670
Pdf URL: https://arxiv.org/pdf/2412.16670
Copy Paste: [[2412.16670]] Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer(https://arxiv.org/abs/2412.16670)
Keywords: generation
Abstract: Multi-person interactive motion generation, a critical yet under-explored domain in computer character animation, poses significant challenges such as intricate modeling of inter-human interactions beyond individual motions and generating two motions with huge differences from one text condition. Current research often employs separate module branches for individual motions, leading to a loss of interaction information and increased computational demands. To address these challenges, we propose a novel, unified approach that models multi-person motions and their interactions within a single latent space. Our approach streamlines the process by treating interactive motions as an integrated data point, utilizing a Variational AutoEncoder (VAE) for compression into a unified latent space, and performing a diffusion process within this space, guided by the natural language conditions. Experimental results demonstrate our method's superiority over existing approaches in generation quality, performing text condition in particular when motions have significant asymmetry, and accelerating the generation efficiency while preserving high quality.

Title: VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation

Authors: Chi Zhang, Yuanzhi Liang, Xi Qiu, Fangqiu Yi, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16677
Pdf URL: https://arxiv.org/pdf/2412.16677
Copy Paste: [[2412.16677]] VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation(https://arxiv.org/abs/2412.16677)
Keywords: generation
Abstract: Generating high-quality videos from textual descriptions poses challenges in maintaining temporal coherence and control over subject motion. We propose VAST (Video As Storyboard from Text), a two-stage framework to address these challenges and enable high-quality video generation. In the first stage, StoryForge transforms textual descriptions into detailed storyboards, capturing human poses and object layouts to represent the structural essence of the scene. In the second stage, VisionForge generates videos from these storyboards, producing high-quality videos with smooth motion, temporal consistency, and spatial coherence. By decoupling text understanding from video generation, VAST enables precise control over subject dynamics and scene composition. Experiments on the VBench benchmark demonstrate that VAST outperforms existing methods in both visual quality and semantic expression, setting a new standard for dynamic and coherent video generation.

Title: TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models

Authors: Haocheng Huang, Jiaxin Chen, Jinyang Guo, Ruiyi Zhan, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16700
Pdf URL: https://arxiv.org/pdf/2412.16700
Copy Paste: [[2412.16700]] TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models(https://arxiv.org/abs/2412.16700)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. Nevertheless, they often require a large amount of memory and time during inference, due to the complex network architecture and the considerable number of timesteps needed for iterative diffusion. Recently, post-training quantization (PTQ) has proved a promising way to reduce the inference cost by quantizing floating-point operations to low-bit ones. However, most PTQ methods fail to handle the large variations in the distribution of activations across distinct channels and timesteps, as well as the inconsistency of inputs between quantization and inference on diffusion models, thus leaving much room for improvement. To address the above issues, we propose a novel method dubbed Timestep-Channel Adaptive Quantization for Diffusion Models (TCAQ-DM). Specifically, we develop a timestep-channel joint reparameterization (TCR) module to balance the activation range along both the timesteps and channels, facilitating the successive reconstruction procedure. Subsequently, we employ a dynamically adaptive quantization (DAQ) module that mitigates the quantization error by selecting an optimal quantizer for each post-Softmax layer according to its specific type of distribution. Moreover, we present a progressively aligned reconstruction (PAR) strategy to mitigate the bias caused by the input mismatch. Extensive experiments on various benchmarks and distinct diffusion models demonstrate that the proposed method substantially outperforms the state-of-the-art approaches in most cases, in particular yielding FID metrics comparable to the full-precision model on CIFAR-10 in the W6A6 setting, while still generating usable images in the W4A4 setting.
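
The motivation in the TCAQ-DM abstract (activation ranges that differ widely across channels) can be illustrated with a minimal sketch of plain uniform quantization. This is not the paper's TCR, DAQ, or PAR module; the function names, the two example channels, and the 4-bit setting are assumptions chosen only to show why a single shared range hurts a narrow channel.

```python
def quantize_uniform(values, num_bits, lo=None, hi=None):
    """Uniformly quantize floats to 2**num_bits levels on [lo, hi].

    Returns the dequantized values so the rounding error is easy to inspect.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [lo for _ in values]
    scale = (hi - lo) / (2 ** num_bits - 1)    # width of one quantization step
    dequantized = []
    for v in values:
        clipped = min(max(v, lo), hi)          # clamp to the calibrated range
        level = round((clipped - lo) / scale)  # integer code in [0, 2**num_bits - 1]
        dequantized.append(lo + level * scale)
    return dequantized


def max_abs_error(values, dequantized):
    """Largest absolute difference between original and dequantized values."""
    return max(abs(v - q) for v, q in zip(values, dequantized))


# Two hypothetical activation channels with very different ranges.
narrow = [-0.05, 0.01, 0.04]
wide = [-8.0, 2.0, 7.5]

# One shared range for both channels: the narrow channel loses its detail.
shared_lo, shared_hi = min(narrow + wide), max(narrow + wide)
shared_err = max_abs_error(narrow, quantize_uniform(narrow, 4, shared_lo, shared_hi))

# A separate range per channel keeps the error within half a step.
per_channel_err = max_abs_error(narrow, quantize_uniform(narrow, 4))

print(f"shared range error:      {shared_err:.4f}")
print(f"per-channel range error: {per_channel_err:.4f}")
```

With the shared range, every value in the narrow channel rounds to the same level, which is the kind of imbalance that range-balancing and per-distribution quantizers aim to reduce.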

Title: GANFusion: Feed-Forward Text-to-3D with Diffusion in GAN Space

Authors: Souhaib Attaiki, Paul Guerrero, Duygu Ceylan, Niloy J. Mitra, Maks Ovsjanikov
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.16717
Pdf URL: https://arxiv.org/pdf/2412.16717
Copy Paste: [[2412.16717]] GANFusion: Feed-Forward Text-to-3D with Diffusion in GAN Space(https://arxiv.org/abs/2412.16717)
Keywords: generative
Abstract: We train a feed-forward text-to-3D diffusion generator for human characters using only single-view 2D data for supervision. Existing 3D generative models cannot yet match the fidelity of image or video generative models. State-of-the-art 3D generators are trained with explicit 3D supervision and are thus limited by the volume and diversity of existing 3D data. Meanwhile, generators that can be trained with only 2D data as supervision typically produce coarser results, cannot be text-conditioned, or must revert to test-time optimization. We observe that GAN- and diffusion-based generators have complementary qualities: GANs can be trained efficiently with 2D supervision to produce high-quality 3D objects but are hard to condition on text. In contrast, denoising diffusion models can be conditioned efficiently but tend to be hard to train with only 2D supervision. We introduce GANFusion, which starts by generating unconditional triplane features for 3D data using a GAN architecture trained with only single-view 2D data. We then generate random samples from the GAN, caption them, and train a text-conditioned diffusion model that directly learns to sample from the space of good triplane features that can be decoded into 3D objects.

Title: ViM-Disparity: Bridging the Gap of Speed, Accuracy and Memory for Disparity Map Generation

Authors: Maheswar Bora, Tushar Anand, Saurabh Atreya, Aritra Mukherjee, Abhijit Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16745
Pdf URL: https://arxiv.org/pdf/2412.16745
Copy Paste: [[2412.16745]] ViM-Disparity: Bridging the Gap of Speed, Accuracy and Memory for Disparity Map Generation(https://arxiv.org/abs/2412.16745)
Keywords: generation
Abstract: In this work, we propose a Visual Mamba (ViM) based architecture to resolve the existing trade-off between real-time, accurate inference and low computation overhead in disparity map generation (DMG). Moreover, we propose a performance measure that jointly evaluates the inference speed, computation overhead, and accuracy of a DMG model.

Title: Solving Inverse Problems via Diffusion Optimal Control

Authors: Henry Li, Marcus Pereira
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.16748
Pdf URL: https://arxiv.org/pdf/2412.16748
Copy Paste: [[2412.16748]] Solving Inverse Problems via Diffusion Optimal Control(https://arxiv.org/abs/2412.16748)
Keywords: super-resolution, generative
Abstract: Existing approaches to diffusion-based inverse problem solvers frame the signal recovery task as a probabilistic sampling episode, where the solution is drawn from the desired posterior distribution. This framework suffers from several critical drawbacks, including the intractability of the conditional likelihood function, strict dependence on the score network approximation, and poor $\mathbf{x}_0$ prediction quality. We demonstrate that these limitations can be sidestepped by reframing the generative process as a discrete optimal control episode. We derive a diffusion-based optimal controller inspired by the iterative Linear Quadratic Regulator (iLQR) algorithm. This framework is fully general and able to handle any differentiable forward measurement operator, including super-resolution, inpainting, Gaussian deblurring, nonlinear deblurring, and even highly nonlinear neural classifiers. Furthermore, we show that the idealized posterior sampling equation can be recovered as a special case of our algorithm. We then evaluate our method against a selection of neural inverse problem solvers, and establish a new baseline in image reconstruction with inverse problems.

Title: Paraformer: Parameterization of Sub-grid Scale Processes Using Transformers

Authors: Shuochen Wang, Nishant Yadav, Auroop R. Ganguly
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.16763
Pdf URL: https://arxiv.org/pdf/2412.16763
Copy Paste: [[2412.16763]] Paraformer: Parameterization of Sub-grid Scale Processes Using Transformers(https://arxiv.org/abs/2412.16763)
Keywords: generation
Abstract: One of the major sources of uncertainty in the current generation of Global Climate Models (GCMs) is the representation of sub-grid scale physical processes. Over the years, a series of deep-learning-based parameterization schemes have been developed and tested on both idealized and real-geography GCMs. However, the datasets on which previous deep-learning models were trained either contain limited variables or have low spatial-temporal coverage, and so cannot fully simulate the parameterization process. Additionally, these schemes rely on classical architectures, while the attention mechanism used in Transformer models remains unexplored in this field. In this paper, we propose Paraformer, a "memory-aware" Transformer-based model trained on ClimSim, the largest dataset ever created for climate parameterization. Our results demonstrate that the proposed model successfully captures the complex non-linear dependencies in the sub-grid scale variables and outperforms classical deep-learning architectures. This work highlights the applicability of the attention mechanism in this field and provides valuable insights for developing future deep-learning-based climate parameterization schemes.

Title: SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

Authors: Tan-Hanh Pham, Hoang-Nam Le, Phu-Vinh Nguyen, Chris Ngo, Truong-Son Hy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16771
Pdf URL: https://arxiv.org/pdf/2412.16771
Copy Paste: [[2412.16771]] SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization(https://arxiv.org/abs/2412.16771)
Keywords: generation
Abstract: Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in human-machine interactions. Moreover, the quality of language models depends on reasoning and prompting techniques, such as chain-of-thought (CoT), which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning techniques at several levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. To this end, we introduce a dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model's ability to process and explain visual scenes from spoken input, moving beyond object recognition to reasoning-based interactions. The experiments show that SilVar achieves SOTA performance on the MMMU and ScienceQA benchmarks despite the challenge of speech-based instructions. We believe SilVar will inspire next-generation multimodal reasoning models, toward expert artificial general intelligence. Our code and dataset are available here.

Title: RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing

Authors: Zhipeng Huang, Wangbo Yu, Xinhua Cheng, ChengShu Zhao, Yunyang Ge, Mingyi Guo, Li Yuan, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16778
Pdf URL: https://arxiv.org/pdf/2412.16778
Copy Paste: [[2412.16778]] RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing(https://arxiv.org/abs/2412.16778)
Keywords: generation
Abstract: Indoor scene texture synthesis has garnered significant interest due to its important potential applications in virtual reality, digital media, and creative arts. Existing diffusion-model-based studies either rely on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or resort to optimization-based approaches that entail substantial computational overhead. In this work, we present RoomPainter, a framework that seamlessly integrates efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (MVIS) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using MVIS, we first generate a texture map for the entire room to ensure global consistency, then adopt its variant, attention-guided multi-view integrated repaint sampling (MVRS), to repaint individual instances within the room, thereby further enhancing local consistency. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency, and generation efficiency.

Title: Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers

Authors: Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.16822
Pdf URL: https://arxiv.org/pdf/2412.16822
Copy Paste: [[2412.16822]] Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers(https://arxiv.org/abs/2412.16822)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One key efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas, such as objects, require more computation. To address this, we propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in Mixture-of-Depths (MoD) efficient DiT models. Specifically, DiffRatio-MoD integrates three features: (1) A token-level routing scheme where each DiT layer includes a router that is jointly fine-tuned with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer's computation; (2) A layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) A timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the image becomes clearer. Extensive experiments on both text-to-image and inpainting tasks show that DiffRatio-MoD effectively captures dynamism across token, layer, and timestep axes, achieving superior trade-offs between generation quality and efficiency compared to prior works.
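
The token-level routing idea in the DiffRatio-MoD abstract (important tokens go through a layer, unimportant ones bypass it) can be sketched in a few lines. This is only an illustration under stated assumptions: the importance scores are fixed numbers here rather than the output of a learned router, the "layer" is a toy function, and `route_layer` and the per-timestep ratios are hypothetical names and values, not the authors' implementation.

```python
def route_layer(tokens, scores, compression_ratio, layer_fn):
    """Apply layer_fn only to the highest-scoring tokens; the rest bypass it.

    compression_ratio is the fraction of tokens that skip the layer
    (0.0 = every token is processed, 1.0 = every token bypasses).
    """
    num_skip = int(compression_ratio * len(tokens))
    num_keep = len(tokens) - num_skip
    # Rank token positions by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:num_keep])
    # Processed tokens get the layer output; skipped tokens pass through unchanged.
    return [layer_fn(t) if i in keep else t for i, t in enumerate(tokens)]


def toy_layer(x):
    """Stand-in for an expensive transformer layer (assumption)."""
    return 2.0 * x


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.8, 0.2]  # hypothetical router outputs

# Hypothetical per-timestep ratios: more skipping when the input is noisier.
ratio_per_timestep = {"noisy": 0.75, "clean": 0.25}

for name, ratio in ratio_per_timestep.items():
    print(name, route_layer(tokens, scores, ratio, toy_layer))
```

In this sketch the ratio is a plain number per timestep; the abstract describes ratios that are learned per layer and per timestep, which this example does not attempt to reproduce.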

Title: Human-Guided Image Generation for Expanding Small-Scale Training Image Datasets

Authors: Changjian Chen, Fei Lv, Yalong Guan, Pengcheng Wang, Shengjie Yu, Yifan Zhang, Zhuo Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16839
Pdf URL: https://arxiv.org/pdf/2412.16839
Copy Paste: [[2412.16839]] Human-Guided Image Generation for Expanding Small-Scale Training Image Datasets(https://arxiv.org/abs/2412.16839)
Keywords: generation, generative
Abstract: The performance of computer vision models in certain real-world applications (e.g., rare wildlife observation) is limited by the small number of available images. Expanding datasets using pre-trained generative models is an effective way to address this limitation. However, since the automatic generation process is uncontrollable, the generated images are usually limited in diversity, and some of them are undesired. In this paper, we propose a human-guided image generation method for more controllable dataset expansion. We develop a multi-modal projection method with theoretical guarantees to facilitate the exploration of both the original and generated images. Based on the exploration, users refine the prompts and re-generate images for better performance. Since directly refining the prompts is challenging for novice users, we develop a sample-level prompt refinement method to make it easier. With this method, users only need to provide sample-level feedback (e.g., which samples are undesired) to obtain better prompts. The effectiveness of our method is demonstrated through the quantitative evaluation of the multi-modal projection method, improved model performance in the case study for both classification and object detection tasks, and positive feedback from the experts.

Title: Anchor3DLane++: 3D Lane Detection via Sample-Adaptive Sparse 3D Anchor Regression

Authors: Shaofei Huang, Zhenwei Shen, Zehao Huang, Yue Liao, Jizhong Han, Naiyan Wang, Si Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16889
Pdf URL: https://arxiv.org/pdf/2412.16889
Copy Paste: [[2412.16889]] Anchor3DLane++: 3D Lane Detection via Sample-Adaptive Sparse 3D Anchor Regression(https://arxiv.org/abs/2412.16889)
Keywords: generation
Abstract: In this paper, we focus on the challenging task of monocular 3D lane detection. Previous methods typically adopt inverse perspective mapping (IPM) to transform the Front-Viewed (FV) images or features into the Bird-Eye-Viewed (BEV) space for lane detection. However, IPM's dependence on the flat-ground assumption and the loss of context information in BEV representations lead to inaccurate 3D information estimation. Though efforts have been made to bypass BEV and directly predict 3D lanes from FV representations, their performance still falls behind that of BEV-based methods due to a lack of structured modeling of 3D lanes. In this paper, we propose a novel BEV-free method named Anchor3DLane++, which defines 3D lane anchors as structural representations and makes predictions directly from FV features. We also design a Prototype-based Adaptive Anchor Generation (PAAG) module to generate sample-adaptive sparse 3D anchors dynamically. In addition, an Equal-Width (EW) loss is developed to leverage the parallel property of lanes for regularization. Furthermore, camera-LiDAR fusion is also explored based on Anchor3DLane++ to leverage complementary information. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane++ outperforms previous state-of-the-art methods. Code is available at: this https URL.

Title: Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation

Authors: Quan Dao, Hao Phung, Trung Dao, Dimitris Metaxas, Anh Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16906
Pdf URL: https://arxiv.org/pdf/2412.16906
Copy Paste: [[2412.16906]] Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation(https://arxiv.org/abs/2412.16906)
Keywords: generation, generative
Abstract: Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusion-based models. However, this method still requires numerous function evaluations in the sampling process. To address this limitation, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work is a pioneer in achieving consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and on zero-shot benchmarks on the COCO dataset. Our implementation is released at this https URL.

Title: TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

Authors: Xuying Zhang, Yutong Liu, Yangguang Li, Renrui Zhang, Yufei Liu, Kai Wang, Wanli Ouyang, Zhiwei Xiong, Peng Gao, Qibin Hou, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16919
Pdf URL: https://arxiv.org/pdf/2412.16919
Copy Paste: [[2412.16919]] TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction(https://arxiv.org/abs/2412.16919)
Keywords: generation, generative
Abstract: We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence autoregressively from prefilled prompt tokens, so that the composition of 3D geometries can be modeled part by part. Extensive experiments on ShapeNet and Objaverse demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
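
The "discrete representations from a trainable codebook" step mentioned in the TAR3D abstract is, in a generic VQ-VAE, a nearest-neighbor lookup: each continuous latent vector is replaced by the index of its closest codebook entry, and the resulting index sequence is what an autoregressive model then predicts. The sketch below shows only that generic lookup; the 2-D codebook, the latents, and the function names are made-up illustrations, not TAR3D's triplane encoder or its TriPE embedding.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to the given vector."""
    return min(range(len(codebook)), key=lambda k: squared_distance(vector, codebook[k]))


def vector_quantize(latents, codebook):
    """Map each latent to a codebook index and its quantized vector."""
    indices = [nearest_code(v, codebook) for v in latents]
    quantized = [codebook[k] for k in indices]
    return indices, quantized


# Hypothetical 2-D codebook with three entries and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [-0.1, 0.1], [-0.8, 0.7]]

indices, quantized = vector_quantize(latents, codebook)
print(indices)    # discrete token sequence an autoregressive model would predict
print(quantized)  # vectors a decoder would reconstruct from
```

In a trained model the codebook entries are learned jointly with the encoder and decoder; here they are fixed so the lookup itself is easy to follow.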

Title: Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference

Authors: Wenhao Shen, Mingliang Zhou, Yu Chen, Xuekai Wei, Jun Luo, Huayan Pu, Weijia Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16939
Pdf URL: https://arxiv.org/pdf/2412.16939
Copy Paste: [[2412.16939]] Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference(https://arxiv.org/abs/2412.16939)
Keywords: quality assessment
Abstract: Existing full-reference image quality assessment (FR-IQA) methods often fail to capture the complex causal mechanisms that underlie human perceptual responses to image distortions, limiting their ability to generalize across diverse scenarios. In this paper, we propose an FR-IQA method based on abductive counterfactual inference to investigate the causal relationships between deep network features and perceptual distortions. First, we explore the causal effects of deep features on perception and integrate causal reasoning with feature comparison, constructing a model that effectively handles complex distortion types across different IQA scenarios. Second, the analysis of the perceptual causal correlations of our proposed method is independent of the backbone architecture and thus can be applied to a variety of deep networks. Through abductive counterfactual experiments, we validate the proposed causal relationships, confirming the model's superior perceptual relevance and interpretability of quality scores. The experimental results demonstrate the robustness and effectiveness of the method, providing competitive quality predictions across multiple benchmarks. The source code is available at this https URL.

Title: DTSGAN: Learning Dynamic Textures via Spatiotemporal Generative Adversarial Network

Authors: Xiangtian Li, Xiaobo Wang, Zhen Qi, Han Cao, Zhaoyang Zhang, Ao Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.16948
Pdf URL: https://arxiv.org/pdf/2412.16948
Copy Paste: [[2412.16948]] DTSGAN: Learning Dynamic Textures via Spatiotemporal Generative Adversarial Network(https://arxiv.org/abs/2412.16948)
Keywords: generative
Abstract: Dynamic texture synthesis aims to generate sequences that are visually similar to a reference video texture and exhibit specific stationary properties in time. In this paper, we introduce a spatiotemporal generative adversarial network (DTSGAN) that can learn from a single dynamic texture by capturing its motion and content distribution. With the pipeline of DTSGAN, a new video sequence is generated from the coarsest scale to the finest one. To avoid mode collapse, we propose a novel strategy for data updates that helps improve the diversity of generated results. Qualitative and quantitative experiments show that our model is able to generate high quality dynamic textures and natural motion.

Title: PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask

Authors: Jeongho Kim, Hoiyeong Jin, Sunghyun Park, Jaegul Choo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.16978
Pdf URL: https://arxiv.org/pdf/2412.16978
Copy Paste: [[2412.16978]] PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask(https://arxiv.org/abs/2412.16978)
Keywords: generation, generative
Abstract: Recent virtual try-on approaches have advanced by fine-tuning pre-trained text-to-image diffusion models to leverage their powerful generative ability. However, the use of text prompts in virtual try-on is still underexplored. This paper tackles a text-editable virtual try-on task that changes the clothing item based on the provided clothing image while editing the wearing style (e.g., tucking style, fit) according to the text descriptions. In text-editable virtual try-on, three key aspects exist: (i) designing rich text descriptions for paired person-clothing data to train the model, (ii) addressing the conflicts where textual information about the person's existing clothing interferes with the generation of the new clothing, and (iii) adaptively adjusting the inpainting mask to align with the text descriptions, ensuring proper editing areas while preserving the parts of the original person's appearance that are irrelevant to the new clothing. To address these aspects, we propose PromptDresser, a text-editable virtual try-on model that leverages large multimodal model (LMM) assistance to enable high-quality and versatile manipulation based on generative text prompts. Our approach utilizes LMMs via in-context learning to generate detailed text descriptions for person and clothing images independently, including pose details and editing attributes, at minimal human cost. Moreover, to ensure proper editing areas, we adaptively adjust the inpainting mask depending on the text prompts. We found that our approach, utilizing detailed text prompts, not only enhances text editability but also effectively conveys clothing details that are difficult to capture through images alone, thereby enhancing image quality. Our code is available at this https URL.
摘要：最近的虚拟试穿方法通过微调预先训练的文本到图像扩散模型来利用其强大的生成能力，取得了进展。然而，文本提示在虚拟试穿中的使用仍未得到充分探索。本文解决了一个可编辑文本的虚拟试穿任务，该任务根据提供的服装图像更改服装项目，同时根据文本描述编辑穿着风格（例如，塞进裤子里、合身度）。在可编辑文本的虚拟试穿中，存在三个关键方面：（i）为配对的人-服装数据设计丰富的文本描述以训练模型，（ii）解决现有人物服装的文本信息干扰新服装生成的冲突，以及（iii）自适应地调整与文本描述对齐的修复蒙版，确保正确的编辑区域，同时保留与新服装无关的原始人物的外观。为了解决这些问题，我们提出了 PromptDresser，这是一个可编辑文本的虚拟试穿模型，它利用大型多模态模型 (LMM) 的辅助，基于生成文本提示实现高质量和多功能的操作。我们的方法通过上下文学习利用 LMM，以最少的人力成本独立生成人物和服装图像的详细文本描述，包括姿势细节和编辑属性。此外，为了确保编辑区域，我们根据文本提示自适应地调整修复蒙版。我们发现，利用详细文本提示的方法不仅增强了文本的可编辑性，而且还有效地传达了仅通过图像难以捕捉的服装细节，从而提高了图像质量。我们的代码可在此 https URL 上找到。

Title: InterDance: Reactive 3D Dance Generation with Realistic Duet Interactions

Authors: Ronghui Li, Youliang Zhang, Yachao Zhang, Yuxiang Zhang, Mingyang Su, Jie Guo, Ziwei Liu, Yebin Liu, Xiu Li
Subjects: cs.CV, cs.GR, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.16982
Pdf URL: https://arxiv.org/pdf/2412.16982
Copy Paste: [[2412.16982]] InterDance: Reactive 3D Dance Generation with Realistic Duet Interactions(https://arxiv.org/abs/2412.16982)
Keywords: generation, generative
Abstract: Humans perform a variety of interactive motions, among which duet dance is one of the most challenging interactions. However, in terms of human motion generative models, existing works are still unable to generate high-quality interactive motions, especially in the field of duet dance. On the one hand, this is due to the lack of large-scale high-quality datasets. On the other hand, it arises from the incomplete representation of interactive motion and the lack of fine-grained optimization of interactions. To address these challenges, we propose InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres. Built upon this dataset, we propose a new motion representation that can accurately and comprehensively describe interactive motion. We further introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively. Extensive experiments demonstrate the effectiveness of our dataset and algorithm.
摘要：人类会做出各种各样的交互动作，双人舞是其中最具挑战性的交互动作之一。然而在人体动作生成模型方面，现有的工作仍然无法生成高质量的交互动作，尤其是在双人舞领域。一方面是由于缺乏大规模高质量的数据集，另一方面则是由于交互动作的表征不完整以及缺乏对交互的细粒度优化。针对这些挑战，我们提出了一个大规模双人舞数据集 InterDance，显著提升了动作质量、数据规模以及舞蹈类型的多样性。在此数据集的基础上，我们提出了一种新的动作表征，可以准确全面地描述交互动作。我们进一步引入了一个基于扩散的框架和交互细化引导策略，逐步优化交互的真实感。大量实验证明了数据集和算法的有效性。

Title: Where am I? Cross-View Geo-localization with Natural Language Descriptions

Authors: Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Conghui He, Weijia Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17007
Pdf URL: https://arxiv.org/pdf/2412.17007
Copy Paste: [[2412.17007]] Where am I? Cross-View Geo-localization with Natural Language Descriptions(https://arxiv.org/abs/2412.17007)
Keywords: generation
Abstract: Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve corresponding satellite images or OSM database based on scene text. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Furthermore, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at this https URL.
摘要：跨视图地理定位通过将街景图像与地理标记的卫星图像或 OSM 进行匹配来识别街景图像的位置。然而，大多数研究都集中在图像到图像的检索上，很少有研究涉及文本引导的检索，而文本引导的检索对于行人导航和应急响应等应用至关重要。在这项工作中，我们引入了一种具有自然语言描述的跨视图地理定位的新任务，旨在根据场景文本检索相应的卫星图像或 OSM 数据库。为了支持这项任务，我们通过收集来自多个城市的跨视图数据并采用场景文本生成方法来构建 CVG-Text 数据集，该方法利用大型多模态模型的注释功能来生成具有本地化的高质量场景文本描述。在此 http URL 中，我们提出了一种基于文本的新型检索定位方法 CrossText2Loc，该方法将召回率提高了 10%，并展示了出色的长文本检索能力。在可解释性方面，它不仅提供相似度分数，还提供检索原因。更多信息可以在此 https URL 中找到。

Title: HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories

Authors: Eric Hedlin, Munawar Hayat, Fatih Porikli, Kwang Moo Yi, Shweta Mahajan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.17040
Pdf URL: https://arxiv.org/pdf/2412.17040
Copy Paste: [[2412.17040]] HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories(https://arxiv.org/abs/2412.17040)
Keywords: generation, generative
Abstract: To efficiently adapt large models or to train generative models of neural representations, Hypernetworks have drawn interest. While hypernetworks work well, training them is cumbersome, and often requires ground truth optimized weights for each sample. However, obtaining each of these weights is a training problem of its own: one needs to train, e.g., adaptation weights or even an entire neural field for hypernetworks to regress to. In this work, we propose a method to train hypernetworks without the need for any per-sample ground truth. Our key idea is to learn a Hypernetwork `Field` and estimate the entire trajectory of network weight training instead of simply its converged state. In other words, we introduce an additional input to the Hypernetwork, the convergence state, which then makes it act as a neural field that models the entire convergence pathway of a task network. A critical benefit in doing so is that the gradient of the estimated weights at any convergence state must then match the gradients of the original task -- this constraint alone is sufficient to train the Hypernetwork Field. We demonstrate the effectiveness of our method through the task of personalized image generation and 3D shape reconstruction from images and point clouds, demonstrating competitive results without any per-sample ground truth.
摘要：为了有效地适应大型模型或训练神经表征的生成模型，超网络引起了人们的兴趣。虽然超网络运行良好，但训练它们却很麻烦，而且通常需要为每个样本提供基本事实优化权重。然而，获得这些权重中的每一个都是一个训练问题——需要训练，例如，适应权重，甚至整个神经场，以便超网络回归。在这项工作中，我们提出了一种训练超网络的方法，不需要任何每个样本的基本事实。我们的关键思想是学习超网络“场”并估计网络权重训练的整个轨迹，而不仅仅是它的收敛状态。换句话说，我们向超网络引入了一个额外的输入，即收敛状态，然后使其充当神经场，模拟任务网络的整个收敛路径。这样做的一个关键好处是，任何收敛状态下估计权重的梯度都必须与原始任务的梯度相匹配——仅此约束就足以训练超网络场。我们通过从图像和点云进行个性化图像生成和 3D 形状重建的任务证明了我们方法的有效性，并且在没有任何每个样本基本事实的情况下展示了具有竞争力的结果。

Title: Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation

Authors: Luoxu Jin, Hiroshi Watanabe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17042
Pdf URL: https://arxiv.org/pdf/2412.17042
Copy Paste: [[2412.17042]] Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation(https://arxiv.org/abs/2412.17042)
Keywords: generation, generative
Abstract: The development of video generation models has advanced significantly in recent years. For video frame interpolation, we adopt a pre-trained large-scale image-to-video diffusion model. To enable this adaptation, we propose a conditional encoder, which serves as a simple yet effective trainable module. By leveraging the first and last frames, we extract spatial and temporal features and input them into the conditional encoder. The computed features of the conditional encoder guide the video diffusion model in generating keyframe-guided video sequences. Our method demonstrates superior performance on the Fréchet Video Distance (FVD) metric compared to previous deterministic approaches in handling large-motion cases, highlighting advancements in generative-based methodologies.
摘要：近年来，视频生成模型的发展取得了长足进步。对于视频帧插值，我们采用了预先训练的大规模图像到视频扩散模型。为了实现这种适应性，我们提出了一个条件编码器，它是一个简单但有效的可训练模块。通过利用第一帧和最后一帧，我们提取空间和时间特征并将其输入到条件编码器中。条件编码器的计算特征指导视频扩散模型生成关键帧引导的视频序列。与处理大运动情况的先前确定性方法相比，我们的方法在 Fréchet 视频距离 (FVD) 度量上表现出卓越的性能，凸显了基于生成的方法的进步。

Title: DreamOmni: Unified Image Generation and Editing

Authors: Bin Xia, Yuechen Zhang, Jingyao Li, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17098
Pdf URL: https://arxiv.org/pdf/2412.17098
Copy Paste: [[2412.17098]] DreamOmni: Unified Image Generation and Editing(https://arxiv.org/abs/2412.17098)
Keywords: generation
Abstract: Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline using sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables scaling up editing data for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.
摘要：目前，大型语言模型 (LLM) 的成功表明，统一的多任务方法可以显著提高模型的可用性、简化部署并在不同任务之间产生协同效益。然而，在计算机视觉领域，虽然文本到图像 (T2I) 模型通过扩展显著提高了生成质量，但它们的框架设计最初并没有考虑如何与下游任务（例如各种类型的编辑）统一。为了解决这个问题，我们引入了 DreamOmni，一个用于图像生成和编辑的统一模型。我们首先分析现有框架和下游任务的要求，提出一个将 T2I 模型和各种编辑任务集成在一起的统一框架。此外，另一个关键挑战是高效创建高质量的编辑数据，特别是对于基于指令和基于拖动的编辑。为此，我们开发了一个合成数据管道，使用类似贴纸的元素来高效合成准确、高质量的数据集，从而使编辑数据能够扩展以进行统一的模型训练。对于训练，DreamOmni 联合训练 T2I 生成和下游任务。 T2I 训练增强了模型对特定概念的理解并提高了生成质量，而编辑训练则帮助模型掌握编辑任务的细微差别。这种协作显著提高了编辑性能。大量实验证实了 DreamOmni 的有效性。代码和模型即将发布。

Title: Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.17153
Pdf URL: https://arxiv.org/pdf/2412.17153
Copy Paste: [[2412.17153]] Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching(https://arxiv.org/abs/2412.17153)
Keywords: generation
Abstract: Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3× speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID>100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at this https URL.
摘要：自回归 (AR) 模型在文本和图像生成方面已取得最佳性能，但由于逐个标记的过程而导致生成速度缓慢。我们提出了一个雄心勃勃的问题：预先训练的 AR 模型是否可以适应仅用一两个步骤生成输出？如果成功，这将大大推动 AR 模型的开发和部署。我们注意到，现有的尝试通过一次生成多个标记来加速 AR 生成的工作从根本上无法捕获输出分布，因为标记之间存在条件依赖性，从而限制了它们在少步生成中的有效性。为了解决这个问题，我们提出了蒸馏解码 (DD)，它使用流匹配来创建从高斯分布到预先训练的 AR 模型的输出分布的确定性映射。然后我们训练一个网络来蒸馏这个映射，从而实现少步生成。DD 不需要原始 AR 模型的训练数据，使其更适合在最先进的图像 AR 模型上评估 DD，并在 ImageNet-256 上呈现有希望的结果。对于需要 10 步生成的 VAR，DD 可实现一步生成（加速 6.3 倍），FID 从 4.19 增加到 9.96，效果令人满意。对于 LlamaGen，DD 将生成步骤从 256 步减少到 1 步，实现了 217.8 倍加速，FID 从 4.11 增加到 11.35。在这两种情况下，基线方法都完全失败，FID>100。DD 在文本到图像生成方面也表现出色，将 LlamaGen 的生成步骤从 256 步减少到 2 步，FID 从 25.70 增加到 28.95。作为第一项展示图像 AR 模型一步生成可能性的研究，DD 挑战了 AR 模型天生速度慢的流行观念，并为高效的 AR 生成开辟了新的机会。项目网站位于此 https URL。
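
The figures quoted in the Distilled Decoding abstract can be tabulated to compare the raw step reduction with the reported wall-clock speed-up and the FID cost. The sketch below is only arithmetic on the numbers stated above; the row labels and the helper name `summarize` are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract:
# (label, steps_before, steps_after, reported_speedup, fid_before, fid_after).
# reported_speedup is None where the abstract does not state one.
REPORTED = [
    ("VAR (ImageNet-256)", 10, 1, 6.3, 4.19, 9.96),
    ("LlamaGen (ImageNet-256)", 256, 1, 217.8, 4.11, 11.35),
    ("LlamaGen (text-to-image)", 256, 2, None, 25.70, 28.95),
]


def summarize(row):
    """Return step reduction, reported speed-up, and FID increase for one row."""
    label, steps_before, steps_after, speedup, fid_before, fid_after = row
    return {
        "label": label,
        "step_reduction": steps_before / steps_after,
        "reported_speedup": speedup,
        "fid_increase": round(fid_after - fid_before, 2),
    }


summaries = [summarize(row) for row in REPORTED]
for s in summaries:
    print(s)
```

In both ImageNet-256 rows the reported speed-up is smaller than the raw step reduction (6.3× against 10×, 217.8× against 256×); the abstract does not say why.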

Title: Generative Diffusion Modeling: A Practical Handbook

Authors: Zihan Ding, Chi Jin
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.17162
Pdf URL: https://arxiv.org/pdf/2412.17162
Copy Paste: [[2412.17162]] Generative Diffusion Modeling: A Practical Handbook(https://arxiv.org/abs/2412.17162)
Keywords: generative
Abstract: This handbook offers a unified perspective on diffusion models, encompassing diffusion probabilistic models, score-based generative models, consistency models, rectified flow, and related methods. By standardizing notations and aligning them with code implementations, it aims to bridge the "paper-to-code" gap and facilitate robust implementations and fair comparisons. The content encompasses the fundamentals of diffusion models, the pre-training process, and various post-training methods. Post-training techniques include model distillation and reward-based fine-tuning. Designed as a practical guide, it emphasizes clarity and usability over theoretical depth, focusing on widely adopted approaches in generative modeling with diffusion models.
摘要：本手册提供了关于扩散模型的统一视角，涵盖了扩散概率模型、基于分数的生成模型、一致性模型、整流流和相关方法。通过标准化符号并将其与代码实现对齐，它旨在弥合“论文到代码”的差距并促进稳健的实现和公平的比较。内容涵盖扩散模型的基础知识、预训练过程和各种训练后方法。训练后技术包括模型提炼和基于奖励的微调。它旨在作为一本实用指南，强调清晰度和可用性而不是理论深度，重点介绍使用扩散模型进行生成建模时广泛采用的方法。
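
As an illustration of the kind of "paper-to-code" alignment the handbook abstract describes, here is a minimal sketch of the forward noising step of a standard diffusion probabilistic model, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. It is not taken from the handbook: the function names, the linear schedule, and its endpoints (1e-4 to 0.02) are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed endpoints, chosen for illustration)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = [1.0, -1.0, 0.5]
x_early = q_sample(x0, 0, alpha_bars, rng)    # almost no noise added
x_late = q_sample(x0, 999, alpha_bars, rng)   # close to pure Gaussian noise
print(alpha_bars[0], alpha_bars[-1])
```

With this schedule `alpha_bars` decreases monotonically toward zero, so late timesteps retain almost none of the original signal.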

Title: Enhancing Item Tokenization for Generative Recommendation through Self-Improvement

Authors: Runjin Chen, Mingxuan Ju, Ngoc Bui, Dimosthenis Antypas, Stanley Cai, Xiaopeng Wu, Leonardo Neves, Zhangyang Wang, Neil Shah, Tong Zhao
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2412.17171
Pdf URL: https://arxiv.org/pdf/2412.17171
Copy Paste: [[2412.17171]] Enhancing Item Tokenization for Generative Recommendation through Self-Improvement(https://arxiv.org/abs/2412.17171)
Keywords: generation, generative
Abstract: Generative recommendation systems, driven by large language models (LLMs), present an innovative approach to predicting user preferences by modeling items as token sequences and generating recommendations in a generative manner. A critical challenge in this approach is the effective tokenization of items, ensuring that they are represented in a form compatible with LLMs. Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. While text-based representations integrate seamlessly with LLM tokenization, they are often too lengthy, leading to inefficiencies and complicating accurate generation. Numerical strings, while concise, lack semantic depth and fail to capture meaningful item relationships. Tokenizing items as sequences of newly defined tokens has gained traction, but it often requires external models or algorithms for token assignment. These external processes may not align with the LLM's internal pretrained tokenization schema, leading to inconsistencies and reduced model performance. To address these limitations, we propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process. Our approach starts with item tokenizations generated by any external model and periodically adjusts these tokenizations based on the LLM's learned patterns. This alignment process ensures consistency between the tokenization and the LLM's internal understanding of the items, leading to more accurate recommendations. Furthermore, our method is simple to implement and can be integrated as a plug-and-play enhancement into existing generative recommendation systems. Experimental results on multiple datasets and using various initial tokenization strategies demonstrate the effectiveness of our method, with an average improvement of 8% in recommendation performance.
摘要：由大型语言模型 (LLM) 驱动的生成推荐系统通过将项目建模为标记序列并以生成方式生成推荐，提供了一种预测用户偏好的创新方法。这种方法的一个关键挑战是有效地标记项目，确保它们以与 LLM 兼容的形式表示。当前的项目标记方法包括使用文本描述、数字字符串或离散标记序列。虽然基于文本的表示与 LLM 标记无缝集成，但它们通常太长，导致效率低下并使准确生成复杂化。数字字符串虽然简洁，但缺乏语义深度，无法捕捉有意义的项目关系。将项目标记为新定义的标记序列已经获得了关注，但它通常需要外部模型或算法来分配标记。这些外部过程可能与 LLM 的内部预训练标记模式不一致，从而导致不一致和模型性能下降。为了解决这些限制，我们提出了一种自我改进的项目标记方法，允许 LLM 在训练过程中改进其自己的项目标记。我们的方法从任何外部模型生成的项目标记开始，并根据 LLM 学习到的模式定期调整这些标记。这种对齐过程可确保标记与 LLM 对项目的内部理解保持一致，从而产生更准确的推荐。此外，我们的方法易于实现，可以作为即插即用的增强功能集成到现有的生成推荐系统中。在多个数据集上使用各种初始标记策略的实验结果证明了我们方法的有效性，推荐性能平均提高了 8%。

Title: Foundation Model for Lossy Compression of Spatiotemporal Scientific Data

Authors: Xiao Li, Jaemoon Lee, Anand Rangarajan, Sanjay Ranka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.17184
Pdf URL: https://arxiv.org/pdf/2412.17184
Copy Paste: [[2412.17184]] Foundation Model for Lossy Compression of Spatiotemporal Scientific Data(https://arxiv.org/abs/2412.17184)
Keywords: super-resolution
Abstract: We present a foundation model (FM) for lossy scientific data compression, combining a variational autoencoder (VAE) with a hyper-prior structure and a super-resolution (SR) module. The VAE framework uses hyper-priors to model latent space dependencies, enhancing compression efficiency. The SR module refines low-resolution representations into high-resolution outputs, improving reconstruction quality. By alternating between 2D and 3D convolutions, the model efficiently captures spatiotemporal correlations in scientific data while maintaining low computational cost. Experimental results demonstrate that the FM generalizes well to unseen domains and varying data shapes, achieving up to 4 times higher compression ratios than state-of-the-art methods after domain-specific fine-tuning. The SR module improves compression ratio by 30 percent compared to simple upsampling techniques. This approach significantly reduces storage and transmission costs for large-scale scientific simulations while preserving data integrity and fidelity.
摘要：我们提出了一种有损科学数据压缩的基础模型 (FM)，将变分自动编码器 (VAE) 与超先验结构和超分辨率 (SR) 模块相结合。VAE 框架使用超先验来模拟潜在空间依赖性，从而提高压缩效率。SR 模块将低分辨率表示细化为高分辨率输出，从而提高重建质量。通过在 2D 和 3D 卷积之间交替，该模型可以有效捕获科学数据中的时空相关性，同时保持较低的计算成本。实验结果表明，FM 可以很好地推广到看不见的域和不同的数据形状，在特定领域进行微调后，压缩率比最先进的方法高出 4 倍。与简单的上采样技术相比，SR 模块将压缩率提高了 30%。这种方法显着降低了大规模科学模拟的存储和传输成本，同时保持了数据完整性和保真度。

Title: Discriminative Image Generation with Diffusion Models for Zero-Shot Learning

Authors: Dingjie Fu, Wenjin Hou, Shiming Chen, Shuhuang Chen, Xinge You, Salman Khan, Fahad Shahbaz Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17219
Pdf URL: https://arxiv.org/pdf/2412.17219
Copy Paste: [[2412.17219]] Discriminative Image Generation with Diffusion Models for Zero-Shot Learning(https://arxiv.org/abs/2412.17219)
Keywords: generation, generative
Abstract: Generative Zero-Shot Learning (ZSL) methods synthesize class-related features based on predefined class semantic prototypes, showcasing superior performance. However, this feature generation paradigm falls short of providing interpretable insights. In addition, existing approaches rely on semantic prototypes annotated by human experts, which exhibit a significant limitation in their scalability to generalized scenes. To overcome these deficiencies, a natural solution is to generate images for unseen classes using text prompts. To this end, we present DIG-ZSL, a novel Discriminative Image Generation framework for Zero-Shot Learning. Specifically, to ensure the generation of discriminative images for training an effective ZSL classifier, we learn a discriminative class token (DCT) for each unseen class under the guidance of a pre-trained category discrimination model (CDM). Harnessing DCTs, we can generate diverse and high-quality images, which serve as informative unseen samples for ZSL tasks. Extensive experiments and visualizations on four datasets show that our DIG-ZSL: (1) generates diverse and high-quality images, (2) outperforms previous state-of-the-art nonhuman-annotated semantic prototype-based methods by a large margin, and (3) achieves comparable or better performance than baselines that leverage human-annotated semantic prototypes. The code will be made available upon acceptance of the paper.
摘要：生成式零样本学习 (ZSL) 方法基于预定义的类语义原型合成与类相关的特征，表现出卓越的性能。然而，这种特征生成范式未能提供可解释的见解。此外，现有方法依赖于人类专家注释的语义原型，这在扩展到广义场景方面表现出很大的局限性。为了克服这些缺陷，一个自然的解决方案是使用文本提示为未见类生成图像。为此，我们提出了 DIG-ZSL，一种用于零样本学习的新型判别图像生成框架。具体而言，为了确保生成判别图像以训练有效的 ZSL 分类器，我们在预训练类别判别模型 (CDM) 的指导下为每个未见类学习一个判别类标记 (DCT)。利用 DCT，我们可以生成多样化的高质量图像，这些图像可作为 ZSL 任务的信息性未见样本。在本文中，我们在四个数据集上进行了大量的实验和可视化，结果表明我们的 DIG-ZSL：(1) 生成多样化且高质量的图像；(2) 大大优于之前最先进的基于非人工注释语义原型的方法；(3) 实现了与利用人工注释语义原型的基线相当或更好的性能。代码将在论文被接受后提供。

Title: CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder

Authors: Lichen Ma, Tiezhu Yue, Pei Fu, Yujie Zhong, Kai Zhou, Xiaoming Wei, Jie Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17225
Pdf URL: https://arxiv.org/pdf/2412.17225
Copy Paste: [[2412.17225]] CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder(https://arxiv.org/abs/2412.17225)
Keywords: generation
Abstract: Recently, significant advancements have been made in diffusion-based visual text generation models. Although the effectiveness of these methods in visual text rendering is rapidly improving, they still encounter challenges such as inaccurate characters and strokes when rendering complex visual text. In this paper, we propose CharGen, a highly accurate character-level visual text generation and editing model. Specifically, CharGen employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character. This enables it to capture fine-grained cross-modality features more effectively. Additionally, we introduce a new perceptual loss in CharGen to enhance character shape supervision and address the issue of inaccurate strokes in generated text. It is worth mentioning that CharGen can be integrated into existing diffusion models to generate visual text with high accuracy. CharGen significantly improves text rendering accuracy, outperforming recent methods in public benchmarks such as AnyText-benchmark and MARIO-Eval, with improvements of more than 8% and 6%, respectively. Notably, CharGen achieved a 5.5% increase in accuracy on Chinese test sets.
摘要：近来，基于扩散的视觉文本生成模型取得了重大进展。虽然这些方法在视觉文本渲染中的有效性正在迅速提高，但它们在渲染复杂的视觉文本时仍然面临诸如字符和笔画不准确等挑战。在本文中，我们提出了一个高精度的字符级视觉文本生成和编辑模型CharGen。具体来说，CharGen采用字符级多模态编码器，不仅可以提取字符级文本嵌入，还可以逐个字符地编码字形图像。这使得它能够更有效地捕获细粒度的跨模态特征。此外，我们在CharGen中引入了新的感知损失，以增强字符形状监督并解决生成文本中笔画不准确的问题。值得一提的是，CharGen可以集成到现有的扩散模型中，以高精度生成视觉文本。CharGen显著提高了文本渲染精度，在AnyText-benchmark和MARIO-Eval等公共基准测试中的表现优于近期方法，分别提高了8%和6%以上。值得注意的是，CharGen在中文测试集上的准确率提高了5.5%。

Title: OLiDM: Object-aware LiDAR Diffusion Models for Autonomous Driving

Authors: Tianyi Yan, Junbo Yin, Xianpeng Lang, Ruigang Yang, Cheng-Zhong Xu, Jianbing Shen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.17226
Pdf URL: https://arxiv.org/pdf/2412.17226
Copy Paste: [[2412.17226]] OLiDM: Object-aware LiDAR Diffusion Models for Autonomous Driving(https://arxiv.org/abs/2412.17226)
Keywords: generation
Abstract: To enhance autonomous driving safety in complex scenarios, various methods have been proposed to simulate LiDAR point cloud data. Nevertheless, these methods often face challenges in producing high-quality, diverse, and controllable foreground objects. To address the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating high-fidelity LiDAR data at both the object and the scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable outputs at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad effectiveness of OLiDM is demonstrated across various LiDAR generation tasks, as well as in 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 17.5 in FPD. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47% increase in semantic IoU. Moreover, OLiDM enhances the performance of mainstream 3D detectors by 2.4% in mAP and 1.9% in NDS, underscoring its potential in advancing object-aware 3D tasks. Code is available at: this https URL.
摘要：为了提高复杂场景下的自动驾驶安全性，已经提出了各种方法来模拟 LiDAR 点云数据。然而，这些方法在生成高质量、多样化和可控的前景物体方面往往面临挑战。为了满足 3D 感知中物体感知任务的需求，我们引入了 OLiDM，这是一个能够在物体和场景级别生成高保真 LiDAR 数据的新框架。OLiDM 由两个关键组件组成：对象场景渐进生成 (OPG) 模块和对象语义对齐 (OSA) 模块。OPG 适应用户特定的提示以生成所需的前景物体，随后将其用作场景生成的条件，确保在物体和场景级别都有可控的输出。这也有助于将用户定义的对象级注释与生成的 LiDAR 场景相关联。此外，OSA 旨在纠正前景物体和背景场景之间的错位，从而提高生成物体的整体质量。 OLiDM 的广泛有效性在各种 LiDAR 生成任务以及 3D 感知任务中得到了证明。具体来说，在 KITTI-360 数据集上，OLiDM 在 FPD 方面比 UltraLiDAR 等先前最先进的方法高出 17.5 倍。此外，在从稀疏到密集的 LiDAR 补全中，OLiDM 比 LiDARGen 取得了显著的改进，语义 IoU 增加了 57.47%。此外，OLiDM 将主流 3D 检测器的性能在 mAP 上提高了 2.4%，在 NDS 上提高了 1.9%，凸显了其在推进对象感知 3D 任务方面的潜力。代码可在以下网址获取：此 https URL。

Title: FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks

Authors: Qian Chen, Xianhao Chen, Kaibin Huang
Subjects: cs.LG, cs.IT, cs.NI
Abstract URL: https://arxiv.org/abs/2412.17231
Pdf URL: https://arxiv.org/pdf/2412.17231
Copy Paste: [[2412.17231]] FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks(https://arxiv.org/abs/2412.17231)
Keywords: generation
Abstract: To bridge the digital divide, the space-ground integrated networks (SGINs), which will be a key component of the sixth-generation (6G) mobile networks, are expected to deliver artificial intelligence (AI) services to every corner of the world. One mission of SGINs is to support federated learning (FL) at a global scale. However, existing space-ground integrated FL frameworks involve ground stations or costly inter-satellite links, entailing excessive training latency and communication costs. To overcome these limitations, we propose an infrastructure-free federated learning framework based on a model dispersal (FedMeld) strategy, which exploits periodic movement patterns and store-carry-forward capabilities of satellites to enable parameter mixing across large-scale geographical regions. We theoretically show that FedMeld leads to global model convergence and quantify the effects of round interval and mixing ratio between adjacent areas on its learning performance. Based on the theoretical results, we formulate a joint optimization problem to design the staleness control and mixing ratio (SC-MR) for minimizing the training loss. By decomposing the problem into sequential SC and MR subproblems without compromising the optimality, we derive the round interval solution in a closed form and the mixing ratio in a semi-closed form to achieve the optimal latency-accuracy tradeoff. Experiments using various datasets demonstrate that FedMeld achieves superior model accuracy while significantly reducing communication costs as compared with traditional FL schemes for SGINs.
摘要：为了缩小数字鸿沟，天地一体化网络 (SGIN) 将成为第六代 (6G) 移动网络的关键组成部分，有望将人工智能 (AI) 服务带到世界的每个角落。SGIN 的使命之一是支持全球范围内的联邦学习 (FL)。然而，现有的天地一体化 FL 框架涉及地面站或昂贵的卫星间链路，导致过高的训练延迟和通信成本。为了克服这些限制，我们提出了一种基于模型分散 (FedMeld) 策略的无基础设施联邦学习框架，该框架利用卫星的周期性运动模式和存储结转功能来实现跨大规模地理区域的参数混合。我们从理论上证明了 FedMeld 可实现全局模型收敛，并量化相邻区域之间的轮次间隔和混合比对其学习性能的影响。基于理论结果，我们制定了一个联合优化问题来设计陈旧性控制和混合比 (SC-MR)，以最小化训练损失。通过在不影响最优性的前提下将问题分解为连续的 SC 和 MR 子问题，我们以闭式推导出循环区间解，以半闭式推导出混合比，以实现 \textit{最佳} 延迟-准确度权衡。使用各种数据集进行的实验表明，与传统的 SGIN FL 方案相比，FedMeld 实现了卓越的模型准确度，同时显著降低了通信成本。
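
The FedMeld abstract refers to parameter mixing between adjacent areas governed by a mixing ratio, but does not give the update rule. The sketch below assumes the simplest possible form, a convex combination of an area's local parameters with parameters carried in by a satellite; this form, the function name `mix_parameters`, and the flat-list parameter representation are all hypothetical and may differ from the paper's actual scheme.

```python
def mix_parameters(local, carried, mixing_ratio):
    """Blend two parameter vectors: (1 - r) * local + r * carried.

    Hypothetical rule for illustration only; the abstract does not
    specify how FedMeld combines parameters.
    """
    if not 0.0 <= mixing_ratio <= 1.0:
        raise ValueError("mixing_ratio must lie in [0, 1]")
    if len(local) != len(carried):
        raise ValueError("parameter vectors must have the same length")
    return [
        (1.0 - mixing_ratio) * l + mixing_ratio * c
        for l, c in zip(local, carried)
    ]


area_a = [0.0, 2.0, 4.0]      # parameters trained in one region
from_area_b = [2.0, 0.0, 4.0]  # parameters carried over from a neighbouring region
mixed = mix_parameters(area_a, from_area_b, 0.5)
print(mixed)
```

Under this assumed rule, a ratio of 0 keeps the local model unchanged and a ratio of 1 replaces it with the carried model.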

Title: QTSeg: A Query Token-Based Architecture for Efficient 2D Medical Image Segmentation

Authors: Phuong-Nam Tran, Nhat Truong Pham, Duc Ngoc Minh Dang, Eui-Nam Huh, Choong Seon Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.17241
Pdf URL: https://arxiv.org/pdf/2412.17241
Copy Paste: [[2412.17241]] QTSeg: A Query Token-Based Architecture for Efficient 2D Medical Image Segmentation(https://arxiv.org/abs/2412.17241)
Keywords: generation
Abstract: Medical image segmentation is crucial in assisting medical doctors in making diagnoses and enabling accurate automatic diagnosis. While advanced convolutional neural networks (CNNs) excel in segmenting regions of interest with pixel-level precision, they often struggle with long-range dependencies, which are crucial for enhancing model performance. Conversely, transformer architectures leverage attention mechanisms to excel in handling long-range dependencies. However, the computational complexity of transformers grows quadratically, posing resource-intensive challenges, especially with high-resolution medical images. Recent research aims to combine CNN and transformer architectures to mitigate their drawbacks and enhance performance while keeping resource demands low. Nevertheless, existing approaches have not fully leveraged the strengths of both architectures to achieve high accuracy with low computational requirements. To address this gap, we propose a novel architecture for 2D medical image segmentation (QTSeg) that leverages a feature pyramid network (FPN) as the image encoder, a multi-level feature fusion (MLFF) as the adaptive module between encoder and decoder, and a multi-query mask decoder (MQM Decoder) as the mask decoder. In the first step, an FPN model extracts pyramid features from the input image. Next, MLFF is incorporated between the encoder and decoder to adapt features from different encoder stages to the decoder. Finally, an MQM Decoder is employed to improve mask generation by integrating query tokens with pyramid features at all stages of the mask decoder. Our experimental results show that QTSeg outperforms state-of-the-art methods across all metrics with lower computational demands than the baseline and the existing methods. Code is available at this https URL (v0.1.0)
摘要：医学图像分割对于协助医生进行诊断和实现准确的自动诊断至关重要。虽然先进的卷积神经网络 (CNN) 在以像素级精度分割感兴趣区域方面表现出色，但它们通常在处理长距离依赖性方面遇到困难，而长距离依赖性对于提高模型性能至关重要。相反，Transformer 架构利用注意力机制来处理长距离依赖性。然而，Transformer 的计算复杂度呈二次增长，带来了资源密集型挑战，尤其是对于高分辨率医学图像。最近的研究旨在结合 CNN 和 Transformer 架构，以减轻它们的缺点并提高性能，同时保持较低的资源需求。然而，现有的方法并没有充分利用这两种架构的优势来实现高精度和低计算要求。为了解决这一差距，我们提出了一种用于 2D 医学图像分割 (QTSeg) 的新型架构，该架构利用特征金字塔网络 (FPN) 作为图像编码器，利用多级特征融合 (MLFF) 作为编码器和解码器之间的自适应模块，利用多查询掩码解码器 (MQM 解码器) 作为掩码解码器。第一步，FPN 模型从输入图像中提取金字塔特征。接下来，在编码器和解码器之间加入 MLFF，以使不同编码器阶段的特征适应解码器。最后，使用 MQM 解码器通过在掩码解码器的所有阶段将查询标记与金字塔特征集成来改进掩码生成。我们的实验结果表明，QTSeg 在所有指标上的表现均优于最先进的方法，并且计算需求低于基线和现有方法。代码可在此 https URL (v0.1.0) 上找到
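
The quadratic growth mentioned in the QTSeg abstract can be made concrete with a small calculation. The sketch assumes a plain self-attention layer over non-overlapping square patches, so the attention matrix has one entry per pair of tokens; the patch size of 16 and the function names are illustrative assumptions and are not taken from the paper.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


def attention_entries(height, width, patch):
    """Entries in one full self-attention matrix: tokens squared."""
    n = num_tokens(height, width, patch)
    return n * n


PATCH = 16  # assumed patch size, for illustration only
table = {
    side: (num_tokens(side, side, PATCH), attention_entries(side, side, PATCH))
    for side in (256, 512, 1024)
}
for side, (tokens, entries) in table.items():
    print(f"{side}x{side}: {tokens} tokens, {entries} attention entries")
```

Doubling the image side quadruples the token count and multiplies the attention matrix size by 16, which is why high-resolution medical images are costly for full attention.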

Title: Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory

Authors: Xingyao Li, Fengzhuo Zhang, Jiachun Pan, Yunlong Hou, Vincent Y. F. Tan, Zhuoran Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.17254
Pdf URL: https://arxiv.org/pdf/2412.17254
Copy Paste: [[2412.17254]] Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory(https://arxiv.org/abs/2412.17254)
Keywords: generation
Abstract: Despite the considerable progress achieved in the long video generation problem, there is still significant room to improve the consistency of the videos, particularly in terms of smoothness and transitions between scenes. We address these issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which meticulously edits the attention score matrix based on the Discrete Short-Time Fourier Transform. Our method is supported by a theoretical guarantee, the first-of-its-kind for frequency-based methods in diffusion models. For videos generated by multiple prompts, we further investigate key factors affecting prompt interpolation quality and propose PromptBlend, an advanced prompt interpolation pipeline. The efficacy of our proposed method is validated via extensive experimental results, exhibiting consistent and impressive improvements over baseline methods. The code will be released upon acceptance.
摘要：尽管在长视频生成问题上取得了长足的进步，但视频的一致性仍有很大提升空间，特别是在流畅度和场景间过渡方面。我们解决了这些问题，以增强使用单个或多个提示生成的视频的一致性和连贯性。我们提出了基于时频的时间注意力重加权算法 (TiARA)，该算法基于离散短时傅里叶变换精心编辑注意力得分矩阵。我们的方法有理论保证，是扩散模型中基于频率的方法的首创。对于由多个提示生成的视频，我们进一步研究影响提示插值质量的关键因素，并提出了一种先进的提示插值管道 PromptBlend。我们提出的方法的有效性通过大量实验结果得到验证，与基线方法相比，表现出一致且令人印象深刻的改进。代码将在接受后发布。

Title: Free-viewpoint Human Animation with Pose-correlated Reference Selection

Authors: Fa-Ting Hong, Zhan Xu, Haiyang Liu, Qinjie Lin, Luchuan Song, Zhixin Shu, Yang Zhou, Duygu Ceylan, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17290
Pdf URL: https://arxiv.org/pdf/2412.17290
Copy Paste: [[2412.17290]] Free-viewpoint Human Animation with Pose-correlated Reference Selection(https://arxiv.org/abs/2412.17290)
Keywords: generation, generative
Abstract: Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where camera-character distance varies. This limits applications such as cinematic shot-type planning or camera control. We propose a pose-correlated reference selection diffusion network, supporting substantial viewpoint variations in human animation. Our key idea is to enable the network to utilize multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To eliminate the computational cost, we first introduce a novel pose correlation module to compute similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy, utilizing the attention map to identify key regions for animation generation. To train our model, we curated a large dataset from public TED talks featuring varied shots of the same character, helping the model learn synthesis for different perspectives. Our experimental results show that with the same number of reference images, our model performs favorably compared to the current SOTA methods under large viewpoint change. We further show that the adaptive reference selection is able to choose the most relevant reference regions to generate humans under free viewpoints.
摘要：基于扩散的人体动画旨在根据源人体图像以及一系列姿势等驱动信号来为人体角色制作动画。利用扩散模型的生成能力，现有方法能够生成高保真姿势，但在显著的视点变化下会遇到困难，尤其是在相机与角色距离变化的放大/缩小场景中。这限制了电影镜头类型规划或相机控制等应用。我们提出了一种姿势相关的参考选择扩散网络，支持人体动画中的大量视点变化。我们的主要思想是使网络能够利用多个参考图像作为输入，因为显著的视点变化通常会导致人体外观细节缺失。为了消除计算成本，我们首先引入了一种新颖的姿势相关模块来计算非对齐目标和源姿势之间的相似性，然后提出一种自适应参考选择策略，利用注意力图来识别动画生成的关键区域。为了训练我们的模型，我们从公开的 TED 演讲中整理了一个大型数据集，其中包含同一角色的不同镜头，帮助模型学习不同视角的合成。我们的实验结果表明，在参考图像数量相同的情况下，我们的模型在视角变化较大的情况下的表现优于当前的 SOTA 方法。我们进一步表明，自适应参考选择能够在自由视角下选择最相关的参考区域来生成人体。

Title: EcoSearch: A Constant-Delay Best-First Search Algorithm for Program Synthesis

Authors: Théo Matricon, Nathanaël Fijalkow, Guillaume Lagarde
Subjects: cs.LG, cs.AI, cs.PL
Abstract URL: https://arxiv.org/abs/2412.17330
Pdf URL: https://arxiv.org/pdf/2412.17330
Copy Paste: [[2412.17330]] EcoSearch: A Constant-Delay Best-First Search Algorithm for Program Synthesis(https://arxiv.org/abs/2412.17330)
Keywords: generation
Abstract: Many approaches to program synthesis perform a combinatorial search within a large space of programs to find one that satisfies a given specification. To tame the search space blowup, previous works introduced probabilistic and neural approaches to guide this combinatorial search by inducing heuristic cost functions. Best-first search algorithms guarantee that programs are searched in the exact order induced by the cost function, significantly reducing the portion of the program space to be explored. We present a new best-first search algorithm called EcoSearch, which is the first constant-delay algorithm for pre-generation cost functions: the amount of compute required between outputting two programs is constant, and in particular does not increase over time. This key property yields important speedups: we observe that EcoSearch outperforms its predecessors on two classic domains.
摘要：许多程序综合方法在大量程序中执行组合搜索，以找到满足给定规范的程序。为了控制搜索空间膨胀，以前的研究引入了概率和神经方法，通过引入启发式成本函数来指导这种组合搜索。最佳优先搜索算法确保按照成本函数诱导的准确顺序进行搜索，从而显著减少要探索的程序空间部分。我们提出了一种名为 EcoSearch 的新型最佳优先搜索算法，它是第一个用于预生成成本函数的恒定延迟算法：输出两个程序之间所需的计算量是恒定的，特别是不会随着时间的推移而增加。这一关键属性带来了重要的加速：我们观察到 EcoSearch 在两个经典领域的表现优于其前辈。
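
To make "searching in the exact order induced by the cost function" concrete, here is a minimal sketch of a generic heap-based best-first enumeration. It is not EcoSearch itself and it is not constant-delay (each heap operation costs logarithmic time in the frontier size); the toy token grammar, the probabilities, and the names `TOKEN_PROBS` and `best_first_programs` are assumptions made up for illustration.

```python
import heapq
import math
from itertools import islice

# Hypothetical toy setting: a "program" is a token sequence that ends in "END".
# The token probabilities are invented for illustration only.
TOKEN_PROBS = {"a": 0.5, "b": 0.3, "END": 0.2}
# Cost of a token is its negative log-probability, so every cost is positive.
TOKEN_COST = {tok: -math.log(p) for tok, p in TOKEN_PROBS.items()}


def best_first_programs():
    """Yield (cost, program) for complete programs in non-decreasing cost order."""
    # Heap entries are (cost so far, partial token tuple).
    frontier = [(0.0, ())]
    while frontier:
        cost, partial = heapq.heappop(frontier)
        if partial and partial[-1] == "END":
            # Every other entry, and every extension of it, costs at least as much.
            yield cost, partial
            continue
        for tok, tok_cost in TOKEN_COST.items():
            heapq.heappush(frontier, (cost + tok_cost, partial + (tok,)))


first_five = list(islice(best_first_programs(), 5))
```

Because the frontier keeps growing, the work between two consecutive outputs of this naive version tends to increase over time, which is the behaviour a constant-delay algorithm is meant to avoid.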

Title: Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition

Authors: Jaeheun Jung, Jaehyuk Lee, Chang-Hae Jung, Hanyoung Kim, Bosung Jung, Donghun Lee
Subjects: cs.LG, cs.AI, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2412.17333
Pdf URL: https://arxiv.org/pdf/2412.17333
Copy Paste: [[2412.17333]] Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition(https://arxiv.org/abs/2412.17333)
Keywords: generation
Abstract: Earthquakes are rare. Hence there is a fundamental call for reliable methods to generate realistic ground motion data for data-driven approaches in seismology. Recent GAN-based methods fall short of the call, as the methods either require special information such as geological traits or generate subpar waveforms that fail to satisfy seismological constraints such as phase arrival times. We propose a specialized Latent Diffusion Model (LDM) that reliably generates realistic waveforms after learning from real earthquake data with minimal conditions: location and magnitude. We also design a domain-specific training method that exploits the traits of the earthquake dataset: multiple observed waveforms, time-aligned and paired to each earthquake source, that are tagged with seismological metadata comprising earthquake magnitude, depth of focus, and the locations of epicenter and seismometers. We construct the time-aligned earthquake dataset using the Southern California Earthquake Data Center (SCEDC) API, and train our model with the dataset and our proposed training method for performance evaluation. Our model surpasses all comparable data-driven methods in various test criteria not only from waveform generation domain but also from seismology such as phase arrival time, GMPE analysis, and spectrum analysis. Our result opens new future research directions for deep learning applications in seismology.
摘要：地震很少发生。因此，对于地震学中数据驱动的方法，人们迫切需要可靠的方法来生成真实的地面运动数据。最近基于 GAN 的方法未能满足这一要求，因为这些方法要么需要特殊信息（例如地质特征），要么生成低于标准的波形，无法满足相位到达时间等地震学约束。我们提出了一种专门的潜在扩散模型 (LDM)，该模型在从真实地震数据中学习后，以最低条件（位置和震级）可靠地生成逼真的波形。我们还设计了一种领域特定的训练方法，利用地震数据集的特征：多个观测波形按时间对齐并与每个地震源配对，并标有地震元数据，包括地震震级、震源深度以及震中和地震仪的位置。我们使用南加州地震数据中心 (SCEDC) API 构建时间对齐的地震数据集，并使用该数据集和我们提出的训练方法训练我们的模型以进行性能评估。我们的模型不仅在波形生成领域，而且在地震学（例如相位到达时间、GMPE 分析和频谱分析）的各种测试标准中都超越了所有同类数据驱动方法。我们的研究结果为地震学中的深度学习应用开辟了新的未来研究方向。

Title: FFA Sora, video generation as fundus fluorescein angiography simulator

Authors: Xinyuan Wu, Lili Wang, Ruoyu Chen, Bowen Liu, Weiyi Zhang, Xi Yang, Yifan Feng, Mingguang He, Danli Shi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.17346
Pdf URL: https://arxiv.org/pdf/2412.17346
Copy Paste: [[2412.17346]] FFA Sora, video generation as fundus fluorescein angiography simulator(https://arxiv.org/abs/2412.17346)
Keywords: generation
Abstract: Fundus fluorescein angiography (FFA) is critical for diagnosing retinal vascular diseases, but beginners often struggle with image interpretation. This study develops FFA Sora, a text-to-video model that converts FFA reports into dynamic videos via a Wavelet-Flow Variational Autoencoder (WF-VAE) and a diffusion transformer (DiT). Trained on an anonymized dataset, FFA Sora accurately simulates disease features from the input text, as confirmed by objective metrics: Frechet Video Distance (FVD) = 329.78, Learned Perceptual Image Patch Similarity (LPIPS) = 0.48, and Visual-question-answering Score (VQAScore) = 0.61. Specific evaluations showed acceptable alignment between the generated videos and textual prompts, with BERTScore of 0.35. Additionally, the model demonstrated strong privacy-preserving performance in retrieval evaluations, achieving an average Recall@K of 0.073. Human assessments indicated satisfactory visual quality, with an average score of 1.570 (scale: 1 = best, 5 = worst). This model addresses privacy concerns associated with sharing large-scale FFA data and enhances medical education.
摘要：眼底荧光血管造影 (FFA) 对诊断视网膜血管疾病至关重要，但初学者经常难以解释图像。本研究开发了 FFA Sora，这是一个文本到视频的模型，可通过小波流变分自编码器 (WF-VAE) 和扩散变压器 (DiT) 将 FFA 报告转换为动态视频。FFA Sora 在匿名数据集上进行训练，可以准确模拟输入文本中的疾病特征，客观指标证实了这一点：Frechet 视频距离 (FVD) = 329.78、学习感知图像块相似度 (LPIPS) = 0.48 和视觉问答分数 (VQAScore) = 0.61。具体评估显示，生成的视频与文本提示之间的一致性可接受，BERTScore 为 0.35。此外，该模型在检索评估中表现出强大的隐私保护性能，平均 Recall@K 为 0.073。人工评估显示视觉质量令人满意，平均得分为 1.570（等级：1 = 最好，5 = 最差）。该模型解决了与共享大规模 FFA 数据相关的隐私问题，并增强了医学教育。

Title: ORIGAMI: A generative transformer architecture for predictions from semi-structured data

Authors: Thomas Rückstieß, Alana Huang, Robin Vujanic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.17348
Pdf URL: https://arxiv.org/pdf/2412.17348
Copy Paste: [[2412.17348]] ORIGAMI: A generative transformer architecture for predictions from semi-structured data(https://arxiv.org/abs/2412.17348)
Keywords: generative
Abstract: Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that ensures valid outputs and accelerates training convergence. These enhancements enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI's effectiveness: On standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, we outperform baselines on multi-label classification and specialized models such as convolutional and graph neural networks on a code classification task. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.
摘要：尽管 JSON 等半结构化数据格式非常流行且使用广泛，但直接应用于此类数据的端到端监督学习仍未得到充分探索。我们提出了 ORIGAMI（通过生成自回归模型进行对象表示），这是一种基于转换器的架构，可直接处理嵌套的键/值对，同时保留其层次语义。我们的主要技术贡献包括：(1) 结构保留标记器、(2) 新颖的键/值位置编码方案，以及 (3) 语法约束的训练和推理框架，可确保有效输出并加速训练收敛。这些增强功能可实现半结构化数据的高效端到端建模。通过将分类重新表述为下一个标记预测，ORIGAMI 可以自然地处理单标签和多标签任务，而无需进行架构修改。跨不同领域的实证评估证明了 ORIGAMI 的有效性：在转换为 JSON 的标准表格基准上，ORIGAMI 仍然与经典和最先进的方法具有竞争力。在原生 JSON 数据集上，我们在多标签分类方面的表现优于基准模型，在代码分类任务上的表现也优于卷积和图神经网络等专用模型。通过广泛的消融研究，我们验证了每个架构组件的影响，并将 ORIGAMI 确立为对半结构化数据进行端到端学习的强大框架。
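
As a rough illustration of what a structure-preserving tokenization of nested key/value data can look like, the sketch below flattens a JSON document into a token list that keeps object and array boundaries explicit. The token scheme (`<obj>`, `<arr>`, `("KEY", ...)`, `("VAL", ...)`) and the function name `tokenize` are hypothetical and are not taken from the ORIGAMI paper.

```python
import json


def tokenize(obj):
    """Flatten a JSON value into tokens while keeping the nesting explicit."""
    tokens = []
    if isinstance(obj, dict):
        tokens.append("<obj>")
        for key, value in obj.items():
            tokens.append(("KEY", key))
            tokens.extend(tokenize(value))
        tokens.append("</obj>")
    elif isinstance(obj, list):
        tokens.append("<arr>")
        for item in obj:
            tokens.extend(tokenize(item))
        tokens.append("</arr>")
    else:
        # Scalars (strings, numbers, booleans, null) become value tokens.
        tokens.append(("VAL", obj))
    return tokens


# Made-up example document; the "label" field is placed last on purpose.
doc = json.loads('{"user": {"name": "Ada", "tags": ["x", "y"]}, "label": 1}')
tokens = tokenize(doc)
```

With the target field placed last, classification can be read as next-token prediction: given every token up to `("KEY", "label")`, a model would be asked to predict the following value token.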

Title: A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

Authors: Youliang Zhang, Ronghui Li, Yachao Zhang, Liang Pan, Jingbo Wang, Yebin Liu, Xiu Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.17377
Pdf URL: https://arxiv.org/pdf/2412.17377
Copy Paste: [[2412.17377]] A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions(https://arxiv.org/abs/2412.17377)
Keywords: restoration
Abstract: Extracting physically plausible 3D human motion from videos is a critical task. Although existing simulation-based motion imitation methods can enhance the physical quality of daily motions estimated from monocular video capture, extending this capability to high-difficulty motions remains an open challenge. This can be attributed to some flawed motion clips in video-based motion capture results and the inherent complexity in modeling high-difficulty motions. Therefore, sensing the advantage of segmentation in localizing the human body, we introduce a mask-based motion correction module (MCM) that leverages motion context and video mask to repair flawed motions, producing imitation-friendly motions; and propose a physics-based motion transfer module (PTM), which employs a pretrain and adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. Our approach is designed as a plug-and-play module to physically refine the video motion capture results, including high-difficulty in-the-wild motions. Finally, to validate our approach, we collected a challenging in-the-wild test set to establish a benchmark, and our method has demonstrated effectiveness on both the new benchmark and existing public datasets. Project page: physicalmotionrestoration.…
摘要：从视频中提取物理上合理的三维人体运动是一项关键任务。虽然现有的基于模拟的运动模仿方法可以提高从单目视频捕捉中估计的日常运动的物理质量，但将这种能力扩展到高难度运动仍然是一个悬而未决的挑战。这可以归因于基于视频的运动捕捉结果中的一些有缺陷的运动片段以及建模高难度运动的固有复杂性。因此，我们意识到分割在定位人体方面的优势，引入了一个基于掩模的运动校正模块 (MCM)，利用运动上下文和视频掩模来修复有缺陷的运动，产生易于模仿的运动；并提出了一个基于物理的运动传递模块 (PTM)，它采用预训练和适应方法进行运动模仿，提高了物理合理性，并能够处理野外和具有挑战性的运动。我们的方法被设计为一个即插即用模块，用于物理地改进视频运动捕捉结果，包括高难度的野外运动。最后，为了验证我们的方法，我们收集了一个具有挑战性的野外测试集来建立基准，并且我们的方法已证明在新基准和现有公共数据集上的有效性。项目页面：physicalmotionrestoration.…

Title: Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement

Authors: Hyeonjin Kim, Jaejun Yoo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.17387
Pdf URL: https://arxiv.org/pdf/2412.17387
Copy Paste: [[2412.17387]] Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement(https://arxiv.org/abs/2412.17387)
Keywords: generative
Abstract: While pruning methods effectively maintain model performance without extra training costs, they often focus solely on preserving crucial connections, overlooking the impact of pruned weights on subsequent fine-tuning or distillation, leading to inefficiencies. Moreover, most compression techniques for generative models have been developed primarily for GANs, tailored to specific architectures like StyleGAN, and research into compressing Diffusion models has just begun. Even more, these methods are often applicable only to GANs or Diffusion models, highlighting the need for approaches that work across both model types. In this paper, we introduce Singular Value Scaling (SVS), a versatile technique for refining pruned weights, applicable to both model types. Our analysis reveals that pruned weights often exhibit dominant singular vectors, hindering fine-tuning efficiency and leading to suboptimal performance compared to random initialization. Our method enhances weight initialization by minimizing the disparities between singular values of pruned weights, thereby improving the fine-tuning process. This approach not only guides the compressed model toward superior solutions but also significantly speeds up fine-tuning. Extensive experiments on StyleGAN2, StyleGAN3 and DDPM demonstrate that SVS improves compression performance across model types without additional training costs. Our code is available at: this https URL.
摘要：虽然剪枝方法可以有效地保持模型性能而无需额外的训练成本，但它们通常只关注保留关键连接，而忽略了剪枝权重对后续微调或提炼的影响，从而导致效率低下。此外，大多数生成模型的压缩技术主要是为 GAN 开发的，针对 StyleGAN 等特定架构量身定制，而压缩扩散模型的研究才刚刚开始。更重要的是，这些方法通常仅适用于 GAN 或扩散模型，这凸显了对适用于这两种模型类型的方法的需求。在本文中，我们介绍了奇异值缩放 (SVS)，这是一种适用于这两种模型类型的多功能剪枝权重细化技术。我们的分析表明，剪枝权重通常表现出占主导地位的奇异向量，阻碍了微调效率并导致与随机初始化相比性能不佳。我们的方法通过最小化剪枝权重奇异值之间的差异来增强权重初始化，从而改进微调过程。这种方法不仅可以引导压缩模型获得更优的解决方案，还可以显著加快微调速度。在 StyleGAN2、StyleGAN3 和 DDPM 上进行的大量实验表明，SVS 可以在不增加训练成本的情况下提高跨模型类型的压缩性能。我们的代码可在以下网址获取：此 https URL。

Title: Multimodal Preference Data Synthetic Alignment with Reward Model

Authors: Robert Wijaya, Ngoc-Bao Nguyen, Ngai-Man Cheung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17417
Pdf URL: https://arxiv.org/pdf/2412.17417
Copy Paste: [[2412.17417]] Multimodal Preference Data Synthetic Alignment with Reward Model(https://arxiv.org/abs/2412.17417)
Keywords: generation, generative
Abstract: Multimodal large language models (MLLMs) have significantly advanced tasks like caption generation and visual question answering by integrating visual and textual data. However, they sometimes produce misleading or hallucinated content due to discrepancies between their pre-training data and real user prompts. Existing approaches using Direct Preference Optimization (DPO) in vision-language tasks often rely on strong models like GPT-4 or CLIP to determine positive and negative responses. Here, we propose a new framework for generating synthetic data using a reward model as a proxy of human preference for effective multimodal alignment with DPO training. The resulting DPO dataset, ranging from 2K to 9K image-text pairs, was evaluated on LLaVA-v1.5-7B, where our approach demonstrated substantial improvements in both the trustworthiness and reasoning capabilities of the base model across multiple hallucination and vision-language benchmarks. The experiment results indicate that integrating selected synthetic data, such as data from generative and reward models, can effectively reduce reliance on human-annotated data while enhancing MLLMs' alignment capability, offering a scalable solution for safer deployment.
摘要：多模态大型语言模型 (MLLM) 通过整合视觉和文本数据，显著提高了字幕生成和视觉问答等任务的执行效率。然而，由于预训练数据与真实用户提示之间存在差异，它们有时会产生误导性或幻觉内容。在视觉语言任务中使用直接偏好优化 (DPO) 的现有方法通常依赖于 GPT-4 或 CLIP 等强大模型来确定积极和消极反应。在这里，我们提出了一个新框架，使用奖励模型作为人类偏好的代理来生成合成数据，以便与 DPO 训练进行有效的多模态对齐。生成的 DPO 数据集范围从 2K 到 9K 个图像-文本对，在 LLaVA-v1.5-7B 上进行了评估，其中我们的方法在多个幻觉和视觉语言基准测试中展示了基础模型的可信度和推理能力的显着改进。实验结果表明，整合选定的合成数据（例如来自生成模型和奖励模型的数据）可以有效减少对人工注释数据的依赖，同时增强 MLLM 的对齐能力，为更安全的部署提供可扩展的解决方案。
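
For readers who want a reminder of what DPO training optimizes over a (chosen, rejected) pair, here is a minimal sketch of the standard per-pair DPO loss in plain Python. The function name, argument names, and the default `beta` are illustrative assumptions, and nothing here reproduces the paper's reward-model data pipeline.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up log-probabilities: the policy prefers the chosen response more
# strongly than the reference does, so the loss drops below log(2).
example_loss = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.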

Title: CALLIC: Content Adaptive Learning for Lossless Image Compression

Authors: Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.17464
Pdf URL: https://arxiv.org/pdf/2412.17464
Copy Paste: [[2412.17464]] CALLIC: Content Adaptive Learning for Lossless Image Compression(https://arxiv.org/abs/2412.17464)
Keywords: generative
Abstract: Learned lossless image compression has achieved significant advancements in recent years. However, existing methods often rely on training amortized generative models on massive datasets, resulting in sub-optimal probability distribution estimation for specific testing images during the encoding process. To address this challenge, we explore the connection between the Minimum Description Length (MDL) principle and Parameter-Efficient Transfer Learning (PETL), leading to the development of a novel content-adaptive approach for learned lossless image compression, dubbed CALLIC. Specifically, we first propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations, termed Masked Gated ConvFormer (MGCF), and pretrain MGCF on the training dataset. Cache then Crop Inference (CCI) is proposed to accelerate the coding process. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on the testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing the learning process and reducing adaptation time. Extensive experiments across diverse datasets demonstrate that CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression.
摘要：近年来，学习型无损图像压缩取得了重大进展。然而，现有的方法通常依赖于在海量数据集上训练摊销生成模型，导致在编码过程中对特定测试图像的概率分布估计不理想。为了应对这一挑战，我们探索了最小描述长度 (MDL) 原理与参数高效迁移学习 (PETL) 之间的联系，从而开发了一种新的内容自适应学习型无损图像压缩方法，称为 CALLIC。具体来说，我们首先利用卷积门控操作提出一种内容感知的自回归自注意力机制，称为 Masked Gated ConvFormer (MGCF)，并在训练数据集上对 MGCF 进行预训练。提出了缓存然后裁剪推理 (CCI) 来加速编码过程。在编码过程中，我们使用低秩矩阵分解预训练层，包括深度卷积，然后通过速率引导渐进微调 (RPFT) 调整测试图像上的增量权重。 RPFT 通过逐渐增加的按估计熵降序排列的补丁进行微调，从而优化学习过程并缩短适应时间。在不同数据集上进行的大量实验表明，CALLIC 为学习型无损图像压缩树立了新的最先进 (SOTA) 榜样。

Title: Improving the Noise Estimation of Latent Neural Stochastic Differential Equations

Authors: Linus Heck, Maximilian Gelbrecht, Michael T. Schaub, Niklas Boers
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.17499
Pdf URL: https://arxiv.org/pdf/2412.17499
Copy Paste: [[2412.17499]] Improving the Noise Estimation of Latent Neural Stochastic Differential Equations(https://arxiv.org/abs/2412.17499)
Keywords: generative
Abstract: Latent neural stochastic differential equations (SDEs) have recently emerged as a promising approach for learning generative models from stochastic time series data. However, they systematically underestimate the noise level inherent in such data, limiting their ability to capture stochastic dynamics accurately. We investigate this underestimation in detail and propose a straightforward solution: by including an explicit additional noise regularization in the loss function, we are able to learn a model that accurately captures the diffusion component of the data. We demonstrate our results on a conceptual model system that highlights the improved latent neural SDE's capability to model stochastic bistable dynamics.
摘要：潜在神经随机微分方程 (SDE) 最近成为一种很有前途的方法，用于从随机时间序列数据中学习生成模型。然而，它们系统地低估了此类数据中固有的噪声水平，限制了它们准确捕捉随机动态的能力。我们详细研究了这种低估现象，并提出了一个简单的解决方案：通过在损失函数中加入显式附加噪声正则化，我们能够学习一个准确捕捉数据扩散成分的模型。我们在概念模型系统上展示了我们的结果，突出了改进的潜在神经 SDE 建模随机双稳态动态的能力。
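
To illustrate what the "noise level" of stochastic bistable dynamics refers to, the sketch below simulates a one-dimensional double-well SDE, dx = (x - x^3) dt + sigma dW, with the Euler-Maruyama scheme and recovers sigma from the quadratic variation of the path. The model, parameter values, and function names are assumptions chosen for illustration; this is neither the paper's latent neural SDE nor its noise regularization.

```python
import math
import random


def simulate_double_well(sigma, dt=0.01, steps=20000, x0=1.0, seed=0):
    """Euler-Maruyama path of dx = (x - x^3) dt + sigma dW."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        drift = x - x ** 3
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        xs.append(x + drift * dt + noise)
    return xs


def estimate_sigma(xs, dt):
    """Estimate the diffusion strength from squared increments of the path."""
    quadratic_variation = sum((b - a) ** 2 for a, b in zip(xs, xs[1:]))
    return math.sqrt(quadratic_variation / (dt * (len(xs) - 1)))


xs = simulate_double_well(sigma=0.5)
sigma_hat = estimate_sigma(xs, dt=0.01)
```

A learned generative model that systematically underestimates the noise would produce paths whose estimate from the same procedure falls below the value used to generate the data.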

Title: Constructing Fair Latent Space for Intersection of Fairness and Explainability

Authors: Hyungjun Joo, Hyeonggeun Han, Sehwan Kim, Sangwoo Hong, Jungwoo Lee
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.17523
Pdf URL: https://arxiv.org/pdf/2412.17523
Copy Paste: [[2412.17523]] Constructing Fair Latent Space for Intersection of Fairness and Explainability(https://arxiv.org/abs/2412.17523)
Keywords: generation, generative
Abstract: As the use of machine learning models has increased, numerous studies have aimed to enhance fairness. However, research on the intersection of fairness and explainability remains insufficient, leading to potential issues in gaining the trust of actual users. Here, we propose a novel module that constructs a fair latent space, enabling faithful explanation while ensuring fairness. The fair latent space is constructed by disentangling and redistributing labels and sensitive attributes, allowing the generation of counterfactual explanations for each type of information. Our module is attached to a pretrained generative model, transforming its biased latent space into a fair latent space. Additionally, since only the module needs to be trained, there are advantages in terms of time and cost savings, without the need to train the entire generative model. We validate the fair latent space with various fairness metrics and demonstrate that our approach can effectively provide explanations for biased decisions and assurances of fairness.
摘要：随着机器学习模型的使用增加，许多研究都旨在提高公平性。然而，关于公平性和可解释性交集的研究仍然不足，导致在获得实际用户的信任方面存在潜在问题。在这里，我们提出了一个构建公平潜在空间的新模块，在确保公平的同时实现忠实的解释。公平潜在空间是通过解开和重新分配标签和敏感属性来构建的，允许为每种类型的信息生成反事实解释。我们的模块附加到预训练的生成模型，将其有偏潜在空间转换为公平潜在空间。此外，由于只需要训练模块，因此在节省时间和成本方面具有优势，而无需训练整个生成模型。我们用各种公平指标验证了公平潜在空间，并证明我们的方法可以有效地为有偏见的决策提供解释和公平保证。

Title: S-INF: Towards Realistic Indoor Scene Synthesis via Scene Implicit Neural Field

Authors: Zixi Liang, Guowei Xu, Haifeng Wu, Ye Huang, Wen Li, Lixin Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17561
Pdf URL: https://arxiv.org/pdf/2412.17561
Copy Paste: [[2412.17561]] S-INF: Towards Realistic Indoor Scene Synthesis via Scene Implicit Neural Field(https://arxiv.org/abs/2412.17561)
Keywords: generation, generative
Abstract: Learning-based methods have become increasingly popular in 3D indoor scene synthesis (ISS), showing superior performance over traditional optimization-based approaches. These learning-based methods typically model distributions on simple yet explicit scene representations using generative models. However, due to the oversimplified explicit representations that overlook detailed information and the lack of guidance from multimodal relationships within the scene, most learning-based methods struggle to generate indoor scenes with realistic object arrangements and styles. In this paper, we introduce a new method, Scene Implicit Neural Field (S-INF), for indoor scene synthesis, aiming to learn meaningful representations of multimodal relationships, to enhance the realism of indoor scene synthesis. S-INF assumes that the scene layout is often related to the object-detailed information. It disentangles the multimodal relationships into scene layout relationships and detailed object relationships, fusing them later through implicit neural fields (INFs). By learning specialized scene layout relationships and projecting them into S-INF, we achieve a realistic generation of scene layout. Additionally, S-INF captures dense and detailed object relationships through differentiable rendering, ensuring stylistic consistency across objects. Through extensive experiments on the benchmark 3D-FRONT dataset, we demonstrate that our method consistently achieves state-of-the-art performance under different types of ISS.
摘要：基于学习的方法在 3D 室内场景合成 (ISS) 中越来越受欢迎，与传统的基于优化的方法相比，其性能更优越。这些基于学习的方法通常使用生成模型对简单但明确的场景表示的分布进行建模。然而，由于过于简单的显式表示忽略了详细信息，并且缺乏场景内多模态关系的指导，大多数基于学习的方法都难以生成具有逼真的物体排列和风格的室内场景。在本文中，我们介绍了一种用于室内场景合成的新方法，即场景隐式神经场 (S-INF)，旨在学习多模态关系的有意义的表示，以增强室内场景合成的真实感。S-INF 假设场景布局通常与对象详细信息相关。它将多模态关系分解为场景布局关系和详细的对象关系，然后通过隐式神经场 (INF) 将它们融合。通过学习专门的场景布局关系并将其投影到 S-INF 中，我们实现了逼真的场景布局生成。此外，S-INF 通过可微分渲染捕捉密集且详细的对象关系，确保对象之间的风格一致性。通过在基准 3D-FRONT 数据集上进行大量实验，我们证明了我们的方法在不同类型的 ISS 下始终能够实现最佳性能。

Title: HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data

Authors: Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.17574
Pdf URL: https://arxiv.org/pdf/2412.17574
Copy Paste: [[2412.17574]] HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data(https://arxiv.org/abs/2412.17574)
Keywords: generation, quality assessment
Abstract: In the domain of Multimodal Large Language Models (MLLMs), achieving human-centric video understanding remains a formidable challenge. Existing benchmarks primarily emphasize object and action recognition, often neglecting the intricate nuances of human emotions, behaviors, and speech visual alignment within video content. We present HumanVBench, an innovative benchmark meticulously crafted to bridge these gaps in the evaluation of video MLLMs. HumanVBench comprises 17 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. With two advanced automated pipelines for video annotation and distractor-included QA generation, HumanVBench utilizes diverse state-of-the-art (SOTA) techniques to streamline benchmark data synthesis and quality assessment, minimizing human annotation dependency tailored to human-centric multimodal attributes. A comprehensive evaluation across 16 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and temporal alignment, underscoring the necessity for further refinement toward achieving more human-like understanding. HumanVBench is open-sourced to facilitate future advancements and real-world applications in video MLLMs.
摘要：在多模态大型语言模型 (MLLM) 领域，实现以人为中心的视频理解仍然是一项艰巨的挑战。现有的基准主要强调对象和动作识别，往往忽略了视频内容中人类情感、行为和语音视觉对齐的复杂细微差别。我们提出了 HumanVBench，这是一个精心设计的创新基准，旨在弥补视频 MLLM 评估中的这些差距。HumanVBench 包含 17 个精心设计的任务，探索两个主要维度：内在情感和外在表现，涵盖静态和动态、基本和复杂以及单模态和跨模态方面。HumanVBench 拥有两个用于视频注释和包含干扰项的 QA 生成的先进自动化管道，利用各种最先进的 (SOTA) 技术来简化基准数据合成和质量评估，最大限度地减少针对以人为中心的多模态属性的人工注释依赖。对 16 个 SOTA 视频 MLLM 的全面评估揭示了当前性能的明显局限性，尤其是在跨模态和时间对齐方面，这强调了进一步改进以实现更像人类的理解的必要性。HumanVBench 是开源的，旨在促进视频 MLLM 的未来发展和实际应用。

Title: EasyTime: Time Series Forecasting Made Easy

Authors: Xiangfei Qiu, Xiuwen Li, Ruiyang Pang, Zhicheng Pan, Xingjian Wu, Liu Yang, Jilin Hu, Yang Shu, Xuesong Lu, Chengcheng Yang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Bin Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.17603
Pdf URL: https://arxiv.org/pdf/2412.17603
Copy Paste: [[2412.17603]] EasyTime: Time Series Forecasting Made Easy(https://arxiv.org/abs/2412.17603)
Keywords: generation
Abstract: Time series forecasting has important applications across diverse domains. EasyTime, the system we demonstrate, facilitates easy use of time-series forecasting methods by researchers and practitioners alike. First, EasyTime enables one-click evaluation, enabling researchers to evaluate new forecasting methods using the suite of diverse time series datasets collected in the preexisting time series forecasting benchmark (TFB). This is achieved by leveraging TFB's flexible and consistent evaluation pipeline. Second, when practitioners must perform forecasting on a new dataset, a nontrivial first step is often to find an appropriate forecasting method. EasyTime provides an Automated Ensemble module that combines the promising forecasting methods to yield superior forecasting accuracy compared to individual methods. Third, EasyTime offers a natural language Q&A module leveraging large language models. Given a question like "Which method is best for long term forecasting on time series with strong seasonality?", EasyTime converts the question into SQL queries on the database of results obtained by TFB and then returns an answer in natural language and charts. By demonstrating EasyTime, we intend to show how it is possible to simplify the use of time series forecasting and to offer better support for the development of new generations of time series forecasting methods.
摘要：时间序列预测在不同领域都有重要应用。我们演示的系统 EasyTime 有助于研究人员和从业人员轻松使用时间序列预测方法。首先，EasyTime 支持一键式评估，使研究人员能够使用在现有时间序列预测基准 (TFB) 中收集的一系列不同时间序列数据集来评估新的预测方法。这是通过利用 TFB 灵活且一致的评估管道实现的。其次，当从业人员必须对新数据集进行预测时，重要的第一步通常是找到合适的预测方法。EasyTime 提供了一个自动集成模块，该模块结合了有前景的预测方法，与单个方法相比，其预测精度更高。第三，EasyTime 提供了一个利用大型语言模型的自然语言问答模块。给出一个问题，例如“哪种方法最适合对具有强烈季节性的时间序列进行长期预测？”，EasyTime 将问题转换为对 TFB 获得的结果数据库的 SQL 查询，然后以自然语言和图表的形式返回答案。通过演示 EasyTime，我们旨在展示如何简化时间序列预测的使用，并为新一代时间序列预测方法的开发提供更好的支持。
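
The question-to-SQL step can be pictured with a small in-memory database. The sketch below uses Python's built-in `sqlite3` module; the `results` table, its columns, the placeholder method names, and all numbers are invented for illustration and do not reflect the actual TFB schema or any benchmark result.

```python
import sqlite3

# Hypothetical results table; schema and rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "method TEXT, dataset TEXT, horizon INTEGER, seasonality TEXT, mae REAL)"
)
rows = [
    ("MethodA", "d1", 720, "strong", 0.40),
    ("MethodA", "d2", 720, "strong", 0.44),
    ("MethodB", "d1", 720, "strong", 0.35),
    ("MethodB", "d2", 720, "strong", 0.39),
    ("MethodB", "d3", 96, "weak", 0.90),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)

# One possible SQL reading of: "Which method is best for long term
# forecasting on time series with strong seasonality?"
query = """
SELECT method, AVG(mae) AS avg_mae
FROM results
WHERE horizon >= 336 AND seasonality = 'strong'
GROUP BY method
ORDER BY avg_mae ASC
LIMIT 1
"""
best_method, best_mae = conn.execute(query).fetchone()
```

In a system like the one described, a language model would generate the query text from the question and then phrase the returned row as a natural-language answer; here the query is written by hand.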

Title: SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images

Authors: Risa Shinoda, Kuniaki Saito, Shohei Tanaka, Tosho Hirasawa, Yoshitaka Ushiku
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17606
Pdf URL: https://arxiv.org/pdf/2412.17606
Copy Paste: [[2412.17606]] SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images(https://arxiv.org/abs/2412.17606)
Keywords: generation
Abstract: Building a large-scale figure QA dataset requires a considerable amount of work, from gathering and selecting figures to extracting attributes like text, numbers, and colors, and generating QAs. Although recent developments in LLMs have led to efforts to synthesize figures, most of these focus primarily on QA generation. Additionally, creating figures directly using LLMs often encounters issues such as code errors, similar-looking figures, and repetitive content in figures. To address this issue, we present SBSFigures (Stage-by-Stage Synthetic Figures), a dataset for pre-training figure QA. Our proposed pipeline enables the creation of chart figures with complete annotations of the visualized data and dense QA annotations without any manual annotation process. Our stage-by-stage pipeline makes it possible to create diverse topic and appearance figures efficiently while minimizing code errors. Our SBSFigures demonstrate a strong pre-training effect, making it possible to achieve efficient training with a limited amount of real-world chart data starting from our pre-trained weights.
摘要：构建大规模图形 QA 数据集需要大量工作，从收集和选择图形到提取文本、数字和颜色等属性，再到生成 QA。尽管 LLM 的最新发展已促成了对图形合成的尝试，但其中大部分主要集中在 QA 生成上。此外，直接使用 LLM 创建图形通常会遇到诸如代码错误、图形外观相似以及图形内容重复等问题。为了解决这个问题，我们提出了 SBSFigures（分阶段合成图形），这是一个用于预训练图形 QA 的数据集。我们提出的流程可以创建带有可视化数据完整注释和密集 QA 注释的图表图形，而无需任何手动注释过程。我们的分阶段流程可以高效地创建多样化的主题和外观图形，同时最大限度地减少代码错误。我们的 SBSFigures 展示了强大的预训练效果，使得从我们预训练的权重开始，使用有限数量的真实世界图表数据实现高效训练成为可能。

Title: Personalized Large Vision-Language Models

Authors: Chau Pham, Hoang Phan, David Doermann, Yunjie Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17610
Pdf URL: https://arxiv.org/pdf/2412.17610
Copy Paste: [[2412.17610]] Personalized Large Vision-Language Models(https://arxiv.org/abs/2412.17610)
Keywords: generation
Abstract: The personalization model has gained significant attention in image generation yet remains underexplored for large vision-language models (LVLMs). With personalization, LVLMs go beyond generic models and handle interactive dialogues using referential concepts (e.g., "Mike and Susan are talking.") instead of the generic form (e.g., "a boy and a girl are talking."), making the conversation more customizable and referentially friendly. In addition, PLVM is equipped to continuously add new concepts during a dialogue without incurring additional costs, which significantly enhances its practicality. PLVM proposes Aligner, a pre-trained visual encoder to align referential concepts with the queried images. During the dialogues, it extracts features of reference images with these corresponding concepts and recognizes them in the queried image, enabling personalization. We note that the computational cost and parameter count of the Aligner are negligible within the entire framework. With comprehensive qualitative and quantitative analyses, we reveal the effectiveness and superiority of PLVM.
摘要：个性化模型在图像生成中获得了极大的关注，但在大型视觉语言模型 (LVLM) 中仍未得到充分探索。除了通用模型之外，通过个性化，LVLM 使用参考概念（例如，“Mike 和 Susan 正在交谈”）而不是通用形式（例如，“一个男孩和一个女孩正在交谈”）来处理交互式对话，从而使对话更加可定制且参考友好。此外，PLVM 能够在对话过程中不断添加新概念而不会产生额外成本，这大大提高了实用性。PLVM 提出了 Aligner，这是一种预先训练的视觉编码器，用于将参考概念与查询图像对齐。在对话过程中，它提取具有这些相应概念的参考图像的特征并在查询图像中识别它们，从而实现个性化。我们注意到，Aligner 的计算成本和参数数量在整个框架中可以忽略不计。通过全面的定性和定量分析，我们揭示了 PLVM 的有效性和优越性。

Title: Be More Diverse than the Most Diverse: Online Selection of Diverse Mixtures of Generative Models

Authors: Parham Rezaei, Farzan Farnia, Cheuk Ting Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.17622
Pdf URL: https://arxiv.org/pdf/2412.17622
Copy Paste: [[2412.17622]] Be More Diverse than the Most Diverse: Online Selection of Diverse Mixtures of Generative Models(https://arxiv.org/abs/2412.17622)
Keywords: generation, generative
Abstract: The availability of multiple training algorithms and architectures for generative models requires a selection mechanism to form a single model over a group of well-trained generation models. The selection task is commonly addressed by identifying the model that maximizes an evaluation score based on the diversity and quality of the generated data. However, such a best-model identification approach overlooks the possibility that a mixture of available models can outperform each individual model. In this work, we explore the selection of a mixture of multiple generative models and formulate a quadratic optimization problem to find an optimal mixture model achieving the maximum of kernel-based evaluation scores including kernel inception distance (KID) and Rényi kernel entropy (RKE). To identify the optimal mixture of the models using the fewest possible sample queries, we propose an online learning approach called Mixture Upper Confidence Bound (Mixture-UCB). Specifically, our proposed online learning method can be extended to every convex quadratic function of the mixture weights, for which we prove a concentration bound to enable the application of the UCB approach. We prove a regret bound for the proposed Mixture-UCB algorithm and perform several numerical experiments to show the success of the proposed Mixture-UCB method in finding the optimal mixture of text-based and image-based generative models. The codebase is available at this https URL .
摘要：生成模型有多种训练算法和架构，因此需要一种选择机制，以便在一组训练良好的生成模型中形成单一模型。选择任务通常通过基于生成数据的多样性和质量识别出可最大化评估分数的模型来解决。然而，这种最佳模型识别方法忽略了可用模型的混合可能优于每个单独模型的可能性。在这项工作中，我们探索了多种生成模型的混合选择，并制定了一个二次优化问题，以找到一个最佳混合模型，该模型可实现包括核初始距离 (KID) 和 Rényi 核熵 (RKE) 在内的最大核评估分数。为了使用尽可能少的样本查询来识别模型的最佳混合，我们提出了一种称为混合上置信度界 (Mixture-UCB) 的在线学习方法。具体而言，我们提出的在线学习方法可以扩展到混合权重的每个凸二次函数，为此我们证明了浓度界，以便应用 UCB 方法。我们证明了所提出的 Mixture-UCB 算法的遗憾界限，并进行了几次数值实验，以证明所提出的 Mixture-UCB 方法在寻找基于文本和基于图像的生成模型的最佳混合方面取得了成功。代码库可在此 https URL 上找到。
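
The claim that a mixture can outperform every individual model can be illustrated with a small offline sketch. It relies on one standard fact: a squared-MMD-style score of a mixture (KID is of this type) is a convex quadratic in the mixture weights. The matrix K (average kernel values between samples of model i and model j) and the vector b (average kernel values between model i and the reference data) are made-up numbers for illustration, and the Frank-Wolfe solver below is a stand-in, not the paper's Mixture-UCB algorithm, which additionally handles online sampling.

```python
import numpy as np


def mixture_objective(alpha, K, b):
    """Quadratic score alpha^T K alpha - 2 b^T alpha (lower is better)."""
    return float(alpha @ K @ alpha - 2.0 * b @ alpha)


def frank_wolfe_simplex(K, b, n_iters=100):
    """Minimize the quadratic over the probability simplex.

    Uses Frank-Wolfe with exact line search; K is assumed to be
    symmetric positive semidefinite so the objective is convex.
    """
    m = len(b)
    alpha = np.full(m, 1.0 / m)  # start from the uniform mixture
    for _ in range(n_iters):
        grad = 2.0 * K @ alpha - 2.0 * b
        vertex = np.zeros(m)
        vertex[np.argmin(grad)] = 1.0  # best single model for the linearized score
        d = vertex - alpha
        denom = 2.0 * float(d @ K @ d)
        if denom <= 1e-12:
            step = 1.0 if grad @ d < 0 else 0.0
        else:
            step = float(np.clip(-(grad @ d) / denom, 0.0, 1.0))
        alpha = alpha + step * d
    return alpha


# Hypothetical kernel statistics for two generative models.
K = np.array([[1.0, 0.2], [0.2, 1.0]])
b = np.array([0.5, 0.5])

alpha_star = frank_wolfe_simplex(K, b)
single_model_scores = [mixture_objective(e, K, b) for e in np.eye(2)]
mixture_score = mixture_objective(alpha_star, K, b)
```

With these illustrative numbers the optimized mixture scores strictly lower than either model alone, because the two models are only weakly similar to each other (small off-diagonal entry in K) while being equally close to the reference data.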

Title: SCBench: A Sports Commentary Benchmark for Video LLMs

Authors: Kuangzhi Ge, Lingjun Chen, Kevin Zhang, Yulin Luo, Tianyu Shi, Liaoyuan Fan, Xiang Li, Guanqun Wang, Shanghang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.17637
Pdf URL: https://arxiv.org/pdf/2412.17637
Copy Paste: [[2412.17637]] SCBench: A Sports Commentary Benchmark for Video LLMs(https://arxiv.org/abs/2412.17637)
Keywords: generation
Abstract: Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models' capacity for generating in-depth and precise texts. Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. Inspired by these challenges, we propose a novel task, sports video commentary generation, and develop $\textbf{SCBench}$ for Video LLMs. To construct such a benchmark, we introduce (1) $\textbf{SCORES}$, a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) $\textbf{CommentarySet}$, a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric. Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g., VILA and Video-LLaVA) and chain-of-thought baseline methods. Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models' overall capabilities in complex visual understanding tasks. Our dataset will be released soon.
摘要：最近，学术界和业界在视频大型语言模型 (Video LLM) 方面取得了重大进展。然而，评估和基准测试不同视频 LLM 性能的方法，尤其是其细粒度、时间视觉能力，仍然非常有限。一方面，当前的基准测试使用相对简单的视频（例如，带字幕的电影剪辑），其中模型只需处理几帧即可理解整个视频。另一方面，它们的数据集缺乏任务格式的多样性，仅包含 QA 或多项选择 QA，这忽略了模型生成深入和精确文本的能力。体育视频具有复杂的视觉信息、连续事件和充满情感的评论，对视频 LLM 提出了严峻的挑战，使体育评论成为理想的基准测试任务。受这些挑战的启发，我们提出了一项新任务：体育视频评论生成，为视频 LLM 开发了 $\textbf{SCBench}$。为了构建这样的基准，我们引入了 (1) $\textbf{SCORES}$，这是一个专门为我们的任务设计的六维指标，我们在此基础上提出了一种基于 GPT 的评估方法，以及 (2) $\textbf{CommentarySet}$，这是一个由 5,775 个带注释的视频片段和根据我们的指标定制的地面实况标签组成的数据集。基于 SCBench，我们对多个视频 LLM（例如 VILA、Video-LLaVA 等）和思路链基线方法进行了全面评估。我们的结果发现，InternVL-Chat-2 以 5.44 的成绩取得了最佳表现，比第二名高出 1.04。我们的工作为未来的研究提供了新的视角，旨在提高模型在复杂视觉理解任务中的整体能力。我们的数据集即将发布。

Title: DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

Authors: Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17644
Pdf URL: https://arxiv.org/pdf/2412.17644
Copy Paste: [[2412.17644]] DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder(https://arxiv.org/abs/2412.17644)
Keywords: generation
Abstract: Diffusion models for garment-centric human generation from text or image prompts have garnered growing attention for their great application potential. However, existing methods often face a dilemma: lightweight approaches, such as adapters, are prone to generating inconsistent textures, while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation. DreamFit has three key advantages: (1) \textbf{Lightweight training}: with the proposed adaptive attention and LoRA modules, DreamFit reduces the model complexity to 83.4M trainable parameters. (2) \textbf{Anything-Dressing}: Our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) \textbf{Plug-and-play}: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers. To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both $768 \times 512$ high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities in garment-centric human generation.
摘要：用于从文本或图像提示生成以服装为中心的人体的扩散模型因其巨大的应用潜力而备受关注。然而，现有的方法往往面临一个困境：轻量级方法（例如适配器）容易生成不一致的纹理；而基于微调的方法涉及高昂的训练成本，并且难以保持预训练扩散模型的泛化能力，从而限制了它们在不同场景中的表现。为了应对这些挑战，我们提出了 DreamFit，它结合了一个轻量级的 Anything-Dressing 编码器，专门为以服装为中心的人体生成量身定制。DreamFit 具有三个主要优势：(1) \textbf{轻量级训练}：借助所提出的自适应注意和 LoRA 模块，DreamFit 将模型复杂度显著降低至 83.4M 个可训练参数。(2)\textbf{Anything-Dressing}：我们的模型出人意料地适用于各种（非）服装、创意风格和提示说明，可在各种场景中始终提供高质量的结果。 (3) \textbf{即插即用}：DreamFit 专为与任何社区控制插件无缝集成而设计，用于传播模型，确保轻松兼容并最大限度地减少采用障碍。为了进一步提高生成质量，DreamFit 利用预训练的大型多模态模型 (LMM) 来丰富提示，并提供细粒度的服装描述，从而缩小训练和推理之间的提示差距。我们对 $768 \times 512$ 高分辨率基准和野生图像进行了全面的实验。DreamFit 超越了所有现有方法，凸显了其以服装为中心的人类生成的最新能力。

Title: Benchmarking Generative AI Models for Deep Learning Test Input Generation

Authors: Maryam, Matteo Biagiola, Andrea Stocco, Vincenzo Riccio
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.17652
Pdf URL: https://arxiv.org/pdf/2412.17652
Copy Paste: [[2412.17652]] Benchmarking Generative AI Models for Deep Learning Test Input Generation(https://arxiv.org/abs/2412.17652)
Keywords: generation, generative
Abstract: Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images, in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.
摘要：测试输入生成器 (TIG) 对于评估深度学习 (DL) 图像分类器为其训练和测试集以外的输入提供正确预测的能力至关重要。生成式人工智能 (GenAI) 模型的最新进展使其成为创建和处理合成图像的强大工具，尽管这些进步也意味着训练的复杂性和资源需求增加。在这项工作中，我们对不同的 GenAI 模型进行基准测试并将其与 TIG 相结合，评估其有效性、效率和生成的测试图像的质量（领域有效性和标签保存方面）。我们进行了一项实证研究，涉及三种不同的 GenAI 架构（VAE、GAN、扩散模型）、五个复杂度不断增加的分类任务和 364 次人工评估。我们的结果表明，更简单的架构（例如 VAE）足以处理不太复杂的数据集（例如 MNIST）。但是，在处理功能丰富的数据集（例如 ImageNet）时，更复杂的架构（例如扩散模型）通过生成更多有效的、导致错误分类的输入来实现卓越的性能。

Title: A Bias-Free Training Paradigm for More General AI-generated Image Detection

Authors: Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, Luisa Verdoliva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17671
Pdf URL: https://arxiv.org/pdf/2412.17671
Copy Paste: [[2412.17671]] A Bias-Free Training Paradigm for More General AI-generated Image Detection(https://arxiv.org/abs/2412.17671)
Keywords: generation, generative
Abstract: Successful forensic detectors can produce excellent results in supervised learning benchmarks but struggle to transfer to real-world applications. We believe this limitation is largely due to inadequate training data quality. While most research focuses on developing new algorithms, less attention is given to training data selection, despite evidence that performance can be strongly impacted by spurious correlations such as content, format, or resolution. A well-designed forensic detector should detect generator specific artifacts rather than reflect data biases. To this end, we propose B-Free, a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. This ensures semantic alignment between real and fake images, allowing any differences to stem solely from the subtle artifacts introduced by AI generation. Through content-based augmentation, we show significant improvements in both generalization and robustness over state-of-the-art detectors and more calibrated results across 27 different generative models, including recent releases, like FLUX and Stable Diffusion 3.5. Our findings emphasize the importance of a careful dataset curation, highlighting the need for further research in dataset design. Code and data will be publicly available at this https URL
摘要：成功的取证检测器可以在监督学习基准中产生出色的结果，但很难转移到实际应用中。我们认为这种限制主要是由于训练数据质量不足。虽然大多数研究都集中在开发新算法上，但对训练数据选择的关注较少，尽管有证据表明性能可能会受到内容、格式或分辨率等虚假相关性的强烈影响。设计良好的取证检测器应该检测特定于生成器的伪影，而不是反映数据偏差。为此，我们提出了一种无偏差训练范式 B-Free，其中使用稳定扩散模型的条件程序从真实图像生成假图像。这确保了真实图像和假图像之间的语义对齐，允许任何差异仅源于 AI 生成引入的细微伪影。通过基于内容的增强，我们在泛化和鲁棒性方面都比最先进的检测器有显著的改进，并且在 27 种不同的生成模型（包括最新版本，如 FLUX 和 Stable Diffusion 3.5）中获得了更校准的结果。我们的研究结果强调了谨慎管理数据集的重要性，并强调了进一步研究数据集设计的必要性。代码和数据将在此 https URL 上公开提供

Title: GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Authors: Jingqiu Zhou, Lue Fan, Xuesong Chen, Linjiang Huang, Si Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17715
Pdf URL: https://arxiv.org/pdf/2412.17715
Copy Paste: [[2412.17715]] GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance(https://arxiv.org/abs/2412.17715)
Keywords: generation
Abstract: In this paper, we present GaussianPainter, the first method to paint a point cloud into 3D Gaussians given a reference image. GaussianPainter introduces an innovative feed-forward approach to overcome the limitations of time-consuming test-time optimization in 3D Gaussian splatting. Our method addresses a critical challenge in the field: the non-uniqueness problem inherent in the large parameter space of 3D Gaussian splatting. This space, encompassing rotation, anisotropic scales, and spherical harmonic coefficients, introduces the challenge of rendering similar images from substantially different Gaussian fields. As a result, feed-forward networks face instability when attempting to directly predict high-quality Gaussian fields, struggling to converge on consistent parameters for a given output. To address this issue, we propose to estimate a surface normal for each point to determine its Gaussian rotation. This strategy enables the network to effectively predict the remaining Gaussian parameters in the constrained space. We further enhance our approach with an appearance injection module, incorporating reference image appearance into Gaussian fields via a multiscale triplane representation. Our method successfully balances efficiency and fidelity in 3D Gaussian generation, achieving high-quality, diverse, and robust 3D content creation from point clouds in a single forward pass.
摘要：在本文中，我们介绍了 GaussianPainter，这是第一种在给定参考图像的情况下将点云绘制成 3D 高斯的方法。GaussianPainter 引入了一种创新的前馈方法来克服 3D 高斯溅射中耗时的测试时间优化的局限性。我们的方法解决了该领域的一个关键挑战：3D 高斯溅射的大参数空间中固有的非唯一性问题。这个空间包含旋转、各向异性尺度和球谐系数，带来了从完全不同的高斯场渲染相似图像的挑战。因此，前馈网络在尝试直接预测高质量高斯场时面临不稳定性，难以收敛到给定输出的一致参数。为了解决这个问题，我们建议估计每个点的表面法线以确定其高斯旋转。这种策略使网络能够有效地预测受限空间中剩余的高斯参数。我们通过外观注入模块进一步增强了我们的方法，通过多尺度三平面表示将参考图像外观合并到高斯场中。我们的方法成功地平衡了 3D 高斯生成的效率和保真度，在一次前向传递中从点云实现了高质量、多样化且强大的 3D 内容创建。

Title: VidTwin: Video VAE with Decoupled Structure and Dynamics

Authors: Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.17726
Pdf URL: https://arxiv.org/pdf/2412.17726
Copy Paste: [[2412.17726]] VidTwin: Video VAE with Decoupled Structure and Dynamics(https://arxiv.org/abs/2412.17726)
Keywords: generation, generative
Abstract: Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at this https URL.
摘要：视频自动编码器 (Video AE) 的最新进展显著提高了视频生成的质量和效率。在本文中，我们提出了一种新颖而紧凑的视频自动编码器 VidTwin，它将视频分解为两个不同的潜在空间：结构潜在向量，用于捕获整体内容和全局运动，以及动态潜在向量，用于表示细粒度细节和快速运动。具体来说，我们的方法利用编码器-解码器主干，并增强了两个子模块，分别用于提取这些潜在空间。第一个子模块使用 Q-Former 来提取低频运动趋势，然后进行下采样块以删除冗余的内容细节。第二个子模块沿空间维度对潜在向量进行平均以捕捉快速运动。大量实验表明，VidTwin 实现了 0.20% 的高压缩率和高重建质量（MCL-JCV 数据集上的 PSNR 为 28.14），并且在下游生成任务中表现高效。此外，我们的模型还展示了可解释性和可扩展性，为未来视频潜在表示和生成的研究铺平了道路。我们的代码已在此 https URL 上发布。

Title: Sensitivity Curve Maximization: Attacking Robust Aggregators in Distributed Learning

Authors: Christian A. Schroth, Stefan Vlaski, Abdelhak M. Zoubir
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2412.17740
Pdf URL: https://arxiv.org/pdf/2412.17740
Copy Paste: [[2412.17740]] Sensitivity Curve Maximization: Attacking Robust Aggregators in Distributed Learning(https://arxiv.org/abs/2412.17740)
Keywords: generation
Abstract: In distributed learning, agents aim to collaboratively solve a global learning problem. As the network grows, it becomes increasingly likely that individual agents are malicious or faulty. This leads to a degeneration or complete breakdown of the learning process. Classical aggregation schemes are prone to breakdown at small contamination rates; therefore, robust aggregation schemes are sought. While robust aggregation schemes can generally tolerate larger contamination rates, many have been shown to be susceptible to carefully crafted malicious attacks. In this work, we show how the sensitivity curve (SC), a classical tool from robust statistics, can be used to systematically derive optimal attack patterns against arbitrary robust aggregators, in most cases rendering them ineffective. We show the effectiveness of the proposed attack in multiple simulations.
摘要：在分布式学习中，代理旨在协作解决全局学习问题。随着网络规模的扩大，单个代理出现恶意行为或故障的可能性越来越大。这会导致学习过程退化或完全崩溃。经典聚合方案容易在污染率较低时崩溃，因此需要寻求稳健的聚合方案。虽然稳健聚合方案通常可以容忍较大的污染率，但许多方案已被证明容易受到精心设计的恶意攻击。在这项工作中，我们展示了如何使用灵敏度曲线 (SC)（一种来自稳健统计的经典工具）系统地得出针对任意稳健聚合器的最佳攻击模式，在大多数情况下使它们无效。我们在多次模拟中展示了所提出的攻击的有效性。
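
For readers unfamiliar with the sensitivity curve, a minimal sketch of the classical definition from robust statistics may help: SC_n(x) = n * (T_n(x_1, ..., x_{n-1}, x) - T_{n-1}(x_1, ..., x_{n-1})), i.e. the scaled change in an estimator T when one extra observation x is appended. The helper name and the toy sample are assumptions made for illustration; the code shows the tool itself, not the attack derived in the paper.

```python
import numpy as np


def sensitivity_curve(estimator, sample, x):
    """Scaled effect of appending one observation x to `sample`."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size + 1  # size of the contaminated sample
    return n * (estimator(np.append(sample, x)) - estimator(sample))


# Toy "honest" values, e.g. one coordinate of the agents' updates.
honest = [-1.0, 0.0, 1.0]

# Evaluate the curve at a moderate and at an extreme contaminating value.
sc_mean = [sensitivity_curve(np.mean, honest, x) for x in (100.0, 1e6)]
sc_median = [sensitivity_curve(np.median, honest, x) for x in (100.0, 1e6)]
```

On this sample the mean's sensitivity curve grows linearly with the contaminating value, whereas the median's stays at the same bounded value for both inputs. This difference is why a single faulty agent can break plain averaging but not a median-based aggregator.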

Title: Reasoning to Attend: Try to Understand How Token Works

Authors: Rui Qian, Xin Yin, Dejing Dou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17741
Pdf URL: https://arxiv.org/pdf/2412.17741
Copy Paste: [[2412.17741]] Reasoning to Attend: Try to Understand How Token Works(https://arxiv.org/abs/2412.17741)
Keywords: generation
Abstract: Visual grounding powered by current Large Multimodal Models (LMMs) typically relies on the $\texttt{}$ token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how it works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the $\texttt{}$ token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what the $\texttt{}$ token contributes to is the semantic similarity within image-text pairs. Specifically, the $\texttt{}$ token, a placeholder expanded in the text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from text to the paired image while the Large Language Models (LLMs) are being fine-tuned. Building on these findings, we present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, the Similarity as Points module (SasP), which can be seamlessly applied to $\texttt{}$-like paradigms in a plug-and-play fashion. Extensive experiments have been conducted on the ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All code and models are publicly available at this https URL.
摘要：当前大型多模态模型 (LMM) 增强了视觉基础，通常依赖 $\texttt{}$ 标记作为文本提示来联合优化视觉语言模型（例如 LLaVA）和下游任务指定模型（例如 SAM）。然而，我们观察到很少有研究调查它是如何工作的，我们首先可视化相似度图，这些相似度图是通过计算 $\texttt{}$ 标记与从 LLaVA 编码器和 SAM 解码器中的最后一个隐藏层派生的图像标记嵌入之间的语义相似度获得的。有趣的是，我们发现相似度图中的激活响应具有惊人的一致性，这表明 $\texttt{}$ 标记有助于图像-文本对内的语义相似性。具体来说，$\texttt{}$ 标记是文本词汇中扩展的占位符，它在对大型语言模型 (LLM) 进行微调时，在各个标记化图像块之间进行广泛查询，以将文本中的对象语义与配对图像相匹配。基于上述发现，我们提出了 READ，它在从相似性图中借用的高度激活点的指导下，促进了 LMM 的弹性 $\textbf{REA}$ 定位能力，即关注 $\textbf{D}$。值得注意的是，READ 具有直观的设计，即相似性作为点模块 (SasP)，可以无缝地应用于 $\texttt{}$ 类范例，即插即用。此 http URL，已在 ReasonSeg 和 RefCOCO(+/g) 数据集上进行了广泛的实验。为了验证 READ 在微调后是否会遭遇对先前技能的灾难性遗忘，我们进一步在增强型 FP-RefCOCO(+/g) 数据集上评估其生成能力。所有代码和模型均可在此 https URL 上公开获取。

Title: The Superposition of Diffusion Models Using the Itô Density Estimator

Authors: Marta Skreta, Lazar Atanackovic, Avishek Joey Bose, Alexander Tong, Kirill Neklyudov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.17762
Pdf URL: https://arxiv.org/pdf/2412.17762
Copy Paste: [[2412.17762]] The Superposition of Diffusion Models Using the Itô Density Estimator(https://arxiv.org/abs/2412.17762)
Keywords: generation
Abstract: The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log likelihood of the diffusion SDE which incurs no additional overhead compared to the well-known Hutchinson's estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins. this https URL
摘要：易于获取的预训练扩散模型的寒武纪大爆发表明，需要一种能够组合多种不同的预训练扩散模型的方法，而无需承担重新训练更大组合模型的巨大计算负担。在本文中，我们将在生成阶段组合多种预训练扩散模型的问题置于一个新提出的框架下，称为叠加。从理论上讲，我们从著名的连续性方程中得出严格的第一原理，并设计了两种专门用于在 SuperDiff 中组合扩散模型的新算法。SuperDiff 利用一种新的可扩展 Itô 密度估计量来计算扩散 SDE 的对数似然，与计算散度所需的众所周知的 Hutchinson 估计量相比，它不会产生额外的开销。我们证明 SuperDiff 可扩展到大型预训练扩散模型，因为叠加仅通过推理过程中的组合来执行，而且由于它通过自动重新加权方案组合了不同的预训练矢量场，因此实现起来也毫不费力。值得注意的是，我们表明 SuperDiff 在推理时非常高效，并且能够模拟传统的组合运算符，例如逻辑 OR 和逻辑 AND。我们通过经验证明了使用 SuperDiff 在 CIFAR-10 上生成更多样图像、使用稳定扩散进行更忠实的快速条件图像编辑以及改进的蛋白质无条件从头结构设计的实用性。此 https URL
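
The abstract compares the new estimator against Hutchinson's estimator, the usual baseline for divergence calculations. As background, here is a minimal sketch of that baseline: the trace of a matrix A (for a divergence, A would be the Jacobian of the vector field) is approximated by averaging z^T A z over random sign vectors z. The function name, the toy matrix, and the probe count are assumptions for illustration only; this is not the Itô density estimator proposed in the paper.

```python
import numpy as np


def hutchinson_trace(matvec, dim, n_probes=64, seed=0):
    """Estimate tr(A) using only matrix-vector products v -> A @ v."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        total += z @ matvec(z)
    return total / n_probes


# Toy matrix standing in for a Jacobian; its exact trace is 5.0.
A = np.array([[2.0, 0.1], [0.1, 3.0]])
estimate = hutchinson_trace(lambda v: A @ v, dim=2)

# With sign probes, diagonal matrices are recovered without sampling noise.
D = np.diag([1.0, 2.0, 3.0])
exact_for_diagonal = hutchinson_trace(lambda v: D @ v, dim=3)
```

Each probe costs one extra matrix-vector (in practice, Jacobian-vector) product, which is the overhead the abstract refers to when it says the proposed estimator adds none.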

Title: Large Motion Video Autoencoding with Cross-modal Video VAE

Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17805
Pdf URL: https://arxiv.org/pdf/2412.17805
Copy Paste: [[2412.17805]] Large Motion Video Autoencoding with Cross-modal Video VAE(https://arxiv.org/abs/2412.17805)
Keywords: generation
Abstract: Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at this https URL.
摘要：学习强大的视频变分自动编码器 (VAE) 对于减少视频冗余和促进高效视频生成至关重要。直接将图像 VAE 单独应用于各个帧可能会导致时间不一致和次优压缩率，因为缺乏时间压缩。现有的视频 VAE 已经开始解决时间压缩问题；然而，它们往往存在重建性能不足的问题。在本文中，我们提出了一种新颖而强大的视频自动编码器，能够进行高保真视频编码。首先，我们观察到，仅通过将图像 VAE 扩展为 3D VAE 来纠缠空间和时间压缩会引入运动模糊和细节失真伪影。因此，我们提出了时间感知的空间压缩，以更好地编码和解码空间信息。此外，我们集成了一个轻量级运动压缩模型以进一步进行时间压缩。其次，我们建议利用文本到视频数据集中固有的文本信息，并将文本指导纳入我们的模型。这显著提高了重建质量，特别是在细节保存和时间稳定性方面。第三，我们通过对图像和视频进行联合训练，进一步提高了模型的通用性，这不仅提高了重建质量，还使模型能够执行图像和视频自动编码。针对近期强大的基线进行的广泛评估证明了我们方法的卓越性能。项目网站可在以下位置找到：~\href{this https URL}{this https URL}。

Title: Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

Authors: Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17808
Pdf URL: https://arxiv.org/pdf/2412.17808
Copy Paste: [[2412.17808]] Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders(https://arxiv.org/abs/2412.17808)
Keywords: generation
Abstract: Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to a significant loss of geometric details, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. Such sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE achieves comparable reconstruction quality to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8$\times$ smaller (1,280 vs. > 10,000 codes). We will release our code and benchmark dataset to facilitate future research in 3D shape modeling.
摘要：最近的 3D 内容生成管道通常采用变分自编码器 (VAE) 将形状编码为紧凑的潜在表示，以进行基于扩散的生成。然而，形状 VAE 训练中广泛采用的均匀点采样策略往往会导致几何细节的显著丢失，从而限制形状重建和下游生成任务的质量。我们提出了 Dora-VAE，这是一种新颖的方法，它通过我们提出的尖锐边缘采样策略和双重交叉注意机制增强了 VAE 重建。通过在训练期间识别和优先处理具有高几何复杂度的区域，我们的方法显著提高了细粒度形状特征的保存效果。这种采样策略和双重注意机制使 VAE 能够专注于均匀采样方法通常会遗漏的关键几何细节。为了系统地评估 VAE 重建质量，我们还提出了 Dora-bench，这是一个通过尖锐边缘的密度量化形状复杂度的基准，引入了一个专注于这些显着几何特征的重建精度的新指标。在 Dora-bench 上进行的大量实验表明，Dora-VAE 实现了与最先进的密集 XCube-VAE 相当的重建质量，同时所需的潜在空间至少小 8$\times$（1,280 vs. > 10,000 个代码）。我们将发布我们的代码和基准数据集，以促进未来 3D 形状建模的研究。

Title: ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

Authors: Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J. Black, Yao Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.17811
Pdf URL: https://arxiv.org/pdf/2412.17811
Copy Paste: [[2412.17811]] ChatGarment: Garment Estimation, Generation and Editing via Large Language Models(https://arxiv.org/abs/2412.17811)
Keywords: generation
Abstract: We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions. Unlike previous methods that struggle in real-world scenarios or lack interactive editing capabilities, ChatGarment can estimate sewing patterns from in-the-wild images or sketches, generate them from text descriptions, and edit garments based on user instructions, all within an interactive dialogue. These sewing patterns can then be draped into 3D garments, which are easily animatable and simulatable. This is achieved by finetuning a VLM to directly generate a JSON file that includes both textual descriptions of garment types and styles and continuous numerical attributes. This JSON file is then used to create sewing patterns through a programming parametric model. To support this, we refine the existing programming model, GarmentCode, by expanding its garment type coverage and simplifying its structure for efficient VLM fine-tuning. Additionally, we construct a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs through an automated data pipeline. Extensive evaluations demonstrate ChatGarment's ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications. Code and data will be available at this https URL.
摘要：我们推出了 ChatGarment，这是一种利用大型视觉语言模型 (VLM) 自动根据图像或文本描述估算、生成和编辑 3D 服装的新方法。与以前在现实场景中遇到困难或缺乏交互式编辑功能的方法不同，ChatGarment 可以根据自然图像或草图估算缝纫图案，根据文本描述生成它们，并根据用户说明编辑服装，所有这些都在交互式对话中完成。然后可以将这些缝纫图案叠加到 3D 服装中，这些服装很容易制作动画和模拟。这是通过微调 VLM 来实现的，以直接生成一个 JSON 文件，该文件既包含服装类型和样式的文本描述，也包含连续的数字属性。然后，此 JSON 文件用于通过编程参数模型创建缝纫图案。为了支持这一点，我们改进了现有的编程模型 GarmentCode，扩大了其服装类型覆盖范围并简化了其结构，以实现高效的 VLM 微调。此外，我们通过自动化数据管道构建了图像到缝纫图案和文本到缝纫图案对的大规模数据集。广泛的评估表明 ChatGarment 能够根据多模式输入准确重建、生成和编辑服装，凸显了其彻底改变时尚和游戏应用程序工作流程的潜力。代码和数据将在此 https URL 上提供。

Title: FaceLift: Single Image to 3D Head with View Generation and GS-LRM

Authors: Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, Zhixin Shu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.17812
Pdf URL: https://arxiv.org/pdf/2412.17812
Copy Paste: [[2412.17812]] FaceLift: Single Image to 3D Head with View Generation and GS-LRM(https://arxiv.org/abs/2412.17812)
Keywords: generation
Abstract: We present FaceLift, a feed-forward approach for rapid, high-quality, 360-degree head reconstruction from a single image. Our pipeline begins by employing a multi-view latent diffusion model that generates consistent side and back views of the head from a single facial input. These generated views then serve as input to a GS-LRM reconstructor, which produces a comprehensive 3D representation using Gaussian splats. To train our system, we develop a dataset of multi-view renderings using synthetic 3D human head assets. The diffusion-based multi-view generator is trained exclusively on synthetic head images, while the GS-LRM reconstructor undergoes initial training on Objaverse followed by fine-tuning on synthetic head data. FaceLift excels at preserving identity and maintaining consistency across views. Despite being trained solely on synthetic data, FaceLift demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art methods in 3D head reconstruction, highlighting its practical applicability and robust performance on real-world images. In addition to single image reconstruction, FaceLift supports video inputs for 4D novel view synthesis and seamlessly integrates with 2D reanimation techniques to enable 3D facial animation. Project page: this https URL.
摘要：我们提出了 FaceLift，这是一种前馈方法，可从单个图像快速、高质量地重建 360 度头部。我们的流程首先采用多视图潜在扩散模型，该模型可从单个面部输入生成一致的头部侧面和背面视图。然后，这些生成的视图作为 GS-LRM 重建器的输入，该重建器使用高斯条纹生成全面的 3D 表示。为了训练我们的系统，我们使用合成的 3D 人体头部资产开发了一个多视图渲染数据集。基于扩散的多视图生成器专门在合成头部图像上进行训练，而 GS-LRM 重建器在 Objaverse 上进行初始训练，然后在合成头部数据上进行微调。FaceLift 擅长保留身份并保持视图之间的视图一致性。尽管仅使用合成数据进行训练，但 FaceLift 表现出对真实世界图像的出色泛化能力。通过大量的定性和定量评估，我们表明 FaceLift 在 3D 头部重建方面的表现优于最先进的方法，突出了其实际适用性和对真实世界图像的稳健性能。除了单幅图像重建外，FaceLift 还支持视频输入以进行 4D 新颖视图合成，并与 2D 动画技术无缝集成以实现 3D 面部动画。项目页面：此 https URL。