2024-12-10

Title: TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection

Authors: Jiankang Chen, Tong Zhang, Wei-Shi Zheng, Ruixuan Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05292
Pdf URL: https://arxiv.org/pdf/2412.05292
Copy Paste: [[2412.05292]] TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection(https://arxiv.org/abs/2412.05292)
Keywords: generation
Abstract: Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverages simple Jigsaw-based fake OOD data and rich semantic embeddings ('anchors') from the ChatGPT description of ID knowledge to help guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that rich textual representation of ID knowledge and fake OOD knowledge can help train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance was achieved on all the benchmarks. The code is available at this https URL.
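The abstract above mentions Jigsaw-based fake OOD data but does not spell out the procedure. The following is a minimal sketch of a generic jigsaw shuffle, assuming the fake outlier is made by cutting an ID image into a grid of patches and permuting them; the function name `jigsaw_shuffle` and the `grid` parameter are illustrative and not taken from the paper.

```python
import numpy as np


def jigsaw_shuffle(image, grid=4, rng=None):
    """Split an H x W x C image into grid x grid patches and permute them.

    Illustrative assumption: the permuted image serves as a fake OOD sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % grid or w % grid:
        raise ValueError("image height and width must be divisible by grid")
    ph, pw = h // grid, w // grid
    # Collect patches in row-major order.
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]
    order = rng.permutation(len(patches))
    # Reassemble the permuted patches row by row.
    rows = [
        np.concatenate([patches[order[r * grid + c]] for c in range(grid)], axis=1)
        for r in range(grid)
    ]
    return np.concatenate(rows, axis=0)


# Toy 8 x 8 RGB "image" with distinct pixel values.
toy_image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
fake_ood = jigsaw_shuffle(toy_image, grid=2, rng=np.random.default_rng(0))
```

The shuffle keeps local texture intact while destroying global object layout, which is the usual motivation for using jigsaw images as cheap outliers.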

Title: FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector

Authors: Jiankang Chen, Ling Deng, Zhiyong Gan, Wei-Shi Zheng, Ruixuan Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05293
Pdf URL: https://arxiv.org/pdf/2412.05293
Copy Paste: [[2412.05293]] FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector(https://arxiv.org/abs/2412.05293)
Keywords: generation
Abstract: Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core challenge in OOD detection is mitigating the model's overconfidence on OOD data. While recent methods using auxiliary outlier datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited due to costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework FodFoM that innovatively combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based on BLIP-2's image captioning capability, CLIP's vision-language knowledge, and Stable Diffusion's image generation ability. Jointly utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution (ID) images. For the second type, GroundingDINO's object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection methods. Extensive empirical evaluations show that image classifiers with the help of constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks. The code is available at this https URL.

Title: Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

Authors: Zhejun Zhang, Peter Karkus, Maximilian Igl, Wenhao Ding, Yuxiao Chen, Boris Ivanovic, Marco Pavone
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.05334
Pdf URL: https://arxiv.org/pdf/2412.05334
Copy Paste: [[2412.05334]] Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models(https://arxiv.org/abs/2412.05334)
Keywords: generative
Abstract: Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. The code is available at this https URL.
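The name Closest Among Top-K suggests a selection rule that can be sketched in a few lines. The toy below assumes that, at each rollout step, the policy's K most likely motion tokens are taken and the one whose resulting state lies nearest the logged ground-truth state is chosen. The 1D state, the `token_deltas` vocabulary, and both function names are hypothetical simplifications; the abstract does not give these details.

```python
def closest_among_top_k(probs, token_deltas, position, gt_next, k):
    """Pick, among the k most likely tokens, the one landing nearest gt_next."""
    # Indices of the k most likely tokens under the policy.
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Among those, choose the token whose resulting state is closest to ground truth.
    return min(top_k, key=lambda i: abs(position + token_deltas[i] - gt_next))


def cat_k_rollout(policy, token_deltas, start, gt_trajectory, k):
    """Unroll the policy in closed loop, steering each step toward the ground truth."""
    position, tokens, states = start, [], []
    for gt_next in gt_trajectory:
        probs = policy(position)  # token distribution at the current rolled-out state
        token = closest_among_top_k(probs, token_deltas, position, gt_next, k)
        position += token_deltas[token]
        tokens.append(token)
        states.append(position)
    return tokens, states


# Toy setup: four tokens that move a 1D agent by a fixed displacement.
toy_deltas = [-1.0, 0.0, 1.0, 2.0]
toy_policy = lambda position: [0.5, 0.3, 0.15, 0.05]
toy_tokens, toy_states = cat_k_rollout(toy_policy, toy_deltas, 0.0, [1.0, 2.0], k=3)
```

Under this reading, the rolled-out states stay near the data while still being states the policy itself is likely to visit, so the chosen tokens can serve as supervised targets without reinforcement learning.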

Title: Generative Model-Based Fusion for Improved Few-Shot Semantic Segmentation of Infrared Images

Authors: Junno Yun, Mehmet Akçakaya
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05341
Pdf URL: https://arxiv.org/pdf/2412.05341
Copy Paste: [[2412.05341]] Generative Model-Based Fusion for Improved Few-Shot Semantic Segmentation of Infrared Images(https://arxiv.org/abs/2412.05341)
Keywords: generative
Abstract: Infrared (IR) imaging is commonly used in various scenarios, including autonomous driving, fire safety and defense applications. Thus, semantic segmentation of such images is of great interest. However, this task faces several challenges, including data scarcity, differing contrast and input channel number compared to natural images, and emergence of classes not represented in databases in certain scenarios, such as defense applications. Few-shot segmentation (FSS) provides a framework to overcome these issues by segmenting query images using a few labeled support samples. However, existing FSS models for IR images require paired visible RGB images, which is a major limitation since acquiring such paired data is difficult or impossible in some applications. In this work, we develop new strategies for FSS of IR images by using generative modeling and fusion techniques. To this end, we propose to synthesize auxiliary data to provide additional channel information to complement the limited contrast in the IR images, as well as IR data synthesis for data augmentation. Here, the former helps the FSS model to better capture the relationship between the support and query sets, while the latter addresses the issue of data scarcity. Finally, to further improve the former aspect, we propose a novel fusion ensemble module for integrating the two different modalities. Our methods are evaluated on different IR datasets, and improve upon the state-of-the-art (SOTA) FSS models.

Title: Tabular data generation with tensor contraction layers and transformers

Authors: Aníbal Silva, André Restivo, Moisés Santos, Carlos Soares
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.05390
Pdf URL: https://arxiv.org/pdf/2412.05390
Copy Paste: [[2412.05390]] Tabular data generation with tensor contraction layers and transformers(https://arxiv.org/abs/2412.05390)
Keywords: generation, generative
Abstract: Generative modeling for tabular data has recently gained significant attention in the Deep Learning domain. Its objective is to estimate the underlying distribution of the data. However, estimating the underlying distribution of tabular data has its unique challenges. Specifically, this data modality is composed of mixed types of features, making it a non-trivial task for a model to learn intra-relationships between them. One approach to address the mix of feature types is to embed each feature into a continuous matrix via tokenization, while a solution to capture intra-relationships between variables is via the transformer architecture. In this work, we empirically investigate the potential of using embedding representations on tabular data generation, utilizing tensor contraction layers and transformers to model the underlying distribution of tabular data within Variational Autoencoders. Specifically, we compare four architectural approaches: a baseline VAE model, two variants that focus on tensor contraction layers and transformers respectively, and a hybrid model that integrates both techniques. Our empirical study, conducted across multiple datasets from the OpenML CC18 suite, compares models over density estimation and Machine Learning efficiency metrics. The main takeaway from our results is that leveraging embedding representations with the help of tensor contraction layers improves density estimation metrics while maintaining competitive performance in terms of machine learning efficiency.
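As a rough illustration of what a tensor contraction layer does to embedded tabular features, the sketch below contracts a batch of per-feature embeddings with one factor matrix per mode. The shapes, names, and the absence of bias and nonlinearity are assumptions made for illustration; the abstract does not specify the layer used in the paper.

```python
import numpy as np


def tensor_contraction_layer(x, w_features, w_embed):
    """Contract x of shape (batch, n_features, embed_dim) along both non-batch modes.

    w_features: (n_features, p) and w_embed: (embed_dim, q) are hypothetical
    learnable factors; the output has shape (batch, p, q).
    """
    return np.einsum("bde,dp,eq->bpq", x, w_features, w_embed)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6, 4))       # 5 rows, 6 tokenized features, 4-dim embeddings
w_features = rng.normal(size=(6, 3))  # compress the feature mode from 6 to 3
w_embed = rng.normal(size=(4, 2))     # compress the embedding mode from 4 to 2
y = tensor_contraction_layer(x, w_features, w_embed)
```

Unlike flattening followed by a dense layer, the contraction keeps the feature-by-embedding structure and needs only 6·3 + 4·2 weights in this toy case.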

Title: HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design

Authors: Jinwei Tang, Jiayin Qin, Kiran Thorat, Chen Zhu-Tian, Yu Cao, Yang (Katie) Zhao, Caiwen Ding
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2412.05393
Pdf URL: https://arxiv.org/pdf/2412.05393
Copy Paste: [[2412.05393]] HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design(https://arxiv.org/abs/2412.05393)
Keywords: generation
Abstract: With Large Language Models (LLMs) recently demonstrating impressive proficiency in code generation, it is promising to extend their abilities to Hardware Description Language (HDL). However, LLMs tend to generate single HDL code blocks rather than hierarchical structures for hardware designs, leading to hallucinations, particularly in complex designs like Domain-Specific Accelerators (DSAs). To address this, we propose HiVeGen, a hierarchical LLM-based Verilog generation framework that decomposes generation tasks into LLM-manageable hierarchical submodules. HiVeGen further harnesses the advantages of such hierarchical structures by integrating automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation, introducing weight-based retrieval to enhance code reuse, and enabling real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.

Title: DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Authors: Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2412.05430
Pdf URL: https://arxiv.org/pdf/2412.05430
Copy Paste: [[2412.05430]] DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA(https://arxiv.org/abs/2412.05430)
Keywords: generation
Abstract: Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non-coding DNA elements critical for regulating gene activity. In this study, we introduce DART-Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero-shot, probed, and fine-tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at this https URL.

Title: UniScene: Unified Occupancy-centric Driving Scene Generation

Authors: Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, Xin Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05435
Pdf URL: https://arxiv.org/pdf/2412.05435
Copy Paste: [[2412.05435]] UniScene: Unified Occupancy-centric Driving Scene Generation(https://arxiv.org/abs/2412.05435)
Keywords: generation
Abstract: Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms - semantic occupancy, video, and LiDAR - in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies of Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTA methods in occupancy, video, and LiDAR generation, which in turn benefits downstream driving tasks.

Title: A Graph-Based Approach for Conversational AI-Driven Personal Memory Capture and Retrieval in a Real-world Application

Authors: Savini Kashmira, Jayanaka L. Dantanarayana, Joshua Brodsky, Ashish Mahendra, Yiping Kang, Krisztian Flautner, Lingjia Tang, Jason Mars
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.05447
Pdf URL: https://arxiv.org/pdf/2412.05447
Copy Paste: [[2412.05447]] A Graph-Based Approach for Conversational AI-Driven Personal Memory Capture and Retrieval in a Real-world Application(https://arxiv.org/abs/2412.05447)
Keywords: generation
Abstract: TOBU is a novel mobile application that captures and retrieves 'personal memories' (pictures/videos together with stories and context around those moments) in a user-engaging AI-guided conversational approach. Our initial prototype showed that existing retrieval techniques such as retrieval-augmented generation (RAG) systems fall short due to their limitations in understanding memory relationships, causing low recall, hallucination, and unsatisfactory user experience. We design TOBUGraph, a novel graph-based retrieval approach. During capturing, TOBUGraph leverages large language models (LLMs) to automatically create a dynamic knowledge graph of memories, establishing context and relationships of those memories. During retrieval, TOBUGraph combines LLMs with the memory graph to achieve comprehensive recall through graph traversal. Our evaluation using real user data demonstrates that TOBUGraph outperforms multiple RAG implementations in both precision and recall, significantly improving user experience through improved retrieval accuracy and reduced hallucination.
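To make the idea of recall through graph traversal concrete, here is a toy sketch that is not TOBUGraph itself. It assumes memories and their context are nodes in an adjacency-list graph and that retrieval expands breadth-first from seed nodes matched to the query. The graph contents, the `max_hops` cutoff, and the function name are all hypothetical.

```python
from collections import deque


def retrieve_by_traversal(graph, seeds, max_hops=2):
    """Return nodes reachable from the seeds within max_hops edges, in BFS order."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    order = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order


# Hypothetical memory graph: edges link memories that share people or events.
memory_graph = {
    "beach_trip": ["grandma", "summer_2023"],
    "grandma": ["beach_trip", "birthday_dinner"],
    "summer_2023": ["beach_trip"],
    "birthday_dinner": ["grandma"],
    "ski_trip": [],
}
one_hop = retrieve_by_traversal(memory_graph, ["beach_trip"], max_hops=1)
two_hops = retrieve_by_traversal(memory_graph, ["beach_trip"], max_hops=2)
```

A related memory such as `birthday_dinner` is reached through the shared `grandma` node even though it has no direct link to the seed, which is the kind of relationship a flat similarity search can miss.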

Title: CigTime: Corrective Instruction Generation Through Inverse Motion Editing

Authors: Qihang Fang, Chengcheng Tang, Bugra Tekin, Yanchao Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05460
Pdf URL: https://arxiv.org/pdf/2412.05460
Copy Paste: [[2412.05460]] CigTime: Corrective Instruction Generation Through Inverse Motion Editing(https://arxiv.org/abs/2412.05460)
Keywords: generation
Abstract: Recent advancements in models linking natural language with human motions have shown significant promise in motion generation and editing based on instructional text. Motivated by applications in sports coaching and motor skill learning, we investigate the inverse problem: generating corrective instructional text, leveraging motion editing and generation models. We introduce a novel approach that, given a user's current motion (source) and the desired motion (target), generates text instructions to guide the user towards achieving the target motion. We leverage large language models to generate corrective texts and utilize existing motion generation and editing frameworks to compile datasets of triplets (source motion, target motion, and corrective text). Using this data, we propose a new motion-language model for generating corrective instructions. We present both qualitative and quantitative results across a diverse range of applications that largely improve upon baselines. Our approach demonstrates its effectiveness in instructional scenarios, offering text-based guidance to correct and enhance user performance.

Title: AI-powered Digital Twin of the Ocean: Reliable Uncertainty Quantification for Real-time Wave Height Prediction with Deep Ensemble

Authors: Dongeon Lee, Sunwoong Yang, Jae-Won Oh, Su-Gil Cho, Sanghyuk Kim, Namwoo Kang
Subjects: cs.LG, cs.CE, eess.SP, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.05475
Pdf URL: https://arxiv.org/pdf/2412.05475
Copy Paste: [[2412.05475]] AI-powered Digital Twin of the Ocean: Reliable Uncertainty Quantification for Real-time Wave Height Prediction with Deep Ensemble(https://arxiv.org/abs/2412.05475)
Keywords: generation
Abstract: Environmental pollution and the depletion of fossil fuels have prompted the need for eco-friendly power generation methods based on renewable energy. However, renewable energy sources often face challenges in providing stable power due to low energy density and non-stationarity. Wave energy converters (WECs), in particular, need reliable real-time wave height prediction to address these issues caused by irregular wave patterns, which can lead to the inefficient and unstable operation of WECs. In this study, we propose an AI-powered reliable real-time wave height prediction model, aiming for both high predictive accuracy and reliable uncertainty quantification (UQ). The proposed architecture, LSTM-DE, integrates long short-term memory (LSTM) networks for temporal prediction with deep ensemble (DE) for robust UQ, achieving accuracy and reliability in wave height prediction. To further enhance the reliability of the predictive models, uncertainty calibration is applied, which has proven to significantly improve the quality of the quantified uncertainty. Based on the real operational data obtained from an oscillating water column-wave energy converter (OWC-WEC) system in Jeju, South Korea, we demonstrate that the proposed LSTM-DE model architecture achieves notable predictive accuracy (R² > 0.9) while increasing the uncertainty quality by over 50% through a simple calibration technique. Furthermore, a comprehensive parametric study is conducted to explore the effects of key model hyperparameters, offering valuable guidelines for diverse operational scenarios, characterized by differences in wavelength, amplitude, and period. The findings show that the proposed method provides robust and reliable real-time wave height predictions, facilitating a digital twin of the ocean.
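Deep ensembles commonly quantify uncertainty by treating the members' Gaussian predictions as a uniform mixture. The sketch below shows that standard aggregation, assuming each ensemble member outputs a mean and a variance for the next wave height; whether LSTM-DE aggregates exactly this way is not stated in the abstract, and the function name is illustrative.

```python
def combine_ensemble(means, variances):
    """Mean and variance of a uniform mixture of Gaussian member predictions."""
    m = len(means)
    if m == 0 or m != len(variances):
        raise ValueError("means and variances must be non-empty and equally long")
    mean = sum(means) / m
    # E[y^2] of the mixture: average of each member's variance plus squared mean.
    second_moment = sum(v + mu * mu for mu, v in zip(means, variances)) / m
    return mean, second_moment - mean * mean


# Two hypothetical members predicting wave height in metres.
ensemble_mean, ensemble_var = combine_ensemble([1.0, 3.0], [0.5, 0.5])
```

The combined variance (1.5 here) exceeds each member's own variance (0.5) because disagreement between the member means adds to the total uncertainty.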

Title: Enhancing Sample Generation of Diffusion Models using Noise Level Correction

Authors: Abulikemu Abuduweili, Chenyang Yuan, Changliu Liu, Frank Permenter
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05488
Pdf URL: https://arxiv.org/pdf/2412.05488
Copy Paste: [[2412.05488]] Enhancing Sample Generation of Diffusion Models using Noise Level Correction(https://arxiv.org/abs/2412.05488)
Keywords: restoration, super-resolution, generation
Abstract: The denoising process of diffusion models can be interpreted as a projection of noisy samples onto the data manifold. Moreover, the noise level in these samples approximates their distance to the underlying manifold. Building on this insight, we propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Additionally, we extend this approach to various image restoration tasks by integrating task-specific constraints, including inpainting, deblurring, super-resolution, colorization, and compressed sensing. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios. Notably, the proposed noise level correction framework is compatible with existing denoising schedulers (e.g., DDIM), offering additional performance improvements.

Title: Uncovering Vision Modality Threats in Image-to-Image Tasks

Authors: Hao Cheng, Erjia Xiao, Jiayan Yang, Jiahang Cao, Qiang Zhang, Jize Zhang, Kaidi Xu, Jindong Gu, Renjing Xu
Subjects: cs.CV, cs.PF
Abstract URL: https://arxiv.org/abs/2412.05538
Pdf URL: https://arxiv.org/pdf/2412.05538
Copy Paste: [[2412.05538]] Uncovering Vision Modality Threats in Image-to-Image Tasks(https://arxiv.org/abs/2412.05538)
Keywords: generation
Abstract: Current image generation models can effortlessly produce high-quality, highly realistic images, but this also increases the risk of misuse. In various Text-to-Image or Image-to-Image tasks, attackers can generate a series of images containing inappropriate content by simply editing the language modality input. Currently, to prevent this security threat, the various guard or defense methods that are proposed also focus on defending the language modality. However, in practical applications, threats in the visual modality, particularly in tasks involving the editing of real-world images, pose greater security risks as they can easily infringe upon the rights of the image owner. Therefore, this paper uses a method named typographic attack to reveal that various image generation models also commonly face threats in the vision modality. Furthermore, we also evaluate the defense performance of various existing methods when facing threats in the vision modality and uncover their ineffectiveness. Finally, we propose the Vision Modal Threats in Image Generation Models (VMT-IGMs) dataset, which would serve as a baseline for evaluating the vision modality vulnerability of various image generation models.

Title: Text-to-3D Gaussian Splatting with Physics-Grounded Motion Generation

Authors: Wenqing Wang, Yun Fu
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05560
Pdf URL: https://arxiv.org/pdf/2412.05560
Copy Paste: [[2412.05560]] Text-to-3D Gaussian Splatting with Physics-Grounded Motion Generation(https://arxiv.org/abs/2412.05560)
Keywords: generation
Abstract: Text-to-3D generation is a valuable technology in virtual reality and digital content creation. While recent works have pushed the boundaries of text-to-3D generation, producing high-fidelity 3D objects with inefficient prompts and simulating their physics-grounded motion accurately still remain unsolved challenges. To address these challenges, we present an innovative framework that utilizes the Large Language Model (LLM)-refined prompts and diffusion priors-guided Gaussian Splatting (GS) for generating 3D models with accurate appearances and geometric structures. We also incorporate a continuum mechanics-based deformation map and color regularization to synthesize vivid physics-grounded motion for the generated 3D Gaussians, adhering to the conservation of mass and momentum. By integrating text-to-3D generation with physics-grounded motion synthesis, our framework renders photo-realistic 3D objects that exhibit physics-aware motion, accurately reflecting the behaviors of the objects under various forces and constraints across different materials. Extensive experiments demonstrate that our approach achieves high-quality 3D generations with realistic physics-grounded motion.

Title: TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances

Authors: Wenting Xu, Viorela Ila, Luping Zhou, Craig T. Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05596
Pdf URL: https://arxiv.org/pdf/2412.05596
Copy Paste: [[2412.05596]] TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances(https://arxiv.org/abs/2412.05596)
Keywords: generation
Abstract: The concept of function and affordance is a critical aspect of 3D scene understanding and supports task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we develop a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grand-child nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances and also object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG. We use a multi-task learning framework that learns both room classification and learns to define spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.

Title: RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation

Authors: Xiang Gao, Kai Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05605
Pdf URL: https://arxiv.org/pdf/2412.05605
Copy Paste: [[2412.05605]] RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation(https://arxiv.org/abs/2412.05605)
Keywords: generation
Abstract: The Segment Anything Model (SAM), originally built on a 2D Vision Transformer (ViT), excels at capturing global patterns in 2D natural images but struggles with 3D medical imaging modalities like CT and MRI. These modalities require capturing spatial information in volumetric space for tasks such as organ segmentation and tumor quantification. To address this challenge, we introduce RefSAM3D, which adapts SAM for 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder for direct 3D mask generation. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. By employing a hierarchical attention mechanism, our model effectively captures and integrates information across different scales. Extensive evaluations on multiple medical imaging datasets demonstrate the superior performance of RefSAM3D over state-of-the-art methods. Our contributions advance the application of SAM in accurately segmenting complex anatomical structures in medical imaging.

Title: Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC

Authors: Ming Tao, Bing-Kun Bao, Yaowei Wang, Changsheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05619
Pdf URL: https://arxiv.org/pdf/2412.05619
Copy Paste: [[2412.05619]] Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC(https://arxiv.org/abs/2412.05619)
Keywords: generation, generative
Abstract: Large pretrained diffusion models have demonstrated impressive generation capabilities and have been adapted to various downstream tasks. However, unlike Large Language Models (LLMs), which can learn multiple tasks in a single model based on instructed data, diffusion models always require additional branches, task-specific training strategies, and losses for effective adaptation to different downstream tasks. This task-specific fine-tuning approach brings two drawbacks. 1) The task-specific additional networks create gaps between pretraining and fine-tuning, which hinders the transfer of pretrained knowledge. 2) It necessitates careful additional network design, raising the barrier to learning and implementation and making it less user-friendly. Thus, a question arises: Can we achieve a simple, efficient, and general approach to fine-tune diffusion models? To this end, we propose ONE-PIC. It enhances the inherited generative ability in the pretrained diffusion models without introducing additional modules. Specifically, we propose In-Visual-Context Tuning, which constructs task-specific training data by arranging source images and target images into a single image. This approach makes downstream fine-tuning closer to the pretraining, allowing our model to adapt more quickly to various downstream tasks. Moreover, we propose a Masking Strategy to unify different generative tasks. This strategy transforms various downstream fine-tuning tasks into predictions of the masked portions. The extensive experimental results demonstrate that our method is simple and efficient, streamlining the adaptation process and achieving excellent performance at lower cost. Code is available at this https URL.
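Illustration (not from the paper): a minimal sketch of arranging a source image and a target image into one image and masking the part to be predicted. The side-by-side layout, the choice to mask the whole target half, and the function name `build_in_context_sample` are assumptions; the abstract does not specify these details.

```python
import numpy as np


def build_in_context_sample(source, target):
    """Combine source and target into one canvas and mask the target half.

    Assumed layout: source on the left, target on the right.
    Returns the full canvas, the masked model input, and the boolean mask.
    """
    assert source.shape == target.shape
    h, w, _ = source.shape
    canvas = np.concatenate([source, target], axis=1)  # shape (h, 2w, c)
    mask = np.zeros((h, 2 * w), dtype=bool)
    mask[:, w:] = True  # True marks pixels the model would have to predict
    masked_input = canvas.copy()
    masked_input[mask] = 0.0  # hide the target region
    return canvas, masked_input, mask


# Toy 4x4 RGB images with constant values.
source = np.full((4, 4, 3), 0.25, dtype=np.float32)
target = np.full((4, 4, 3), 0.75, dtype=np.float32)
canvas, masked_input, mask = build_in_context_sample(source, target)
```

Under this reading, different tasks would differ only in what is placed on the canvas and which region is masked, which is one way to interpret the "predictions of the masked portions" framing.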

Title: Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05628
Pdf URL: https://arxiv.org/pdf/2412.05628
Copy Paste: [[2412.05628]] Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising(https://arxiv.org/abs/2412.05628)
Keywords: generation, generative
Abstract: Transformer-based diffusion models have achieved significant advancements across a variety of generative tasks. However, producing high-quality outputs typically necessitates large transformer models, which result in substantial training and inference overhead. In this work, we investigate an alternative approach involving multiple experts for denoising, and introduce Remix-DiT, a novel method designed to enhance output quality at a low cost. The goal of Remix-DiT is to craft N diffusion experts for different denoising timesteps, yet without the need for expensive training of N independent models. To achieve this, Remix-DiT employs K basis models (where K < N) and utilizes learnable mixing coefficients to adaptively craft expert models. This design offers two significant advantages: first, although the total model size is increased, the model produced by the mixing operation shares the same architecture as a plain model, making the overall model as efficient as a standard diffusion transformer. Second, the learnable mixing adaptively allocates model capacity across timesteps, thereby effectively improving generation quality. Experiments conducted on the ImageNet dataset demonstrate that Remix-DiT achieves promising results compared to standard diffusion transformers and other multiple-expert methods. The code is available at this https URL.
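Illustration (not from the paper): a toy sketch of crafting N expert weight sets from K basis models with learnable mixing coefficients. The softmax normalization, the even split of timesteps across experts, and all sizes and names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, D = 2, 4, 6  # K basis models, N experts, D parameters per model (toy sizes)

basis = rng.normal(size=(K, D))   # flattened weights of the K basis models
logits = rng.normal(size=(N, K))  # mixing logits, one row per expert (learnable in training)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def expert_weights(timestep, num_timesteps=1000):
    """Mix the basis weights into the expert assigned to this timestep.

    Assumption: timesteps are split evenly into N contiguous intervals.
    """
    idx = min(timestep * N // num_timesteps, N - 1)
    coeff = softmax(logits[idx])  # shape (K,), sums to 1
    return coeff @ basis          # shape (D,), same shape as one plain model
```

The mixed result has the shape of a single basis model, which matches the abstract's point that the model produced by mixing shares the architecture of a plain model.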

Title: Biological Brain Age Estimation using Sex-Aware Adversarial Variational Autoencoder with Multimodal Neuroimages

Authors: Abd Ur Rehman, Azka Rehman, Muhammad Usman, Abdullah Shahid, Sung-Min Gho, Aleum Lee, Tariq M. Khan, Imran Razzak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05632
Pdf URL: https://arxiv.org/pdf/2412.05632
Copy Paste: [[2412.05632]] Biological Brain Age Estimation using Sex-Aware Adversarial Variational Autoencoder with Multimodal Neuroimages(https://arxiv.org/abs/2412.05632)
Keywords: generative
Abstract: Brain aging involves structural and functional changes and therefore serves as a key biomarker for brain health. Combining structural magnetic resonance imaging (sMRI) and functional magnetic resonance imaging (fMRI) has the potential to improve brain age estimation by leveraging complementary data. However, fMRI data, being noisier than sMRI, complicates multimodal fusion. Traditional fusion methods often introduce more noise than useful information, which can reduce accuracy compared to using sMRI alone. In this paper, we propose a novel multimodal framework for biological brain age estimation, utilizing a sex-aware adversarial variational autoencoder (SA-AVAE). Our framework integrates adversarial and variational learning to effectively disentangle the latent features from both modalities. Specifically, we decompose the latent space into modality-specific codes and shared codes to represent complementary and common information across modalities, respectively. To enhance the disentanglement, we introduce cross-reconstruction and shared-distinct distance ratio loss as regularization terms. Importantly, we incorporate sex information into the learned latent code, enabling the model to capture sex-specific aging patterns for brain age estimation via an integrated regressor module. We evaluate our model using the publicly available OpenBHB dataset, a comprehensive multi-site dataset for brain age estimation. The results from ablation studies and comparisons with state-of-the-art methods demonstrate that our framework outperforms existing approaches and shows significant robustness across various age groups, highlighting its potential for real-time clinical applications in the early detection of neurodegenerative diseases.

Title: Efficient Continuous Video Flow Model for Video Prediction

Authors: Gaurav Shrivastava, Abhinav Shrivastava
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05633
Pdf URL: https://arxiv.org/pdf/2412.05633
Copy Paste: [[2412.05633]] Efficient Continuous Video Flow Model for Video Prediction(https://arxiv.org/abs/2412.05633)
Keywords: generation
Abstract: Multi-step prediction models, such as diffusion and rectified flow models, have emerged as state-of-the-art solutions for generation tasks. However, these models exhibit higher latency in sampling new frames compared to single-step methods. This latency issue becomes a significant bottleneck when adapting such methods for video prediction tasks, given that a typical 60-second video comprises approximately 1.5K frames. In this paper, we propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks. Our approach not only reduces the number of sample steps required to predict the next frame but also minimizes computational demands by reducing the model size to one-third of the original size. We evaluate our method on standard video prediction datasets, including KTH, BAIR action robot, Human3.6M and UCF101, demonstrating its efficacy in achieving state-of-the-art performance on these benchmarks.
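Illustration (not from the paper): the frame count quoted in the abstract implies 25 frames per second, and the cost of a multi-step sampler grows linearly with the number of sampling steps per frame. The step counts below are hypothetical and only show the arithmetic.

```python
frames = 1500   # "approximately 1.5K frames" from the abstract
seconds = 60    # "a typical 60-second video"
fps = frames / seconds  # implied frame rate


def network_evals(num_frames, steps_per_frame):
    """Total network evaluations if each frame needs a fixed number of sampling steps."""
    return num_frames * steps_per_frame


# Hypothetical step counts, not taken from the paper.
many_steps = network_evals(frames, 50)
few_steps = network_evals(frames, 5)
```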

Title: HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing

Authors: Zihao Zhu, Hongbao Zhang, Guanzong Wu, Siwei Lyu, Baoyuan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05685
Pdf URL: https://arxiv.org/pdf/2412.05685
Copy Paste: [[2412.05685]] HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing(https://arxiv.org/abs/2412.05685)
Keywords: generation
Abstract: Visual-textual inconsistency (VTI) evaluation plays a crucial role in cleansing vision-language data. Its main challenges stem from the high variety of image captioning datasets, where differences in content can create a range of inconsistencies (e.g., inconsistencies in scene, entities, entity attributes, entity numbers, entity interactions). Moreover, variations in caption length can introduce inconsistencies at different levels of granularity as well. To tackle these challenges, we design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE), which can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. Specifically, the HMGIE framework is implemented by three consecutive modules. Firstly, the semantic graph generation module converts the image caption to a semantic graph for building a structural representation of all involved semantic items. Then, the hierarchical inconsistency evaluation module provides a progressive evaluation procedure with a dynamic question-answer generation and evaluation strategy guided by the semantic graph, producing a hierarchical inconsistency evaluation graph (HIEG). Finally, the quantitative evaluation module calculates the accuracy and completeness scores based on the HIEG, followed by a natural language explanation about the detection results. Moreover, to verify the efficacy and flexibility of the proposed framework on handling different image captioning datasets, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies. Extensive experiments on MVTID and other benchmark datasets demonstrate the superior performance of the proposed HMGIE to current state-of-the-art methods.

Title: Jointly RS Image Deblurring and Super-Resolution with Adjustable-Kernel and Multi-Domain Attention

Authors: Yan Zhang, Pengcheng Zheng, Chengxiao Zeng, Bin Xiao, Zhenghao Li, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05696
Pdf URL: https://arxiv.org/pdf/2412.05696
Copy Paste: [[2412.05696]] Jointly RS Image Deblurring and Super-Resolution with Adjustable-Kernel and Multi-Domain Attention(https://arxiv.org/abs/2412.05696)
Keywords: super-resolution, generation
Abstract: Remote Sensing (RS) image deblurring and Super-Resolution (SR) are common tasks in computer vision that aim at restoring RS image detail and spatial scale, respectively. However, real-world RS images often suffer from a complex combination of global low-resolution (LR) degeneration and local blurring degeneration. Although carefully designed deblurring and SR models perform well on these two tasks individually, a unified model that performs the joint RS image deblurring and super-resolution (JRSIDSR) task is still challenging due to the dilemma of reconstructing the global and local degeneration simultaneously. Additionally, existing methods struggle to capture the interrelationship between deblurring and SR processes, leading to suboptimal results. To tackle these issues, we give a unified theoretical analysis of RS images' spatial and blur degeneration processes and propose a dual-branch parallel network named AKMD-Net for the JRSIDSR task. AKMD-Net consists of two main branches: deblurring and super-resolution branches. In the deblurring branch, we design a pixel-adjustable kernel block (PAKB) to estimate the local and spatially varying blur kernels. In the SR branch, a multi-domain attention block (MDAB) is proposed to capture the global contextual information enhanced with high-frequency details. Furthermore, we develop an adaptive feature fusion (AFF) module to model the contextual relationships between the deblurring and SR branches. Finally, we design an adaptive Wiener loss (AW Loss) to suppress the prior noise in the reconstructed images. Extensive experiments demonstrate that the proposed AKMD-Net achieves state-of-the-art (SOTA) quantitative and qualitative performance on commonly used RS image datasets. The source code is publicly available at this https URL.

Title: A Tiered GAN Approach for Monet-Style Image Generation

Authors: FNU Neha, Deepshikha Bhati, Deepak Kumar Shukla, Md Amiruzzaman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05724
Pdf URL: https://arxiv.org/pdf/2412.05724
Copy Paste: [[2412.05724]] A Tiered GAN Approach for Monet-Style Image Generation(https://arxiv.org/abs/2412.05724)
Keywords: generation, generative
Abstract: Generative Adversarial Networks (GANs) have proven to be a powerful tool in generating artistic images, capable of mimicking the styles of renowned painters, such as Claude Monet. This paper introduces a tiered GAN model to progressively refine image quality through a multi-stage process, enhancing the generated images at each step. The model transforms random noise into detailed artistic representations, addressing common challenges such as instability in training, mode collapse, and output quality. This approach combines downsampling and convolutional techniques, enabling the generation of high-quality Monet-style artwork while optimizing computational efficiency. Experimental results demonstrate the architecture's ability to produce foundational artistic structures, though further refinements are necessary for achieving higher levels of realism and fidelity to Monet's style. Future work focuses on improving training methodologies and model complexity to bridge the gap between generated and true artistic images. Additionally, the limitations of traditional GANs in artistic generation are analyzed, and strategies to overcome these shortcomings are proposed.

Title: Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

Authors: Aditya Chinchure, Sahithya Ravi, Raymond Ng, Vered Shwartz, Boyang Li, Leonid Sigal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05725
Pdf URL: https://arxiv.org/pdf/2412.05725
Copy Paste: [[2412.05725]] Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events(https://arxiv.org/abs/2412.05725)
Keywords: generative
Abstract: The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no tasks, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies.

Title: Compositional Image Retrieval via Instruction-Aware Contrastive Learning

Authors: Wenliang Zhong, Weizhi An, Feng Jiang, Hehuan Ma, Yuzhi Guo, Junzhou Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05756
Pdf URL: https://arxiv.org/pdf/2412.05756
Copy Paste: [[2412.05756]] Compositional Image Retrieval via Instruction-Aware Contrastive Learning(https://arxiv.org/abs/2412.05756)
Keywords: generation
Abstract: Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference. CIR is inherently an instruction-following task, as the model needs to interpret and apply modifications to the image. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have shown promising results, their capability in interpreting and following modification instructions remains limited. Some research attempts to address this by incorporating Large Language Models (LLMs). However, these approaches still face challenges in effectively integrating multimodal information and instruction understanding. To tackle the above challenges, we propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representations, which significantly enhances the instruction-following capability for a comprehensive integration between images and instructions. Nevertheless, directly applying MLLMs introduces a new challenge since MLLMs are primarily designed for text generation rather than embedding extraction as required in CIR. To address this, we introduce a two-stage training strategy to efficiently learn a joint multimodal embedding space and further refine the ability to follow modification instructions by tuning the model on a triplet dataset similar to the CIR format. Extensive experiments on four public datasets (FashionIQ, CIRR, GeneCIS, and CIRCO) demonstrate the superior performance of our model, outperforming state-of-the-art baselines by a significant margin. Codes are available at the GitHub repository.
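Illustration (not from the paper): contrastive learning over a joint embedding space is commonly implemented with an InfoNCE loss using in-batch negatives. The sketch below is that generic loss, assuming one embedding per composed query and one per target image; the temperature value and function name are assumptions, and the paper's actual objective may differ.

```python
import numpy as np


def info_nce(query, target, temperature=0.07):
    """Generic InfoNCE loss: row i of `query` should match row i of `target`.

    All other rows in the batch act as negatives.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = q @ t.T / temperature                       # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # matching pairs lie on the diagonal


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))                           # made-up embeddings
aligned = info_nce(emb, emb)                             # every query matches its own target
shuffled = info_nce(emb, np.roll(emb, 1, axis=0))        # targets misaligned by one row
```

The loss is small when each composed query is closest to its own target and grows when the pairing is broken, which is the behavior a retrieval embedding needs.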

Title: ProtGO: A Transformer based Fusion Model for accurately predicting Gene Ontology (GO) Terms from full scale Protein Sequences

Authors: Azwad Tamir, Jiann-Shiun Yuan
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2412.05776
Pdf URL: https://arxiv.org/pdf/2412.05776
Copy Paste: [[2412.05776]] ProtGO: A Transformer based Fusion Model for accurately predicting Gene Ontology (GO) Terms from full scale Protein Sequences(https://arxiv.org/abs/2412.05776)
Keywords: generation
Abstract: Recent developments in next generation sequencing technology have led to the creation of extensive, open-source protein databases consisting of hundreds of millions of sequences. To render these sequences applicable in biomedical applications, they must be meticulously annotated by wet lab testing or extracting them from existing literature. Over the last few years, researchers have developed numerous automatic annotation systems, particularly deep learning models based on machine learning and artificial intelligence, to address this issue. In this work, we propose a transformer-based fusion model capable of predicting Gene Ontology (GO) terms from full-scale protein sequences, achieving state-of-the-art accuracy compared to other contemporary machine learning annotation systems. The approach performs particularly well on clustered split datasets, which comprise training and testing samples originating from distinct distributions that are structurally diverse. This demonstrates that the model is able to understand both short and long term dependencies within the enzyme's structure and can precisely identify the motifs associated with the various GO terms. Furthermore, the technique is lightweight and less computationally expensive compared to the benchmark methods, while at the same time unaffected by sequence length, rendering it appropriate for diverse applications with varying sequence lengths.

Title: BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Authors: Qinchan (Wing) Li, Kenneth Chen, Changyue (Tina) Su, Qi Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05780
Pdf URL: https://arxiv.org/pdf/2412.05780
Copy Paste: [[2412.05780]] BudgetFusion: Perceptually-Guided Adaptive Diffusion Models(https://arxiv.org/abs/2412.05780)
Keywords: generation, generative
Abstract: Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
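Illustration (not from the paper): one simple way to turn predicted perceptual quality per step count into a suggested number of diffusion steps is to pick the smallest step count whose predicted quality is within a tolerance of the best predicted value. The selection rule, the tolerance, the function name, and the numbers below are all assumptions; the abstract only states that perceptual metrics are predicted relative to diffusion steps.

```python
def pick_steps(predicted_quality, tolerance=0.01):
    """Return the smallest step count whose predicted quality is near the best.

    predicted_quality: dict mapping step count -> predicted perceptual score
    (higher is better). Hypothetical selection rule for illustration only.
    """
    best = max(predicted_quality.values())
    for steps in sorted(predicted_quality):
        if predicted_quality[steps] >= best - tolerance:
            return steps
    return max(predicted_quality)  # not reached: the best entry always qualifies


# Made-up predictions for two prompts.
simple_prompt = {10: 0.88, 20: 0.90, 30: 0.90, 40: 0.90, 50: 0.90}
complex_prompt = {10: 0.60, 20: 0.75, 30: 0.84, 40: 0.90, 50: 0.90}

steps_simple = pick_steps(simple_prompt, tolerance=0.03)
steps_complex = pick_steps(complex_prompt, tolerance=0.03)
```

With these made-up curves the simple prompt stops early and the complex one needs more steps, which is the kind of per-prompt budget the abstract motivates.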

Title: Open-Source Acceleration of Stable-Diffusion.cpp

Authors: Jingxu Ng, Cheng Lv, Pu Zhao, Wei Niu, Juyi Lin, Yanzhi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05781
Pdf URL: https://arxiv.org/pdf/2412.05781
Copy Paste: [[2412.05781]] Open-Source Acceleration of Stable-Diffusion.cpp(https://arxiv.org/abs/2412.05781)
Keywords: generation
Abstract: Stable diffusion plays a crucial role in generating high-quality images. However, image generation is time-consuming and memory-intensive. To address this, Stable-Diffusion.cpp (Sdcpp) emerges as an efficient inference framework to accelerate the diffusion models. Although it is lightweight, the current implementation of the ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both high inference latency and massive memory usage. To address this, in this work, we present an optimized version of Sdcpp leveraging the Winograd algorithm to accelerate 2D convolution operations, which are the primary bottleneck in the pipeline. By analyzing both dependent and independent computation graphs, we exploit the device's locality and parallelism to achieve substantial performance improvements. Our framework delivers correct end-to-end results across various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and SDXL-Turbo. Our evaluation results demonstrate a speedup of up to 2.76x for individual convolutional layers and an inference speedup of up to 4.79x for the overall image generation process, compared with the original Sdcpp. Homepage: this https URL
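Illustration (not from the paper): the idea behind Winograd convolution can be seen in the textbook one-dimensional case F(2,3), which computes two outputs of a 3-tap filter with 4 multiplications instead of 6. This is not the paper's 2D ggml implementation, only the standard minimal-filtering identity; function and variable names are chosen for this example.

```python
import numpy as np


def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, computed directly (6 multiplications)."""
    return np.array([
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ])


def winograd_f23(d, g):
    """Same two outputs using 4 multiplications between transformed data and filter."""
    # Filter transform: depends only on g, so it can be precomputed once per filter.
    u = np.array([g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]])
    # Input transform: additions and subtractions only.
    v = np.array([d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]])
    m = u * v  # the 4 element-wise multiplications
    # Output transform: additions and subtractions only.
    return np.array([m[0] + m[1] + m[2], m[1] - m[2] - m[3]])


d = np.array([1.0, 2.0, 3.0, 4.0])  # input tile
g = np.array([1.0, 0.0, -1.0])      # filter taps
```

The 2D variant nests the same transforms over rows and columns of an input tile, trading multiplications for cheaper additions.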

Title: Language-Guided Image Tokenization for Generation

Authors: Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05796
Pdf URL: https://arxiv.org/pdf/2412.05796
Copy Paste: [[2412.05796]] Language-Guided Image Tokenization for Generation(https://arxiv.org/abs/2412.05796)
Keywords: generation
Abstract: Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.
摘要：图像标记化是将原始图像像素转换为紧凑的低维潜在表示的过程，已被证明对于可扩展且高效的图像生成至关重要。然而，主流的图像标记化方法通常具有有限的压缩率，使得高分辨率图像生成在计算上非常昂贵。为了应对这一挑战，我们建议利用语言进行高效的图像标记化，并将我们的方法称为文本条件图像标记化 (TexTok)。TexTok 是一个简单而有效的标记化框架，它利用语言提供高级语义。通过对描述性文本标题进行标记化过程的条件化，TexTok 允许标记化过程专注于将细粒度的视觉细节编码为潜在标记，从而提高重建质量和压缩率。与没有文本条件的传统标记器相比，TexTok 在 ImageNet-256 和 -512 基准上分别在不同数量的标记上实现了 29.2% 和 48.1% 的平均重建 FID 改进。这些标记化改进始终转化为生成 FID 的 16.3% 和 34.3% 的平均改进。只需用 TexTok 替换 Diffusion Transformer (DiT) 中的标记器，我们的系统就可以实现 93.5 倍的推理加速，同时仍然优于仅使用 32 个标记的 ImageNet-512 上的原始 DiT。带有原始 DiT 生成器的 TexTok 在 ImageNet-256 和 -512 上分别获得了 1.46 和 1.62 的最佳 FID 分数。此外，我们展示了 TexTok 在文本到图像生成任务上的优势，有效地利用了标记化中的现成文本标题。

Title: SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation

Authors: Leigang Qu, Haochuan Li, Wenjie Wang, Xiang Liu, Juncheng Li, Liqiang Nie, Tat-Seng Chua
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2412.05818
Pdf URL: https://arxiv.org/pdf/2412.05818
Copy Paste: [[2412.05818]] SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation(https://arxiv.org/abs/2412.05818)
Keywords: generation
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (SILMM) that can enable LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations, but it is less suitable for LMMs with continuous visual features, as obtaining generation probabilities is challenging. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
摘要：大型多模态模型 (LMM) 在多模态理解和生成方面表现出了令人印象深刻的能力，推动了文本到图像生成的进步。然而，实现 LMM 的精确文本-图像对齐，特别是在组合场景中，仍然具有挑战性。现有的方法，例如多步骤生成的布局规划和从人工反馈或人工智能反馈中学习，严重依赖于及时的工程设计、昂贵的人工注释和持续的升级，限制了灵活性和可扩展性。在这项工作中，我们引入了一个与模型无关的迭代自我改进框架 (SILMM)，该框架可以使 LMM 提供有用且可扩展的自我反馈，并通过直接偏好优化 (DPO) 优化文本-图像对齐。DPO 可以很容易地应用于使用离散视觉标记作为中间图像表示的 LMM；但它不太适合具有连续视觉特征的 LMM，因为获得生成概率具有挑战性。为了使 SILMM 适应具有连续特征的 LMM，我们提出了一种多样性机制来获得多样化的表示，并提出了一种基于内核的连续 DPO 进行对齐。在三个组合文本到图像生成基准上进行的大量实验验证了 SILMM 的有效性和优越性，结果显示在 T2I-CompBench++ 上的改进超过 30%，在 DPG-Bench 上的改进约为 20%。
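
For readers unfamiliar with the DPO objective mentioned in this abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair in the discrete-token case, where sequence log-probabilities are available. It is not the paper's kernel-based continuous variant; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is a sequence log-probability: the policy's and the frozen
    reference model's scores for the preferred and the rejected sample.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # sample over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred sample.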

Title: Self-Guidance: Boosting Flow and Diffusion Generation on Their Own

Authors: Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, Guo-Jun Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05827
Pdf URL: https://arxiv.org/pdf/2412.05827
Copy Paste: [[2412.05827]] Self-Guidance: Boosting Flow and Diffusion Generation on Their Own(https://arxiv.org/abs/2412.05827)
Keywords: generation
Abstract: Proper guidance strategies are essential to get optimal generation results without re-training diffusion and flow-based text-to-image models. However, existing guidance methods require either specific training or strong inductive biases in the neural network architecture, potentially limiting their applications. To address these issues, in this paper, we introduce Self-Guidance (SG), a strong diffusion guidance that neither needs specific training nor requires certain forms of neural network architectures. Different from previous approaches, Self-Guidance calculates the guidance vectors by measuring the difference between the velocities of two successive diffusion timesteps. Therefore, SG can be readily applied to both conditional and unconditional models with flexible network architectures. We conduct intensive experiments on both text-to-image and text-to-video generation across flexible architectures including UNet-based models and diffusion transformer-based models. On current state-of-the-art diffusion models such as Stable Diffusion 3.5 and FLUX, SG significantly boosts the image generation performance in terms of FID and Human Preference Score. Moreover, we find that SG has a surprisingly positive effect on the generation of high-quality human body parts such as hands, faces, and arms, showing strong potential to overcome traditional challenges in human body generation with minimal effort. We will release our implementation of SG on SD 3.5 and FLUX models along with this paper.
摘要：适当的引导策略对于获得最佳生成结果至关重要，而无需重新训练扩散和基于流的文本到图像模型。然而，现有的指导要么需要特定的训练，要么需要神经网络架构的强大归纳偏差，这可能会限制它们的应用。为了解决这些问题，在本文中，我们引入了自引导 (SG)，这是一种强大的扩散引导，既不需要特定的训练，也不需要某些形式的神经网络架构。与以前的方法不同，自引导通过测量两个连续扩散时间步长的速度差异来计算引导向量。因此，SG 可以很容易地应用于具有灵活网络架构的条件和非条件模型。我们对文本到图像生成和文本到视频生成进行了密集的实验，这些实验涉及灵活的架构，包括基于 UNet 的模型和基于扩散变压器的模型。在当前最先进的扩散模型（如稳定扩散 3.5 和 FLUX）上，SG 显著提高了 FID 和人类偏好分数方面的图像生成性能。此外，我们发现 SG 对生成高质量人体（例如手、脸和手臂）具有令人惊讶的积极影响，显示出以最小的努力克服人体生成传统挑战的强大潜力。我们将与本文一起发布我们在 SD 3.5 和 FLUX 模型上对 SG 的实现。

Title: CSG: A Context-Semantic Guided Diffusion Approach in De Novo Musculoskeletal Ultrasound Image Generation

Authors: Elay Dahan, Hedda Cohen Indelman, Angeles M. Perez-Agosto, Carmit Shiran, Gopal Avinash, Doron Shaked, Nati Daniel
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.05833
Pdf URL: https://arxiv.org/pdf/2412.05833
Copy Paste: [[2412.05833]] CSG: A Context-Semantic Guided Diffusion Approach in De Novo Musculoskeletal Ultrasound Image Generation(https://arxiv.org/abs/2412.05833)
Keywords: generation, generative
Abstract: The use of synthetic images in medical imaging Artificial Intelligence (AI) solutions has been shown to be beneficial in addressing the limited availability of diverse, unbiased, and representative data. Despite the extensive use of synthetic image generation methods, controlling semantic variability and context details remains challenging, limiting their effectiveness in producing diverse and representative medical image datasets. In this work, we introduce a scalable semantic and context-conditioned generative model, coined CSG (Context-Semantic Guidance). This dual conditioning approach allows for comprehensive control over both structure and appearance, advancing the synthesis of realistic and diverse ultrasound images. We demonstrate the ability of CSG to generate findings (pathological anomalies) in musculoskeletal (MSK) ultrasound images. Moreover, we test the quality of the synthetic images using a three-fold validation protocol. The results show that the synthetic images generated by CSG improve the performance of semantic segmentation models, exhibit enhanced similarity to real images compared to the baseline methods, and are indistinguishable from real images according to a Turing test. Furthermore, we demonstrate an extension of CSG that enhances the variability space of images by synthetically generating augmentations of anatomical geometries and textures.
摘要：事实证明，在医学成像人工智能 (AI) 解决方案中使用合成图像有助于解决多样化、无偏见和有代表性的数据有限的可用性问题。尽管合成图像生成方法得到了广泛使用，但控制语义变化和上下文细节仍然具有挑战性，限制了它们在生成多样化和有代表性的医学图像数据集方面的有效性。在这项工作中，我们引入了一种可扩展的语义和上下文条件生成模型，称为 CSG（上下文语义指导）。这种双重条件方法可以全面控制结构和外观，从而促进逼真和多样化的超声图像的合成。我们展示了 CSG 在肌肉骨骼 (MSK) 超声图像中生成发现（病理异常）的能力。此外，我们使用三重验证协议测试合成图像的质量。结果表明，CSG 生成的合成图像提高了语义分割模型的性能，与基线方法相比，与真实图像的相似性增强，并且根据图灵测试与真实图像无法区分。此外，我们展示了 CSG 的扩展，它通过综合生成解剖几何和纹理的增强来增强图像的变异空间。

Title: MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Authors: Shuwei Shi, Biao Gong, Xi Chen, Dandan Zheng, Shuai Tan, Zizheng Yang, Yuyuan Li, Jingwen He, Kecheng Zheng, Jingdong Chen, Ming Yang, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05848
Pdf URL: https://arxiv.org/pdf/2412.05848
Copy Paste: [[2412.05848]] MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation(https://arxiv.org/abs/2412.05848)
Keywords: generation
Abstract: Image-to-video (I2V) generation is conditioned on a static image, and has recently been enhanced by using motion intensity as an additional control signal. These motion-aware models are appealing for generating diverse motion patterns, yet a reliable motion estimator for training such models on large-scale, in-the-wild video sets is lacking. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, and it is also very difficult for human annotators to label abstract motion intensity. Furthermore, motion intensity should reflect both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage contrastive learning on randomly paired videos to distinguish the video with greater motion intensity. Such a paradigm is annotation-friendly and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages allow the decoupled motion estimator to serve as a general plug-in enhancer for both data processing and video generation training.
摘要：图像到视频 (I2V) 生成以静态图像为条件，最近通过运动强度作为附加控制信号得到了增强。这些运动感知模型有助于生成多样化的运动模式，但缺乏可靠的运动估计器来在野外大规模视频集上训练此类模型。传统指标（例如 SSIM 或光流）很难推广到任意视频，而人类注释者很难标记抽象的运动强度。此外，运动强度应同时揭示局部物体运动和全局相机运动，这在以前尚未被研究过。本文通过一种新的运动估计器解决了这一挑战，该估计器能够测量视频中物体和相机的解耦运动强度。我们利用随机配对视频的对比学习，区分运动强度更大的视频。这种范例对注释很友好，并且易于扩展以实现稳定的运动估计性能。然后，我们提出了一种使用解耦运动估计器开发的新 I2V 模型，名为 MotionStone。实验结果证明了所提出的运动估计器的稳定性以及 MotionStone 在 I2V 生成方面的一流性能。这些优势使得解耦运动估计器可以作为数据处理和视频生成训练的通用插件增强器。

Title: 3D-Consistent Image Inpainting with Diffusion Models

Authors: Leonid Antsfeld, Boris Chidlovskii
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.05881
Pdf URL: https://arxiv.org/pdf/2412.05881
Copy Paste: [[2412.05881]] 3D-Consistent Image Inpainting with Diffusion Models(https://arxiv.org/abs/2412.05881)
Keywords: generative
Abstract: We address the problem of 3D inconsistency in image inpainting based on diffusion models. We propose a generative model using image pairs that belong to the same scene. To achieve 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that allows the model to recover 3D priors while training to denoise in 2D, without explicit 3D supervision. Training unconditional diffusion models with additional images as in-context guidance makes it possible to harmonize the masked and non-masked regions while repainting and ensures 3D consistency. We evaluate our method on one synthetic and three real-world datasets and show that it generates semantically coherent and 3D-consistent inpaintings and outperforms state-of-the-art methods.
摘要：我们解决了基于扩散模型的图像修复的 3D 不一致问题。我们提出了一种使用属于同一场景的图像对的生成模型。为了实现 3D 一致且语义一致的修复，我们通过将场景的另一种视角纳入去噪过程来修改生成扩散模型。这会产生一种归纳偏差，允许在训练 2D 去噪时恢复 3D 先验，而无需明确的 3D 监督。使用附加图像作为上下文指导来训练无条件扩散模型，可以在重新绘制时协调蒙版和非蒙版区域并确保 3D 一致性。我们在一个合成数据集和三个真实世界数据集上评估了我们的方法，并表明它可以生成语义一致且 3D 一致的修复，并且优于最先进的方法。

Title: XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Authors: Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.05896
Pdf URL: https://arxiv.org/pdf/2412.05896
Copy Paste: [[2412.05896]] XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference(https://arxiv.org/abs/2412.05896)
Keywords: generative
Abstract: Recently, the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably, its inference generates output tokens one by one, leading to many redundant computations. The widely used KV-Cache framework makes a compromise between time and space complexities. However, caching data creates an ever-growing memory demand that can quickly exhaust the limited memory capacity of modern accelerators such as GPUs, particularly in long-context inference tasks. Existing studies reduce memory consumption by evicting cached data that have less impact on inference accuracy. But the benefit in practice is far from ideal due to the static cache allocation across different LLM network layers. This paper observes that the layer-specific cached data have very different impacts on accuracy. We quantify this difference, and give experimental and theoretical validation. We accordingly make a formal analysis and show that customizing the cache size for each layer in a personalized manner can yield a significant memory reduction, while still providing comparable accuracy. We model the cache allocation as a combinatorial optimization problem and give a global optimal solution. In particular, we devise a mini- and sampling-based inference over a lightweight variant of the LLM model, so as to quickly capture the difference and then feed it into the personalized algorithms. Extensive experiments on real-world datasets demonstrate that our proposals can reduce KV cache memory consumption by 61.6% on average, improve computational efficiency by 2.1x, and increase throughput by up to 5.5x.
摘要：最近，生成式大型语言模型 (LLM) 在许多应用中取得了显著的成功。值得注意的是，它的推理会逐个生成输出标记，导致许多冗余计算。广泛使用的 KV-Cache 框架在时间和空间复杂性之间做出了妥协。然而，缓存数据会产生越来越大的内存需求，这会很快耗尽现代加速器（如 GPU）有限的内存容量，尤其是在长上下文推理任务中。现有研究通过逐出一些对推理准确性影响较小的缓存数据来减少内存消耗。但由于不同 LLM 网络层之间的静态缓存分配，实践中的好处远非理想。本文观察到特定层的缓存数据对准确性的影响非常不同。我们量化了这种差异，并给出了实验和理论验证。我们据此进行了正式分析，并表明以个性化的方式定制每个层的缓存大小可以显着减少内存，同时仍提供相当的准确性。我们将缓存分配模拟为组合优化问题，并给出全局最优解。具体来说，我们设计了一种基于 LLM 模型轻量级变体的迷你和采样推理，以便快速捕捉差异并将其输入到个性化算法中。在真实数据集上进行的大量实验表明，我们的提案平均可将 KV 缓存内存消耗降低 61.6%，将计算效率提高 2.1 倍，然后将吞吐量提高 5.5 倍。

Title: Accelerating Video Diffusion Models via Distribution Matching

Authors: Yuanzhi Zhu, Hanshu Yan, Huan Yang, Kai Zhang, Junnan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05899
Pdf URL: https://arxiv.org/pdf/2412.05899
Copy Paste: [[2412.05899]] Accelerating Video Diffusion Models via Distribution Matching(https://arxiv.org/abs/2412.05899)
Keywords: generation, generative
Abstract: Generative models, particularly diffusion models, have achieved significant success in data synthesis across various modalities, including images, videos, and 3D assets. However, current diffusion models are computationally intensive, often requiring numerous sampling steps that limit their practical application, especially in video generation. This work introduces a novel framework for diffusion distillation and distribution matching that dramatically reduces the number of inference steps while maintaining, and potentially improving, generation quality. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator, specifically targeting video generation. By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames with substantially fewer sampling steps. To be specific, the proposed method incorporates a denoising GAN discriminator to distil from the real data and a pre-trained image diffusion model to enhance the frame quality and the prompt-following capabilities. Experimental results using AnimateDiff as the teacher model showcase the method's effectiveness, achieving superior performance in just four sampling steps compared to existing techniques.
摘要：生成模型，尤其是扩散模型，在各种模态的数据合成方面取得了重大成功，包括图像、视频和 3D 资产。然而，当前的扩散模型计算量大，通常需要大量的采样步骤，这限制了它们的实际应用，尤其是在视频生成中。这项工作引入了一种用于扩散蒸馏和分布匹配的新框架，可显着减少推理步骤的数量，同时保持并可能提高生成质量。我们的方法侧重于将预训练的扩散模型提炼成更高效的几步生成器，专门针对视频生成。通过利用视频 GAN 损失和新颖的 2D 分数分布匹配损失的组合，我们展示了以更少的采样步骤生成高质量视频帧的潜力。具体来说，所提出的方法结合了去噪 GAN 鉴别器来从真实数据中提取数据，以及预训练的图像扩散模型来提高帧质量和提示跟踪能力。使用 AnimateDiff 作为教师模型的实验结果展示了该方法的有效性，与现有技术相比，仅用四个采样步骤就实现了卓越的性能。

Title: GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting and Meshing

Authors: Jianing Zhang, Yuchao Zheng, Ziwei Li, Qionghai Dai, Xiaoyun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05908
Pdf URL: https://arxiv.org/pdf/2412.05908
Copy Paste: [[2412.05908]] GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting and Meshing(https://arxiv.org/abs/2412.05908)
Keywords: generative
Abstract: Gaussian splatting has gained attention for its efficient representation and rendering of 3D scenes using continuous Gaussian primitives. However, it struggles with sparse-view inputs due to limited geometric and photometric information, causing ambiguities in depth, shape, and texture. We propose GBR: Generative Bundle Refinement, a method for high-fidelity Gaussian splatting and meshing using only 4-6 input views. GBR integrates a neural bundle adjustment module to enhance geometry accuracy and a generative depth refinement module to improve geometry fidelity. More specifically, the neural bundle adjustment module integrates a foundation network to produce initial 3D point maps and point matches from unposed images, followed by bundle adjustment optimization to improve multiview consistency and point cloud accuracy. The generative depth refinement module employs a diffusion-based strategy to enhance geometric details and fidelity while preserving the scale. Finally, for Gaussian splatting optimization, we propose a multimodal loss function incorporating depth and normal consistency, geometric regularization, and pseudo-view supervision, providing robust guidance under sparse-view conditions. Experiments on widely used datasets show that GBR significantly outperforms existing methods under sparse-view inputs. Additionally, GBR demonstrates the ability to reconstruct and render large-scale real-world scenes, such as the Pavilion of Prince Teng and the Great Wall, with remarkable details using only 6 views.
摘要：高斯溅射因其使用连续高斯基元高效表示和渲染 3D 场景而备受关注。然而，由于几何和光度信息有限，它在处理稀疏视图输入时会遇到困难，导致深度、形状和纹理模糊。我们提出了 GBR：生成束细化，一种仅使用 4-6 个输入视图进行高保真高斯溅射和网格划分的方法。GBR 集成了神经束调整模块以增强几何精度，并集成了生成深度细化模块以提高几何保真度。更具体地说，神经束调整模块集成了基础网络，从未调整的图像中生成初始 3D 点图和点匹配，然后进行束调整优化以提高多视图一致性和点云精度。生成深度细化模块采用基于扩散的策略来增强几何细节和保真度，同时保持尺度。最后，对于高斯分层优化，我们提出了一种结合深度和法线一致性、几何正则化和伪视图监督的多模态损失函数，在稀疏视图条件下提供稳健的指导。在广泛使用的数据集上进行的实验表明，GBR 在稀疏视图输入下的表现明显优于现有方法。此外，GBR 还展示了仅使用 6 个视图即可重建和渲染大规模真实场景（如滕王阁和长城）并呈现非凡细节的能力。

Title: BiDM: Pushing the Limit of Quantization for Diffusion Models

Authors: Xingyu Zheng, Xianglong Liu, Yichen Bian, Xudong Ma, Yulun Zhang, Jiakai Wang, Jinyang Guo, Haotong Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05926
Pdf URL: https://arxiv.org/pdf/2412.05926
Copy Paste: [[2412.05926]] BiDM: Pushing the Limit of Quantization for Diffusion Models(https://arxiv.org/abs/2412.05926)
Keywords: generation, generative
Abstract: Diffusion models (DMs) have been significantly developed and widely used in various applications due to their excellent generative qualities. However, the expensive computation and massive parameters of DMs hinder their practical use in resource-constrained scenarios. As one of the effective compression approaches, quantization allows DMs to achieve storage saving and inference acceleration by reducing bit-width while maintaining generation performance. However, as the most extreme form of quantization, 1-bit binarization causes the generation performance of DMs to degrade severely or even collapse. This paper proposes a novel method, namely BiDM, for fully binarizing weights and activations of DMs, pushing quantization to the 1-bit limit. From a temporal perspective, we introduce the Timestep-friendly Binary Structure (TBS), which uses learnable activation binarizers and cross-timestep feature connections to address the highly timestep-correlated activation features of DMs. From a spatial perspective, we propose Space Patched Distillation (SPD) to address the difficulty of matching binary features during distillation, focusing on the spatial locality of image generation tasks and noise estimation networks. As the first work to fully binarize DMs, the W1A1 BiDM on the LDM-4 model for LSUN-Bedrooms 256$\times$256 achieves a remarkable FID of 22.74, significantly outperforming the current state-of-the-art general binarization methods, which reach an FID of 59.44 and produce invalid generated samples, and achieves savings of up to 28.0 times in storage and 52.7 times in OPs. The code is available at this https URL .
摘要：扩散模型 (DM) 因其出色的生成特性而得到了长足发展并被广泛应用于各种应用。然而，DM 昂贵的计算成本和海量参数阻碍了其在资源受限场景中的实际应用。作为有效的压缩方法之一，量化使 DM 通过减少位宽在保持生成性能的同时实现存储节省和推理加速。然而，作为最极端的量化形式，1 位二值化导致 DM 的生成性能严重下降甚至崩溃。本文提出了一种新方法，即 BiDM，用于对 DM 的权重和激活进行完全二值化，将量化推向 1 位极限。从时间角度来看，我们引入了时间步友好的二元结构 (TBS)，它使用可学习的激活二值化器和跨时间步特征连接来解决 DM 中高度时间步相关的激活特征。从空间角度，我们提出了空间补丁蒸馏（SPD）来解决蒸馏过程中匹配二进制特征的困难，重点关注图像生成任务和噪声估计网络的空间局部性。作为第一个完全二值化 DM 的工作，LSUN-Bedrooms 256$\times$256 的 LDM-4 模型上的 W1A1 BiDM 实现了出色的 22.74 FID，明显优于目前最先进的通用二值化方法（FID 为 59.44 且生成样本无效），并且实现了高达 28.0 倍的出色存储和 52.7 倍的 OP 节省。代码可在此 https URL 获得。
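
To make "1-bit binarization" concrete, the sketch below shows a generic sign-based weight binarizer with a single mean-absolute-value scale, in the style of earlier binary-network work. It is not BiDM's Timestep-friendly Binary Structure or Space Patched Distillation; the helper names `binarize_weights` and `dequantize` are hypothetical.

```python
def binarize_weights(weights):
    """Return (scale, signs) such that scale * signs approximates weights.

    Generic 1-bit scheme (an assumption, not BiDM itself): keep only the sign
    of each weight plus one full-precision scale per weight group.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    # The mean absolute value minimizes the L2 error of scale * sign(w).
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize(scale, signs):
    """Reconstruct approximate full-precision weights from the 1-bit form."""
    return [scale * s for s in signs]
```

Each weight then costs one bit plus a shared scale, which is where the storage savings of binarization come from. Information beyond the sign is lost, and the abstract attributes the resulting degradation to exactly this extreme compression.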

Title: Enhanced 3D Generation by 2D Editing

Authors: Haoran Li, Yuli Tian, Yong Liao, Lin Wang, Yuyang Wang, Peng Yuan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05929
Pdf URL: https://arxiv.org/pdf/2412.05929
Copy Paste: [[2412.05929]] Enhanced 3D Generation by 2D Editing(https://arxiv.org/abs/2412.05929)
Keywords: generation
Abstract: Distilling 3D representations from pretrained 2D diffusion models is essential for 3D creative applications across gaming, film, and interior design. Current SDS-based methods are hindered by inefficient information distillation from diffusion models, which prevents the creation of photorealistic 3D contents. Our research reevaluates the SDS approach by analyzing its fundamental nature as a basic image editing process that commonly results in over-saturation, over-smoothing and lack of rich content due to the poor-quality single-step denoising. To address these limitations, we propose GE3D (3D Generation by Editing). Each iteration of GE3D utilizes a 2D editing framework that combines a noising trajectory to preserve the information of the input image, alongside a text-guided denoising trajectory. We optimize the process by aligning the latents across both trajectories. This approach fully exploits pretrained diffusion models to distill multi-granularity information through multiple denoising steps, resulting in photorealistic 3D outputs. Both theoretical and experimental results confirm the effectiveness of our approach, which not only advances 3D generation technology but also establishes a novel connection between 3D generation and 2D editing. This could potentially inspire further research in the field. Code and demos are released at this https URL.
摘要：从预训练的 2D 扩散模型中提取 3D 表示对于游戏、电影和室内设计中的 3D 创意应用至关重要。当前基于 SDS 的方法受到扩散模型中信息提取效率低下的阻碍，从而阻碍了逼真的 3D 内容的创建。我们的研究通过分析 SDS 方法的基本性质重新评估了它，SDS 方法是一种基本的图像编辑过程，由于单步去噪质量低下，通常会导致过度饱和、过度平滑和缺乏丰富的内容。为了解决这些限制，我们提出了 GE3D（通过编辑生成 3D）。GE3D 的每次迭代都使用一个 2D 编辑框架，该框架结合了噪声轨迹以保留输入图像的信息，以及文本引导的去噪轨迹。我们通过对齐两个轨迹上的潜在信息来优化该过程。该方法充分利用了预训练的扩散模型，通过多个去噪步骤提取多粒度信息，从而产生逼真的 3D 输出。理论和实验结果都证实了我们方法的有效性，它不仅推动了 3D 生成技术的发展，还在 3D 生成和 2D 编辑之间建立了一种新颖的联系。这可能会激发该领域的进一步研究。代码和演示发布在此 https URL 上。

Title: Accelerating Manufacturing Scale-Up from Material Discovery Using Agentic Web Navigation and Retrieval-Augmented AI for Process Engineering Schematics Design

Authors: Sakhinana Sagar Srinivas, Akash Das, Shivam Gupta, Venkataramana Runkana
Subjects: cs.LG, cs.AI, cs.IR, cs.MA
Abstract URL: https://arxiv.org/abs/2412.05937
Pdf URL: https://arxiv.org/pdf/2412.05937
Copy Paste: [[2412.05937]] Accelerating Manufacturing Scale-Up from Material Discovery Using Agentic Web Navigation and Retrieval-Augmented AI for Process Engineering Schematics Design(https://arxiv.org/abs/2412.05937)
Keywords: generation
Abstract: Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (PIDs) are critical tools for industrial process design, control, and safety. However, the generation of precise and regulation-compliant diagrams remains a significant challenge, particularly in scaling breakthroughs from material discovery to industrial production in an era of automation and digitalization. This paper introduces an autonomous agentic framework to address these challenges through a two-stage approach involving knowledge acquisition and generation. The framework integrates specialized sub-agents for retrieving and synthesizing multimodal data from publicly available online sources and constructs ontological knowledge graphs using a Graph Retrieval-Augmented Generation (Graph RAG) paradigm. These capabilities enable the automation of diagram generation and open-domain question answering (ODQA) tasks with high contextual accuracy. Extensive empirical experiments demonstrate the framework's ability to deliver regulation-compliant diagrams with minimal expert intervention, highlighting its practical utility for industrial applications.
摘要：工艺流程图 (PFD) 和工艺与仪表图 (PID) 是工业工艺设计、控制和安全的关键工具。然而，生成精确且符合法规的图表仍然是一项重大挑战，特别是在自动化和数字化时代，从材料发现到工业生产的突破性扩展方面。本文介绍了一个自主代理框架，通过涉及知识获取和生成的两阶段方法应对这些挑战。该框架集成了专门的子代理，用于从公开的在线来源检索和合成多模态数据，并使用图形检索增强生成 (Graph RAG) 范式构建本体知识图。这些功能使图表生成和开放域问答 (ODQA) 任务能够以高上下文准确性自动完成。大量的实证实验证明了该框架能够以最少的专家干预提供符合法规的图表，突出了其在工业应用中的实际效用。

Title: Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Authors: Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.05939
Pdf URL: https://arxiv.org/pdf/2412.05939
Copy Paste: [[2412.05939]] Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models(https://arxiv.org/abs/2412.05939)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieves 3.95% and 2.34% absolute improvements over image-caption data alone on POPE and SEED-Bench. Code, data and models will be available at this https URL.
摘要：多模态大型语言模型 (MLLM) 通过仅在粗粒度概念注释（例如图像标题）上进行预训练，在视觉语言任务中表现出色。我们假设，集成细粒度概念注释（例如对象标签和对象区域）将进一步提高性能，因为两种数据粒度在概念表示的广度和深度方面相互补充。我们引入了一个包含 MLLM 的多模态多粒度概念注释 (MMGiC) 的新数据集。在构建 MMGiC 时，我们探索了不同数据配方对多模态理解和生成的影响。我们的分析表明，在我们的结构化模板和通用 MLLM 框架下，多粒度概念注释相互集成和补充。我们清楚地探索并展示了 MMGiC 帮助 MLLM 更好地定位和学习概念的潜力，在多个粒度上协调视觉和语言。我们通过研究 12 个多模态理解和生成基准上 MMGiC 和图像-字幕数据之间的公平比较和有效协作来进一步验证我们的假设，例如，在 POPE 和 SEED-Bench 上，它们的适当组合比单独的图像-字幕数据实现了 3.95% 和 2.34% 的绝对改进。代码、数据和模型将在此 https URL 上提供。

Title: Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation

Authors: Yiren Song, Shengtao Lou, Xiaokang Liu, Hai Ci, Pei Yang, Jiaming Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05980
Pdf URL: https://arxiv.org/pdf/2412.05980
Copy Paste: [[2412.05980]] Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation(https://arxiv.org/abs/2412.05980)
Keywords: generation, generative
Abstract: Diffusion models have revolutionized generative modeling with their exceptional ability to produce high-fidelity images. However, misuse of such potent tools can lead to the creation of fake news or disturbing content targeting individuals, resulting in significant social harm. In this paper, we introduce Anti-Reference, a novel method that protects images from the threats posed by reference-based generation techniques by adding imperceptible adversarial noise to the images. We propose a unified loss function that enables joint attacks on fine-tuning-based customization methods, non-fine-tuning customization methods, and human-centric driving methods. Based on this loss, we train an Adversarial Noise Encoder to predict the noise or directly optimize the noise using the PGD method. Our method shows certain transfer attack capabilities, effectively challenging both gray-box models and some commercial APIs. Extensive experiments validate the performance of Anti-Reference, establishing a new benchmark in image security.
摘要：扩散模型凭借其生成高保真图像的卓越能力彻底改变了生成建模。然而，滥用这种强大的工具可能会导致针对个人的虚假新闻或令人不安的内容的产生，从而造成严重的社会危害。在本文中，我们介绍了一种新方法反参考，该方法通过向图像添加不可察觉的对抗性噪声来保护图像免受基于参考的生成技术带来的威胁。我们提出了一个统一的损失函数，可以对基于微调的定制方法、非微调的定制方法和以人为本的驾驶方法进行联合攻击。基于这种损失，我们训练了一个对抗性噪声编码器来预测噪声或使用 PGD 方法直接优化噪声。我们的方法表现出一定的转移攻击能力，有效地挑战了灰盒模型和一些商业 API。大量实验验证了反参考的性能，为图像安全建立了新的基准。

Title: Nested Diffusion Models Using Hierarchical Latent Priors

Authors: Xiao Zhang, Ruoxi Jiang, Rebecca Willett, Michael Maire
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.05984
Pdf URL: https://arxiv.org/pdf/2412.05984
Copy Paste: [[2412.05984]] Nested Diffusion Models Using Hierarchical Latent Priors(https://arxiv.org/abs/2412.05984)
Keywords: generation, generative
Abstract: We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.
摘要：我们引入了嵌套扩散模型，这是一种高效而强大的分层生成框架，可大幅提高扩散模型的生成质量，尤其是对于复杂场景的图像。我们的方法采用一系列扩散模型来逐步生成不同语义级别的潜在变量。该系列中的每个模型都以前面的高级模型的输出为条件，最终生成图像。分层潜在变量沿着预定义的语义路径引导生成过程，使我们的方法能够捕获复杂的结构细节，同时显著提高图像质量。为了构建这些潜在变量，我们利用预先训练的视觉编码器，它可以学习强大的语义视觉表示，并通过降维和噪声注入来调节其容量。在多个数据集中，我们的系统在无条件和类/文本条件生成的图像质量方面都表现出显着的增强。此外，我们的无条件生成系统大大优于基线条件系统。这些进步产生的计算开销最小，因为我们层次结构中更抽象的级别使用低维表示。

Title: Enhancing Content Representation for AR Image Quality Assessment Using Knowledge Distillation

Authors: Aymen Sekhri, Seyed Ali Amirshahi, Mohamed-Chaker Larabi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.06003
Pdf URL: https://arxiv.org/pdf/2412.06003
Copy Paste: [[2412.06003]] Enhancing Content Representation for AR Image Quality Assessment Using Knowledge Distillation(https://arxiv.org/abs/2412.06003)
Keywords: quality assessment
Abstract: Augmented Reality (AR) is a major immersive media technology that enriches our perception of reality by overlaying digital content (the foreground) onto physical environments (the background). It has far-reaching applications, from entertainment and gaming to education, healthcare, and industrial training. Nevertheless, challenges such as visual confusion and classical distortions can result in user discomfort when using the technology. Evaluating AR quality of experience becomes essential to measure user satisfaction and engagement, facilitating the refinement necessary for creating immersive and robust experiences. However, the scarcity of data and the distinctive characteristics of AR technology render the development of effective quality assessment metrics challenging. This paper presents a deep learning-based objective metric designed specifically for assessing image quality for AR scenarios. The approach entails four key steps: (1) fine-tuning a self-supervised pre-trained vision transformer to extract prominent features from reference images and distilling this knowledge to improve representations of distorted images, (2) quantifying distortions by computing shift representations, (3) employing cross-attention-based decoders to capture perceptual quality features, and (4) integrating regularization techniques and label smoothing to address the overfitting problem. To validate the proposed approach, we conduct extensive experiments on the ARIQA dataset. The results showcase the superior performance of our proposed approach across all model variants, namely TransformAR, TransformAR-KD, and TransformAR-KD+, in comparison to existing state-of-the-art methods.
摘要：增强现实 (AR) 是一种主要的沉浸式媒体技术，通过将数字内容（前景）叠加到物理环境（背景）上来丰富我们对现实的感知。它具有广泛的应用，从娱乐和游戏到教育、医疗保健和工业培训。然而，视觉混乱和经典扭曲等挑战可能会导致用户在使用该技术时感到不适。评估 AR 体验质量对于衡量用户满意度和参与度至关重要，有助于创造沉浸式和强大体验所需的改进。然而，数据的稀缺性和 AR 技术的独特特性使得开发有效的质量评估指标具有挑战性。本文提出了一种基于深度学习的客观指标，专门用于评估 AR 场景的图像质量。该方法包括四个关键步骤：(1) 对自监督预训练视觉变换器进行微调，以从参考图像中提取突出特征，并提炼这些知识以改进失真图像的表示；(2) 通过计算移位表示来量化失真；(3) 采用基于交叉注意的解码器来捕获感知质量特征；(4) 集成正则化技术和标签平滑来解决过度拟合问题。为了验证所提出的方法，我们在 ARIQA 数据集上进行了广泛的实验。与现有的最先进方法相比，结果显示我们提出的方法在所有模型变体（即 TransformAR、TransformAR-KD 和 TransformAR-KD+）中都表现出色。

Title: Post-hoc Probabilistic Vision-Language Models

Authors: Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06014
Pdf URL: https://arxiv.org/pdf/2412.06014
Copy Paste: [[2412.06014]] Post-hoc Probabilistic Vision-Language Models(https://arxiv.org/abs/2412.06014)
Keywords: generative
Abstract: Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
摘要：视觉语言模型 (VLM)，例如 CLIP 和 SigLIP，在分类、检索和生成任务中取得了显著的成功。为此，VLM 将图像和文本描述确定性地映射到联合潜在空间，在该空间中使用余弦相似度来评估它们的相似度。然而，在下游任务中使用输入的确定性映射无法捕捉因领域转移而产生的概念的不确定性。在这项工作中，我们提出了 VLM 中的事后不确定性估计，它不需要额外的训练。我们的方法利用 VLM 中最后几层的贝叶斯后验近似，并分析量化余弦相似度的不确定性。我们证明了它在主动学习中不确定性量化和支持集选择的有效性。与基线相比，我们获得了改进且经过良好校准的预测不确定性、可解释的不确定性估计和样本高效的主动学习。我们的结果显示了大规模模型在安全关键应用方面的前景。
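As a rough illustration of why a deterministic embedding gives only a point estimate of cosine similarity, the sketch below perturbs two embedding means with Gaussian noise and reports the mean and standard deviation of the resulting similarities. This is a sampling-based stand-in, not the paper's method (the abstract describes an analytic quantification based on a Bayesian posterior over the last layers); the function names, the isotropic noise model, and `sigma` are all assumptions made for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sampled_cosine_stats(mean_a, mean_b, sigma, n_samples=1000, seed=0):
    """Mean and std of cosine similarity under isotropic Gaussian noise.

    Hypothetical helper: `sigma` stands in for embedding uncertainty.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(n_samples):
        a = [m + rng.gauss(0.0, sigma) for m in mean_a]
        b = [m + rng.gauss(0.0, sigma) for m in mean_b]
        sims.append(cosine(a, b))
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    image_emb = [1.0, 0.0, 0.5]
    text_emb = [0.9, 0.1, 0.4]
    print("point estimate:", cosine(image_emb, text_emb))
    print("mean, std under noise:", sampled_cosine_stats(image_emb, text_emb, 0.1))
```

With `sigma = 0` the spread collapses to zero and the sampled mean equals the deterministic similarity, which is the case a purely deterministic VLM corresponds to.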

Title: Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Authors: Hyeonho Jeong, Chun-Hao Paul Huang, Jong Chul Ye, Niloy Mitra, Duygu Ceylan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06016
Pdf URL: https://arxiv.org/pdf/2412.06016
Copy Paste: [[2412.06016]] Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation(https://arxiv.org/abs/2412.06016)
Keywords: generation
Abstract: While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Project page: this http URL
摘要：虽然最近的基础视频生成器可以产生视觉丰富的输出，但它们仍然难以应对外观漂移，即对象在帧之间逐渐退化或不一致地变化，从而破坏了视觉连贯性。我们假设这是因为在特征层面的空间跟踪方面没有明确的监督。我们提出了 Track4Gen，这是一种空间感知视频生成器，它将视频扩散损失与跨帧的点跟踪相结合，为扩散特征提供增强的空间监督。Track4Gen 通过对现有视频生成架构进行最小的更改，将视频生成和点跟踪任务合并到一个网络中。使用稳定视频扩散作为主干，Track4Gen 证明了可以统一视频生成和点跟踪，这通常作为单独的任务处理。我们广泛的评估表明，Track4Gen 有效地减少了外观漂移，从而生成时间稳定且视觉连贯的视频。项目页面：此 http URL

Title: FlexDiT: Dynamic Token Density Control for Diffusion Transformer

Authors: Shuning Chang, Pichao Wang, Jiasheng Tang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06028
Pdf URL: https://arxiv.org/pdf/2412.06028
Copy Paste: [[2412.06028]] FlexDiT: Dynamic Token Density Control for Diffusion Transformer(https://arxiv.org/abs/2412.06028)
Keywords: generation, generative
Abstract: Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands due to both the quadratic complexity of token-based self-attention and the need for extensive sampling steps. While recent research has focused on accelerating sampling, the structural inefficiencies of DiT remain underexplored. We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions to achieve computational efficiency without compromising generation quality. Spatially, FlexDiT employs a three-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, FlexDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between FlexDiT's spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate FlexDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID score on 512$\times$512 ImageNet images, a 56% reduction in FLOPs across video generation datasets including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, and a 69% improvement in inference speed on PixArt-$\alpha$ on the text-to-image generation task with a 0.24 decrease in FID score. FlexDiT provides a scalable solution for high-quality diffusion-based generation compatible with further sampling optimization techniques.
摘要：扩散变换器 (DiT) 提供了令人印象深刻的生成性能，但由于基于 token 的自注意力的二次复杂度和大量采样步骤的需求，面临着令人望而却步的计算需求。虽然最近的研究集中在加速采样上，但 DiT 的结构效率低下仍未得到充分探索。我们提出了 FlexDiT，这是一个在空间和时间维度上动态调整 token 密度的框架，以在不影响生成质量的情况下实现计算效率。在空间上，FlexDiT 采用三段式架构，根据每层的特征要求分配 token 密度：底层的 Poolingformer 用于高效的全局特征提取，中间层的稀疏-密集 token 模块 (SDTM) 用于平衡全局上下文和局部细节，顶层的密集 token 用于细化高频细节。在时间上，FlexDiT 在去噪阶段动态调节 token 密度，随着后续时间步骤中出现更精细的细节，逐步增加 token 数量。 FlexDiT 的空间自适应架构与时间修剪策略之间的协同作用实现了一个统一的框架，该框架可在整个生成过程中平衡效率和保真度。我们的实验证明了 FlexDiT 的有效性，在 DiT-XL 上实现了 FLOP 减少 55% 和推理速度提高 175% ，而 512$\times$512 ImageNet 图像的 FID 分数仅增加 0.09，在包括 FaceForensics、SkyTimelapse、UCF101 和 Taichi-HD 在内的视频生成数据集上实现了 FLOP 减少 56% ，在文本到图像生成任务上 PixArt-$\alpha$ 的推理速度提高了 69% ，FID 分数降低了 0.24。FlexDiT 为基于扩散的高质量生成提供了一种可扩展的解决方案，并与进一步的采样优化技术兼容。

Title: Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training

Authors: Zhenghong Zhou, Jie An, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06029
Pdf URL: https://arxiv.org/pdf/2412.06029
Copy Paste: [[2412.06029]] Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training(https://arxiv.org/abs/2412.06029)
Keywords: generation
Abstract: Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and can disrupt the pre-trained model distribution. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model latent space, ensuring high-quality video generation. Experimental results demonstrate that Latent-Reframe achieves comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets.
摘要：精确的相机姿势控制对于使用扩散模型生成视频至关重要。现有方法需要使用包含配对视频和相机姿势注释的额外数据集进行微调，这既需要大量数据，又需要大量计算，并且会破坏预训练的模型分布。我们引入了 Latent-Reframe，它无需微调即可在预训练的视频扩散模型中实现相机控制。与现有方法不同，Latent-Reframe 在采样阶段运行，在保持效率的同时保留了原始模型分布。我们的方法通过时间感知点云重构视频帧的潜在代码，使其与输入相机轨迹对齐。然后，潜在代码修复和协调会细化模型潜在空间，确保生成高质量的视频。实验结果表明，Latent-Reframe 实现了与基于训练的方法相当或更优异的相机控制精度和视频质量，而无需在其他数据集上进行微调。

Title: GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

Authors: Ashish Goswami, Satyam Kumar Modi, Santhosh Rishi Deshineni, Harman Singh, Prathosh A. P, Parag Singla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06089
Pdf URL: https://arxiv.org/pdf/2412.06089
Copy Paste: [[2412.06089]] GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis(https://arxiv.org/abs/2412.06089)
Keywords: generation
Abstract: Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes and the relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps: (a) Generate: we first generate an image using existing diffusion models; (b) Plan: we make use of Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image, expressed in terms of individual objects and their properties, and produce a sequence of corrective steps in the form of an edit-plan; (c) Edit: we make use of an existing text-guided image editing model to sequentially execute our edit-plan over the generated image to get the desired image, which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training-free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference-time compute for performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models, including DALLE-3 and the latest, SD-3.5-Large. Our approach not only improves the performance of the SOTA models by up to 3 points, it also reduces the performance gap between weaker and stronger models. this https URL
摘要：文本到图像 (T2I) 生成在扩散模型方面取得了重大进展，能够根据文本提示生成照片般逼真的图像。尽管取得了这些进展，现有方法在遵循复杂文本提示方面仍然面临挑战，尤其是那些需要组合和多步骤推理的提示。面对如此复杂的指令，SOTA 模型经常会在忠实地建模对象属性及其之间的关系时犯错误。在这项工作中，我们提出了 T2I 合成的替代范式，将复杂的多步骤生成任务分解为三个步骤，(a) 生成：我们首先使用现有的扩散模型生成图像 (b) 计划：我们利用多模态 LLM (MLLM) 来识别生成的图像中以单个对象及其属性表示的错误，并以编辑计划的形式生成所需的一系列纠正步骤。(c) 编辑：我们利用现有的文本引导图像编辑模型对生成的图像按顺序执行我们的编辑计划，以获得忠实于原始指令的所需图像。我们的方法的优势在于它本质上是模块化的，无需训练，并且可以应用于任何图像生成和编辑模型的组合。作为额外的贡献，我们还开发了一个能够进行组合编辑的模型，这进一步有助于提高我们提出的方法的整体准确性。我们的方法灵活地在推理时间计算和组合文本提示的性能之间进行权衡。我们对 3 个基准和 10 个 T2I 模型进行了广泛的实验评估，包括 DALLE-3 和最新的 SD-3.5-Large。我们的方法不仅将 SOTA 模型的性能提高了 3 个百分点，还缩小了较弱和较强模型之间的性能差距。$\href{this https URL}{this https URL}$
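The Generate / Plan / Edit decomposition described in the abstract can be sketched as a simple control loop. This is only a minimal sketch of the control flow: the `generate`, `plan`, and `edit` callables are hypothetical placeholders for a diffusion model, an MLLM planner, and a text-guided editor, and the toy stand-ins below (an "image" is just a list of object names) are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List


def generate_plan_edit(
    prompt: str,
    generate: Callable[[str], List[str]],
    plan: Callable[[str, List[str]], List[str]],
    edit: Callable[[List[str], str], List[str]],
) -> List[str]:
    """Generate an image, plan corrective edits, then apply them in order."""
    image = generate(prompt)        # (a) Generate
    steps = plan(prompt, image)     # (b) Plan: list of corrective steps
    for step in steps:              # (c) Edit: execute the plan sequentially
        image = edit(image, step)
    return image


# Toy stand-ins (hypothetical): the generator only renders the first object.
def toy_generate(prompt: str) -> List[str]:
    return prompt.split(", ")[:1]


def toy_plan(prompt: str, image: List[str]) -> List[str]:
    # One corrective step per object that is missing from the image.
    return [obj for obj in prompt.split(", ") if obj not in image]


def toy_edit(image: List[str], step: str) -> List[str]:
    return image + [step]


if __name__ == "__main__":
    result = generate_plan_edit(
        "red cube, blue sphere, green cone", toy_generate, toy_plan, toy_edit
    )
    print(result)
```

Because the three stages are passed in as callables, any generator and editor pair can be swapped in, which mirrors the modularity the abstract emphasizes.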

Title: PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems

Authors: Ali Menati, Fatemeh Doudi, Dileep Kalathil, Le Xie
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2412.06112
Pdf URL: https://arxiv.org/pdf/2412.06112
Copy Paste: [[2412.06112]] PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems(https://arxiv.org/abs/2412.06112)
Keywords: generation
Abstract: The electricity sector is undergoing substantial transformations due to the rising electrification of demand, enhanced integration of renewable energy resources, and the emergence of new technologies. These changes are rendering the electric grid more volatile and unpredictable, making it difficult to maintain reliable operations. In order to address these issues, advanced time series prediction models are needed for closing the gap between the forecasted and actual grid outcomes. In this paper, we introduce a multivariate time series prediction model that combines traditional state space models with deep learning methods to simultaneously capture and predict the underlying dynamics of multiple time series. Additionally, we design a time series processing module that incorporates high-resolution external forecasts into sequence-to-sequence prediction models, achieving this with negligible increases in size and no loss of accuracy. We also release an extended dataset spanning five years of load, electricity price, ancillary service price, and renewable generation. To complement this dataset, we provide an open-access toolbox that includes our proposed model, the dataset itself, and several state-of-the-art prediction models, thereby creating a unified framework for benchmarking advanced machine learning approaches. Our findings indicate that the proposed model outperforms existing models across various prediction tasks, improving state-of-the-art prediction error by an average of 7% and decreasing model parameters by 43%.
摘要：由于电力需求不断增长、可再生能源资源整合不断加强以及新技术的出现，电力行业正在经历重大变革。这些变化使电网更加不稳定和不可预测，难以维持可靠的运行。为了解决这些问题，需要先进的时间序列预测模型来缩小预测结果与实际电网结果之间的差距。在本文中，我们介绍了一种多元时间序列预测模型，该模型将传统的状态空间模型与深度学习方法相结合，以同时捕获和预测多个时间序列的潜在动态。此外，我们设计了一个时间序列处理模块，将高分辨率外部预测纳入序列到序列预测模型，在大小几乎不增加且准确度不降低的情况下实现这一点。我们还发布了一个扩展数据集，涵盖了五年的负荷、电价、辅助服务价格和可再生能源发电。为了补充这个数据集，我们提供了一个开放访问工具箱，其中包括我们提出的模型、数据集本身和几个最先进的预测模型，从而为基准测试先进的机器学习方法创建了一个统一的框架。我们的研究结果表明，所提出的模型在各种预测任务中的表现均优于现有模型，将最先进的预测误差平均提高了 7%，并将模型参数降低了 43%。

Title: SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Authors: Qiyu Liao, Xin Yuan, Min Xu, Dadong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06138
Pdf URL: https://arxiv.org/pdf/2412.06138
Copy Paste: [[2412.06138]] SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation(https://arxiv.org/abs/2412.06138)
Keywords: generative
Abstract: In Fine-Grained Visual Classification (FGVC), distinguishing highly similar subcategories remains a formidable challenge, often necessitating datasets with extensive variability. The acquisition and annotation of such FGVC datasets are notably difficult and costly, demanding specialized knowledge to identify subtle distinctions among closely related categories. Our study introduces a novel approach employing the Sequence Latent Diffusion Model (SLDM) for augmenting FGVC datasets, called Sequence Generative Image Augmentation (SGIA). Our method features a unique Bridging Transfer Learning (BTL) process, designed to minimize the domain gap between real and synthetically augmented data. This approach notably surpasses existing methods in generating more realistic image samples, providing a diverse range of pose transformations that extend beyond the traditional rigid transformations and style changes in generative augmentation. We demonstrate the effectiveness of our augmented dataset with substantial improvements in FGVC tasks on various datasets, models, and training strategies, especially in few-shot learning scenarios. Our method outperforms conventional image augmentation techniques in benchmark tests on three FGVC datasets, showcasing superior realism, variability, and representational quality. Our work sets a new benchmark and outperforms the previous state-of-the-art models in classification accuracy by 0.5% for the CUB-200-2011 dataset and advances the application of generative models in FGVC data augmentation.
摘要：在细粒度视觉分类 (FGVC) 中，区分高度相似的子类别仍然是一项艰巨的挑战，通常需要具有广泛可变性的数据集。此类 FGVC 数据集的获取和注释非常困难且成本高昂，需要专业知识来识别密切相关类别之间的细微差别。我们的研究引入了一种采用序列潜在扩散模型 (SLDM) 来增强 FGVC 数据集的新方法，称为序列生成图像增强 (SGIA)。我们的方法具有独特的桥接迁移学习 (BTL) 过程，旨在最大限度地缩小真实数据和合成增强数据之间的领域差距。这种方法在生成更逼真的图像样本方面明显超越了现有方法，提供了多种姿势变换，超越了生成增强中传统的刚性变换和风格变化。我们证明了增强数据集的有效性，在各种数据集、模型和训练策略上的 FGVC 任务中取得了显着改进，尤其是在小样本学习场景中。我们的方法在三个 FGVC 数据集的基准测试中优于传统的图像增强技术，展现出卓越的真实感、可变性和表现质量。我们的工作树立了新的基准，在 CUB-200-2011 数据集的分类准确率上比之前最先进的模型高出 0.5%，并推动了生成模型在 FGVC 数据增强中的应用。

Title: MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Authors: Kangyu Zhu, Peng Xia, Yun Li, Hongtu Zhu, Sheng Wang, Huaxiu Yao
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06141
Pdf URL: https://arxiv.org/pdf/2412.06141
Copy Paste: [[2412.06141]] MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization(https://arxiv.org/abs/2412.06141)
Keywords: generation
Abstract: The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately mitigated clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by averaging 14.2% and 51.7% across the Med-VQA and report generation tasks. Our code is available at this https URL.
摘要：大型视觉语言模型 (LVLM) 的进步推动了它们在医学领域的应用。然而，医学 LVLM (Med-LVLM) 由于模态错位而面临事实性挑战，其中模型优先考虑文本知识而不是视觉输入，导致幻觉与医学图像中的信息相矛盾。以前通过偏好优化来增强 Med-LVLM 中的模态对齐的尝试不足以缓解偏好数据中的临床相关性，使这些样本易于区分并降低了对齐效果。为了应对这一挑战，我们提出了 MMedPO，这是一种新颖的多模态医学偏好优化方法，它考虑偏好样本的临床相关性以增强 Med-LVLM 对齐。MMedPO 通过引入两种类型的偏好来整理多模态偏好数据：(1) 通过目标 Med-LVLM 或 GPT-4o 注入的似是而非的幻觉，产生医学上不准确的反应，以及 (2) 通过局部病变噪声实现的病变区域忽略，破坏了对关键区域的视觉理解。然后，我们根据来自多个 Med-LLM 和视觉工具的分数计算每个样本的临床相关性，并将这些分数作为权重整合到偏好优化过程中，从而实现有效对齐。我们的实验表明，MMedPO 显著提高了 Med-LVLM 中的事实准确性，在 Med-VQA 和报告生成任务中平均提高了 14.2% 和 51.7%，与现有的偏好优化方法相比取得了显着的改进。我们的代码可在此 https URL 中找到。

Title: AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations

Authors: Zonglin Meng, Yun Zhang, Zhaoliang Zheng, Zhihao Zhao, Jiaqi Ma
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.06142
Pdf URL: https://arxiv.org/pdf/2412.06142
Copy Paste: [[2412.06142]] AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations(https://arxiv.org/abs/2412.06142)
Keywords: generation
Abstract: Cooperative perception has attracted wide attention given its capability to leverage shared information across connected automated vehicles (CAVs) and smart infrastructures to address sensing occlusion and range limitation issues. However, existing research overlooks the fragile multi-sensor correlations in multi-agent settings, as the heterogeneous agent sensor measurements are highly susceptible to environmental factors, leading to weakened inter-agent sensor interactions. The varying operational conditions and other real-world factors inevitably introduce multifactorial noise and consequently lead to multi-sensor misalignment, making the deployment of multi-agent multi-modality perception particularly challenging in the real world. In this paper, we propose AgentAlign, a real-world heterogeneous agent cross-modality feature alignment framework, to effectively address these multi-modality misalignment issues. Our method introduces a cross-modality feature alignment space (CFAS) and heterogeneous agent feature alignment (HAFA) mechanism to harmonize multi-modality features across various agents dynamically. Additionally, we present a novel V2XSet-Noise dataset that simulates realistic sensor imperfections under diverse environmental conditions, facilitating a systematic evaluation of our approach's robustness. Extensive experiments on the V2X-Real and V2XSet-Noise benchmarks demonstrate that our framework achieves state-of-the-art performance, underscoring its potential for real-world applications in cooperative autonomous driving. The controllable V2XSet-Noise dataset and generation pipeline will be released in the future.
摘要：协作感知因其能够利用网联自动驾驶汽车 (CAV) 和智能基础设施之间的共享信息来解决感知遮挡和范围限制问题而受到广泛关注。然而，现有研究忽视了多智能体环境中脆弱的多传感器相关性，因为异构智能体传感器测量值极易受到环境因素的影响，从而导致智能体间传感器交互减弱。变化的操作条件和其他现实因素不可避免地会引入多因素噪声，并进而导致多传感器错位，使多智能体多模态感知在现实世界中的部署尤为具有挑战性。在本文中，我们提出了 AgentAlign，这是一个现实世界的异构智能体跨模态特征对齐框架，以有效解决这些多模态错位问题。我们的方法引入了跨模态特征对齐空间 (CFAS) 和异构智能体特征对齐 (HAFA) 机制，以动态协调各个智能体之间的多模态特征。此外，我们还提出了一种新颖的 V2XSet-noise 数据集，该数据集模拟了各种环境条件下真实的传感器缺陷，有助于系统地评估我们方法的稳健性。在 V2X-Real 和 V2XSet-Noise 基准上进行的大量实验表明，我们的框架实现了最先进的性能，凸显了其在协作自动驾驶中的实际应用潜力。可控的 V2XSet-Noise 数据集和生成管道将在未来发布。

Title: Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters

Authors: Yuan Wang, Ouxiang Li, Tingting Mu, Yanbin Hao, Kuien Liu, Xiang Wang, Xiangnan He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06143
Pdf URL: https://arxiv.org/pdf/2412.06143
Copy Paste: [[2412.06143]] Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters(https://arxiv.org/abs/2412.06143)
Keywords: generation
Abstract: The success of text-to-image generation enabled by diffusion models has imposed an urgent need to erase unwanted concepts, e.g., copyrighted, offensive, and unsafe ones, from the pre-trained models in a precise, timely, and low-cost manner. The twofold demand of concept erasure requires a precise removal of the target concept during generation (i.e., erasure efficacy), with minimal impact on non-target content generation (i.e., prior preservation). Existing methods are either computationally costly or face challenges in maintaining an effective balance between erasure efficacy and prior preservation. To improve on this, we propose a precise, fast, and low-cost concept erasure method, called Adaptive Value Decomposer (AdaVD), which is training-free. This method is grounded in a classical linear algebraic orthogonal complement operation, implemented in the value space of each cross-attention layer within the UNet of diffusion models. An effective shift factor is designed to adaptively navigate the erasure strength, enhancing prior preservation without sacrificing erasure efficacy. Extensive experimental results show that the proposed AdaVD is effective at both single and multiple concept erasure, showing a 2- to 10-fold improvement in prior preservation as compared to the second best, while achieving the best or near-best erasure efficacy when compared with both training-based and training-free state-of-the-art methods. AdaVD supports a series of diffusion models and downstream image generation tasks. The code is available on the project page: this https URL
摘要：扩散模型成功实现了文本到图像的生成，迫切需要以精确、及时和低成本的方式从预训练模型中删除不需要的概念，例如受版权保护的概念、冒犯性和不安全的概念。概念擦除的双重需求要求在生成过程中精确删除目标概念（即擦除效果），同时尽量减少对非目标内容生成的影响（即先前保存）。现有方法要么计算成本高昂，要么在保持擦除效果和先前保存之间的有效平衡方面面临挑战。为了改进，我们提出了一种精确、快速且低成本的概念擦除方法，称为自适应值分解器 (AdaVD)，无需训练。该方法基于经典的线性代数正交补运算，在扩散模型 UNet 中每个交叉注意层的值空间中实现。设计了一个有效的移位因子来自适应地导航擦除强度，在不牺牲擦除效果的情况下增强先前保存。大量实验结果表明，所提出的 AdaVD 在单概念和多概念擦除方面均有效，与第二佳相比，先前保存效果提高了 2 到 10 倍，同时与基于训练和无训练的现有技术相比，实现了最佳或接近最佳的擦除效果。AdaVD 支持一系列扩散模型和下游图像生成任务，代码可在项目页面上找到：此 https URL
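The "orthogonal complement operation" named in the abstract is standard linear algebra: subtracting from a vector its projection onto a target direction leaves the part orthogonal to that direction. The sketch below shows only this textbook operation on plain Python lists. It is not AdaVD itself: the abstract's adaptive shift factor and the cross-attention value space are not modeled, and the names `project_out`, `value`, and `target` are assumptions for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(value, target):
    """Return the component of `value` orthogonal to `target`.

    Computes value - (<value, target> / <target, target>) * target.
    A zero `target` has no direction to remove, so `value` is returned as-is.
    """
    denom = dot(target, target)
    if denom == 0.0:
        return list(value)
    scale = dot(value, target) / denom
    return [v - scale * t for v, t in zip(value, target)]


if __name__ == "__main__":
    value = [3.0, 4.0, 0.0]    # hypothetical value vector
    target = [1.0, 0.0, 0.0]   # hypothetical target-concept direction
    erased = project_out(value, target)
    print(erased)               # [0.0, 4.0, 0.0]
    print(dot(erased, target))  # 0.0: nothing left along the target direction
```

The second print confirms the defining property: the result has zero inner product with the target direction, while the component orthogonal to it is unchanged.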

Title: An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers

Authors: Xueluan Gong, Bowei Tian, Meng Xue, Yuan Wu, Yanjiao Chen, Qian Wang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.06149
Pdf URL: https://arxiv.org/pdf/2412.06149
Copy Paste: [[2412.06149]] An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers(https://arxiv.org/abs/2412.06149)
Keywords: generation
Abstract: Recent studies have revealed the vulnerability of Deep Neural Network (DNN) models to backdoor attacks. However, existing backdoor attacks arbitrarily set the trigger mask or use a randomly selected trigger, which restricts the effectiveness and robustness of the generated backdoor triggers. In this paper, we propose a novel attention-based mask generation methodology that searches for the optimal trigger shape and location. We also introduce a Quality-of-Experience (QoE) term into the loss function and carefully adjust the transparency value of the trigger in order to make the backdoored samples more natural. To further improve the prediction accuracy of the victim model, we propose an alternating retraining algorithm in the backdoor injection process. The victim model is retrained with mixed poisoned datasets in even iterations and with only benign samples in odd iterations. Besides, we launch the backdoor attack under a co-optimized attack framework that alternately optimizes the backdoor trigger and backdoored model to further improve the attack performance. Apart from DNN models, we also extend our proposed attack method against vision transformers. We evaluate our proposed method with extensive experiments on VGG-Flower, CIFAR-10, GTSRB, CIFAR-100, and ImageNette datasets. It is shown that we can increase the attack success rate by as much as 82% over baselines when the poison ratio is low and achieve a high QoE of the backdoored samples. Our proposed backdoor attack framework also showcases robustness against state-of-the-art backdoor defenses.
摘要：最近的研究揭示了深度神经网络 (DNN) 模型易受后门攻击的弱点。然而，现有的后门攻击任意设置触发器掩码或使用随机选择的触发器，这限制了生成的后门触发器的有效性和鲁棒性。在本文中，我们提出了一种基于注意力机制的新型掩码生成方法，该方法可搜索最佳触发器形状和位置。我们还在损失函数中引入了体验质量 (QoE) 项，并仔细调整触发器的透明度值，以使后门样本更加自然。为了进一步提高受害者模型的预测准确性，我们在后门注入过程中提出了一种交替再训练算法。在偶数次迭代中使用混合毒害数据集对受害者模型进行再训练，在奇数次迭代中使用仅良性样本进行再训练。此外，我们在共同优化的攻击框架下发起后门攻击，该框架交替优化后门触发器和后门模型，以进一步提高攻击性能。除了 DNN 模型外，我们还扩展了针对视觉变压器的攻击方法。我们通过对 VGG-Flower、CIFAR-10、GTSRB、CIFAR-100 和 ImageNette 数据集进行大量实验来评估我们提出的方法。结果表明，当毒药率较低时，我们可以将攻击成功率提高至基线的 82% 以上，并实现后门样本的高 QoE。我们提出的后门攻击框架还展示了针对最先进后门防御的稳健性。

Title: ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance

Authors: Yuming Li, Peidong Jia, Daiwei Hong, Yueru Jia, Qi She, Rui Zhao, Ming Lu, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06163
Pdf URL: https://arxiv.org/pdf/2412.06163
Copy Paste: [[2412.06163]] ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance(https://arxiv.org/abs/2412.06163)
Keywords: generation
Abstract: Training-free high-resolution (HR) image generation has garnered significant attention due to the high costs of training large diffusion models. Most existing methods begin by reconstructing the overall structure and then proceed to refine the local details. Despite their advancements, they still face issues with repetitive patterns in HR image generation. Besides, HR generation with diffusion models incurs significant computational costs. Thus, parallel generation is essential for interactive applications. To solve the above limitations, we introduce a novel method named ASGDiffusion for parallel HR generation with Asynchronous Structure Guidance (ASG) using pre-trained diffusion models. To solve the pattern repetition problem of HR image generation, ASGDiffusion leverages the low-resolution (LR) noise weighted by the attention mask as the structure guidance for the denoising step to ensure semantic consistency. The proposed structure guidance can significantly alleviate the pattern repetition problem. To enable parallel generation, we further propose a parallelism strategy, which calculates the patch noises and structure guidance asynchronously. By leveraging multi-GPU parallel acceleration, we significantly accelerate generation speed and reduce memory usage per GPU. Extensive experiments demonstrate that our method effectively and efficiently addresses common issues like pattern repetition and achieves state-of-the-art HR generation.
摘要：由于训练大型扩散模型的成本高昂，无需训练的高分辨率 (HR) 图像生成引起了广泛关注。大多数现有方法首先重建整体结构，然后继续细化局部细节。尽管它们取得了进步，但它们仍然面临 HR 图像生成中重复模式的问题。此外，使用扩散模型的 HR 生成会产生大量的计算成本。因此，并行生成对于交互式应用程序至关重要。为了解决上述限制，我们引入了一种名为 ASGDiffusion 的新方法，用于使用预训练扩散模型进行异步结构引导 (ASG) 并行 HR 生成。为了解决 HR 图像生成的模式重复问题，ASGDiffusion 利用注意力掩码加权的低分辨率 (LR) 噪声作为去噪步骤的结构指导，以确保语义一致性。所提出的结构指导可以显著缓解模式重复问题。为了实现并行生成，我们进一步提出了一种并行策略，该策略异步计算补丁噪声和结构指导。通过利用多 GPU 并行加速，我们显著加快了生成速度并减少了每个 GPU 的内存使用量。大量实验表明，我们的方法有效且高效地解决了模式重复等常见问题，并实现了最先进的 HR 生成。

Title: AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

Authors: Pranjal Aggarwal, Bryan Parno, Sean Welleck
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06176
Pdf URL: https://arxiv.org/pdf/2412.06176
Copy Paste: [[2412.06176]] AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement(https://arxiv.org/abs/2412.06176)
Keywords: generation
Abstract: Automated code generation with large language models has gained significant traction, but there remains no guarantee on the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To tackle this challenge, we introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language and leveraging feedback from a verifier. AlphaVerus operates in three phases: exploration of candidate translations, Treefinement -- a novel tree search algorithm for program refinement using verifier feedback, and filtering misaligned specifications and programs to prevent reward hacking. Through this iterative process, AlphaVerus enables a LLaMA-3.1-70B model to generate verified code without human intervention or model finetuning. AlphaVerus shows an ability to generate formally verified solutions for HumanEval and MBPP, laying the groundwork for truly trustworthy code-generation agents.
摘要：使用大型语言模型进行自动代码生成已获得广泛关注，但仍然无法保证生成代码的正确性。我们旨在使用形式化验证来提供生成的代码正确的数学保证。但是，使用 LLM 生成形式化验证的代码受到训练数据稀缺和形式化证明复杂性的阻碍。为了应对这一挑战，我们推出了 AlphaVerus，这是一个自我改进的框架，它通过迭代地从资源更丰富的语言翻译程序并利用验证者的反馈来引导形式化验证的代码生成。AlphaVerus 分为三个阶段：探索候选翻译、Treefinement（一种使用验证者反馈进行程序细化的新型树搜索算法）以及过滤不一致的规范和程序以防止奖励黑客攻击。通过这个迭代过程，AlphaVerus 使 LLaMA-3.1-70B 模型能够生成经过验证的代码，而无需人工干预或模型微调。 AlphaVerus 展示了为 HumanEval 和 MBPP 生成形式验证解决方案的能力，为真正值得信赖的代码生成代理奠定了基础。
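
A toy sketch of the general idea behind tree search driven by verifier feedback. This is not the paper's Treefinement algorithm. The "program" is a tuple of integers, `verify` is a hypothetical stand-in for a real verifier such as Verus, and `refine` is a hypothetical stand-in for an LLM proposing fixes; all names are illustrative assumptions. The search expands the candidate with the fewest reported errors first.

```python
import heapq

def verify(candidate, spec):
    """Toy verifier (assumption): report positions where candidate violates spec."""
    return [i for i, (c, s) in enumerate(zip(candidate, spec)) if c != s]

def refine(candidate, errors, alphabet):
    """Toy refinement step (assumption): rewrite the first reported error position."""
    i = errors[0]
    for value in alphabet:
        if value != candidate[i]:
            child = list(candidate)
            child[i] = value
            yield tuple(child)

def tree_search(root, spec, alphabet, max_expansions=100):
    """Best-first search: always expand the node with the fewest verifier errors."""
    frontier = [(len(verify(root, spec)), root)]
    seen = {root}
    for _ in range(max_expansions):
        if not frontier:
            break
        n_err, node = heapq.heappop(frontier)
        if n_err == 0:
            return node  # the verifier accepts this candidate
        for child in refine(node, verify(node, spec), alphabet):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (len(verify(child, spec)), child))
    return None  # budget exhausted without a verified candidate

solution = tree_search(root=(0, 0, 0), spec=(1, 2, 1), alphabet=(0, 1, 2))
```

Scoring nodes by error count is only one possible heuristic. It is chosen here because it makes the role of verifier feedback in ordering the search explicit.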

Title: Towards Long Video Understanding via Fine-detailed Video Story Generation

Authors: Zeng You, Zhiquan Wen, Yaofo Chen, Xin Li, Runhao Zeng, Yaowei Wang, Mingkui Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06182
Pdf URL: https://arxiv.org/pdf/2412.06182
Copy Paste: [[2412.06182]] Towards Long Video Understanding via Fine-detailed Video Story Generation(https://arxiv.org/abs/2412.06182)
Keywords: generation
Abstract: Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information of the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method across eight datasets spanning three tasks. The performance demonstrates the effectiveness and versatility of our method.
摘要：长视频理解已成为计算机视觉领域的一项关键任务，推动了从监控到内容检索等众多应用的进步。现有的视频理解方法在处理长视频理解时面临两个挑战：复杂的长上下文关系建模和冗余干扰。为了应对这些挑战，我们引入了精细视频故事生成 (FDVS)，将长视频解释为详细的文本表示。具体来说，为了实现长时间内容的细粒度建模，我们提出了一种自下而上的视频解释机制，该机制逐步将视频内容从剪辑解释为视频。为了避免视频中冗余信息的干扰，我们引入了一种语义冗余减少机制，可在视觉和文本层面消除冗余。我们的方法将长视频转换为包含视频多粒度信息的分层文本表示。有了这些表示，FDVS 无需任何微调即可应用于各种任务。我们在涵盖三个任务的八个数据集上评估了所提出的方法。该性能证明了我们方法的有效性和多功能性。

Title: You KAN Do It in a Single Shot: Plug-and-Play Methods with Single-Instance Priors

Authors: Yanqi Cheng, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06204
Pdf URL: https://arxiv.org/pdf/2412.06204
Copy Paste: [[2412.06204]] You KAN Do It in a Single Shot: Plug-and-Play Methods with Single-Instance Priors(https://arxiv.org/abs/2412.06204)
Keywords: super-resolution
Abstract: Plug-and-Play (PnP) methods have become a central approach for solving inverse problems, with denoisers serving as regularising priors that guide optimisation towards a clean solution. In this work, we introduce KAN-PnP, an optimisation framework that incorporates Kolmogorov-Arnold Networks (KANs) as denoisers within the PnP paradigm. KAN-PnP is specifically designed to solve inverse problems with single-instance priors, where only a single noisy observation is available, eliminating the need for the large datasets typically required by traditional denoising methods. We show that KANs, based on the Kolmogorov-Arnold representation theorem, serve effectively as priors in such settings, providing a robust approach to denoising. We prove that the KAN denoiser is Lipschitz continuous, ensuring stability and convergence in optimisation algorithms like PnP-ADMM, even in the context of single-shot learning. Additionally, we provide theoretical guarantees for KAN-PnP, demonstrating its convergence under key conditions: the convexity of the data fidelity term, Lipschitz continuity of the denoiser, and boundedness of the regularisation functional. These conditions are crucial for stable and reliable optimisation. Our experimental results on super-resolution and joint optimisation show that KAN-PnP outperforms existing methods, delivering superior performance in single-shot learning with minimal data. The method exhibits strong convergence properties, achieving high accuracy with fewer iterations.
摘要：即插即用 (PnP) 方法的使用已成为解决逆问题的主要方法，其中降噪器充当正则化先验，引导优化获得干净的解决方案。在这项工作中，我们引入了 KAN-PnP，这是一个优化框架，它将 Kolmogorov-Arnold 网络 (KAN) 整合为即插即用 (PnP) 范式中的降噪器。KAN-PnP 专门用于解决具有单实例先验的逆问题，其中只有一个噪声观察可用，从而无需传统降噪方法通常需要的大型数据集。我们表明，基于 Kolmogorov-Arnold 表示定理的 KAN 可以有效地充当此类设置中的先验，从而提供一种强大的降噪方法。我们证明 KAN 降噪器是 Lipschitz 连续的，即使在单次学习的背景下，也能确保 PnP-ADMM 等优化算法的稳定性和收敛性。此外，我们为 KAN-PnP 提供了理论保证，证明了其在关键条件下的收敛性：数据保真度项的凸性、降噪器的 Lipschitz 连续性和正则化函数的有界性。这些条件对于稳定可靠的优化至关重要。我们的实验结果表明，在超分辨率和联合优化方面，KAN-PnP 优于现有方法，在数据最少的单次学习中表现出色。该方法表现出强大的收敛特性，以更少的迭代次数实现高精度。
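
A minimal sketch of the PnP-ADMM iteration mentioned in the abstract, under simplifying assumptions. The inverse problem is plain 1D denoising with data fidelity 0.5 * ||x - y||^2 (convex). The denoiser is a fixed 3-tap smoothing filter, which is linear and non-expansive (Lipschitz constant at most 1), standing in for the KAN denoiser, which is not reproduced here. The names `smooth` and `pnp_admm` and the choices `rho=1.0`, `iters=50` are illustrative, not taken from the paper.

```python
def smooth(v):
    """Stand-in denoiser (assumption): (0.25, 0.5, 0.25) filter with edge replication."""
    n = len(v)
    return [0.25 * v[max(i - 1, 0)] + 0.5 * v[i] + 0.25 * v[min(i + 1, n - 1)]
            for i in range(n)]

def pnp_admm(y, denoiser, rho=1.0, iters=50):
    """PnP-ADMM for the data term 0.5 * ||x - y||^2 with a plug-in denoiser."""
    x, z, u = list(y), list(y), [0.0] * len(y)
    for _ in range(iters):
        # x-update: closed-form proximal step of the quadratic data fidelity term
        x = [(yi + rho * (zi - ui)) / (1.0 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the denoiser replaces the proximal operator of the regulariser
        z = denoiser([xi + ui for xi, ui in zip(x, u)])
        # u-update: scaled dual variable accumulates the constraint residual x - z
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return x

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

clean = [1.0] * 20
noisy = [1.0 + 0.5 * (-1) ** i for i in range(20)]  # deterministic alternating noise
restored = pnp_admm(noisy, smooth)
```

The three updates keep the data term and the prior separate, which is what lets a different denoiser be swapped in without changing the rest of the loop.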

Title: Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.06209
Pdf URL: https://arxiv.org/pdf/2412.06209
Copy Paste: [[2412.06209]] Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment(https://arxiv.org/abs/2412.06209)
Keywords: generation
Abstract: How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into the pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets compared to previous work and demonstrates control over the generation process through simple manipulations to the input waveform or latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, showing its generalizability by integrating various model architectures and different types of audio-visual data.
摘要：音频如何描述我们周围的世界？在这项工作中，我们提出了一种从各种野外声音生成视觉场景图像的方法。由于听觉和视觉信号之间存在巨大的信息差距，这种跨模态生成任务具有挑战性。我们通过设计一个模型来解决这一挑战，该模型通过用视觉信息丰富音频特征并将其转换为视觉潜在空间来对齐视听模态。然后将这些特征输入到预先训练的图像生成器中以生成图像。为了提高图像质量，我们使用声源定位来选择具有强跨模态相关性的视听对。与以前的工作相比，我们的方法在 VEGAS 和 VGGSound 数据集上取得了更好的结果，并通过对输入波形或潜在空间的简单操作展示了对生成过程的控制。此外，我们分析了学习到的嵌入空间的几何属性，并证明了我们的学习方法有效地对齐了视听信号以进行跨模态生成。基于此分析，我们表明我们的方法与特定的设计选择无关，通过集成各种模型架构和不同类型的视听数据展示了其通用性。

Title: MSCrackMamba: Leveraging Vision Mamba for Crack Detection in Fused Multispectral Imagery

Authors: Qinfeng Zhu, Yuan Fang, Lei Fan
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.06211
Pdf URL: https://arxiv.org/pdf/2412.06211
Copy Paste: [[2412.06211]] MSCrackMamba: Leveraging Vision Mamba for Crack Detection in Fused Multispectral Imagery(https://arxiv.org/abs/2412.06211)
Keywords: super-resolution
Abstract: Crack detection is a critical task in structural health monitoring, aimed at assessing the structural integrity of bridges, buildings, and roads to prevent potential failures. Vision-based crack detection has become the mainstream approach due to its ease of implementation and effectiveness. Fusing infrared (IR) channels with red, green and blue (RGB) channels can enhance feature representation and thus improve crack detection. However, IR and RGB channels often differ in resolution. To align them, higher-resolution RGB images typically need to be downsampled to match the IR image resolution, which leads to the loss of fine details. Moreover, crack detection performance is restricted by the limited receptive fields and high computational complexity of traditional image segmentation networks. Inspired by the recently proposed Mamba neural architecture, this study introduces a two-stage paradigm called MSCrackMamba, which leverages Vision Mamba along with a super-resolution network to address these challenges. Specifically, to align IR and RGB channels, we first apply super-resolution to IR channels to match the resolution of RGB channels for data fusion. Vision Mamba is then adopted as the backbone network, while UperNet is employed as the decoder for crack detection. Our approach is validated on the large-scale Crack Detection dataset Crack900, demonstrating an improvement of 3.55% in mIoU compared to the best-performing baseline methods.
摘要：裂缝检测是结构健康监测中的一项关键任务，旨在评估桥梁、建筑物和道路的结构完整性，以防止潜在的故障。基于视觉的裂缝检测由于其易于实施和有效性而成为主流方法。将红外 (IR) 通道与红、绿和蓝 (RGB) 通道融合可以增强特征表示，从而改善裂缝检测。然而，IR 和 RGB 通道的分辨率通常不同。为了对齐它们，通常需要对高分辨率 RGB 图像进行下采样以匹配 IR 图像分辨率，这会导致精细细节的丢失。此外，裂缝检测性能受到传统图像分割网络有限的接受场和高计算复杂度的限制。受最近提出的 Mamba 神经架构的启发，本研究引入了一种称为 MSCrackMamba 的两阶段范式，它利用 Vision Mamba 和超分辨率网络来解决这些挑战。具体来说，为了对齐 IR 和 RGB 通道，我们首先将超分辨率应用于 IR 通道以匹配 RGB 通道的分辨率以进行数据融合。然后采用 Vision Mamba 作为主干网络，同时采用 UperNet 作为裂缝检测的解码器。我们的方法在大型裂缝检测数据集 Crack900 上进行了验证，与表现最佳的基线方法相比，mIoU 提高了 3.55%。
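
For reference, a small sketch of how mean Intersection-over-Union (mIoU), the metric quoted above, is commonly computed for segmentation labels. It assumes flattened per-pixel class labels and averages IoU over the classes that appear in either the prediction or the ground truth. The abstract does not state whether the 3.55% gain is absolute or relative, and this sketch does not resolve that. The function name and example labels are illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# 0 = background, 1 = crack (hypothetical toy labels)
target = [0, 0, 0, 1, 1, 0]
pred = [0, 0, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```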

Title: Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

Authors: Seungtae Nam, Xiangyu Sun, Gyeongjin Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06234
Pdf URL: https://arxiv.org/pdf/2412.06234
Copy Paste: [[2412.06234]] Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction(https://arxiv.org/abs/2412.06234)
Keywords: generative
Abstract: Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.
摘要：广义前馈高斯模型通过利用来自大型多视图数据集的先验知识，在稀疏视图 3D 重建方面取得了重大进展。然而，由于高斯数量有限，这些模型往往难以表示高频细节。虽然每场景 3D 高斯分层 (3D-GS) 优化中使用的致密化策略可以适应前馈模型，但它可能并不适用于广义场景。在本文中，我们提出了生成致密化，这是一种高效且可推广的方法，用于对前馈模型生成的高斯进行致密化。与迭代分割和克隆原始高斯参数的 3D-GS 致密化策略不同，我们的方法对前馈模型中的特征表示进行上采样，并在一次前向传递中生成它们相应的精细高斯，利用嵌入的先验知识来增强泛化。在对象级和场景级重建任务上的实验结果表明，我们的方法优于具有相当或更小模型尺寸的最先进的方法，在表示精细细节方面取得了显着的改善。

Title: VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition

Authors: Michael Yeung, Toya Teramoto, Songtao Wu, Tatsuo Fujiwara, Kenji Suzuki, Tamaki Kojima
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06235
Pdf URL: https://arxiv.org/pdf/2412.06235
Copy Paste: [[2412.06235]] VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition(https://arxiv.org/abs/2412.06235)
Keywords: generation
Abstract: The use of large-scale, web-scraped datasets to train face recognition models has raised significant privacy and bias concerns. Synthetic methods mitigate these concerns and provide scalable and controllable face generation to enable fair and accurate face recognition. However, existing synthetic datasets display limited intraclass and interclass diversity and do not match the face recognition performance obtained using real datasets. Here, we propose VariFace, a two-stage diffusion-based pipeline to create fair and diverse synthetic face datasets to train face recognition models. Specifically, we introduce three methods: Face Recognition Consistency to refine demographic labels, Face Vendi Score Guidance to improve interclass diversity, and Divergence Score Conditioning to balance the identity preservation-intraclass diversity trade-off. When constrained to the same dataset size, VariFace considerably outperforms previous synthetic datasets (0.9200 $\rightarrow$ 0.9405) and achieves comparable performance to face recognition models trained with real data (Real Gap = -0.0065). In an unconstrained setting, VariFace not only consistently achieves better performance compared to previous synthetic methods across dataset sizes but also, for the first time, outperforms the real dataset (CASIA-WebFace) across six evaluation datasets. This sets a new state-of-the-art performance with an average face verification accuracy of 0.9567 (Real Gap = +0.0097) across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets and 0.9366 (Real Gap = +0.0380) on the RFW dataset.
摘要：使用大规模、从网上抓取的数据集来训练人脸识别模型引发了严重的隐私和偏见问题。合成方法可以缓解这些问题，并提供可扩展且可控制的人脸生成，从而实现公平准确的人脸识别。然而，现有的合成数据集显示出有限的类内和类间多样性，与使用真实数据集获得的人脸识别性能不匹配。在这里，我们提出了 VariFace，这是一种基于两阶段扩散的管道，用于创建公平和多样化的合成人脸数据集来训练人脸识别模型。具体来说，我们引入了三种方法：人脸识别一致性以改进人口统计标签，人脸 Vendi 分数指导以提高类间多样性，以及发散分数调节以平衡身份保存-类内多样性权衡。当限制在相同的数据集大小时，VariFace 的表现远远优于之前的合成数据集（0.9200 $\rightarrow$ 0.9405），并且实现了与使用真实数据训练的人脸识别模型相当的性能（真实差距 = -0.0065）。在不受约束的设置下，VariFace 不仅在所有数据集大小上始终比以前的合成方法取得更好的性能，而且首次在六个评估数据集上超越了真实数据集 (CASIA-WebFace)。这创下了新的最佳性能，在 LFW、CFP-FP、CPLFW、AgeDB 和 CALFW 数据集上的平均人脸验证准确率为 0.9567（真实差距 = +0.0097），在 RFW 数据集上的平均人脸验证准确率为 0.9366（真实差距 = +0.0380）。
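
A short worked check of the reported numbers, assuming that "Real Gap" means synthetic accuracy minus real-data accuracy. The abstract does not define the term, so this reading is an assumption. Under it, the constrained and unconstrained averages imply the same real-data baseline, which suggests the reading is internally consistent. The implied baselines are derived here and are not stated in the abstract.

```python
def implied_real_accuracy(synthetic_acc, real_gap):
    """Assumes Real Gap = synthetic accuracy - real accuracy (not defined in the abstract)."""
    return round(synthetic_acc - real_gap, 4)

constrained = implied_real_accuracy(0.9405, -0.0065)   # same dataset size as real data
unconstrained = implied_real_accuracy(0.9567, 0.0097)  # five-benchmark average
rfw = implied_real_accuracy(0.9366, 0.0380)            # RFW dataset

print(constrained, unconstrained, rfw)
```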

Title: Flow Matching Guide and Code

Authors: Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, Itai Gat
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.06264
Pdf URL: https://arxiv.org/pdf/2412.06264
Copy Paste: [[2412.06264]] Flow Matching Guide and Code(https://arxiv.org/abs/2412.06264)
Keywords: generation, generative
Abstract: Flow Matching (FM) is a recent framework for generative modeling that has achieved state-of-the-art performance across various domains, including image, video, audio, speech, and biological structures. This guide offers a comprehensive and self-contained review of FM, covering its mathematical foundations, design choices, and extensions. By also providing a PyTorch package featuring relevant examples (e.g., image and text generation), this work aims to serve as a resource for both novice and experienced researchers interested in understanding, applying and further developing FM.
摘要：流匹配 (FM) 是一种最新的生成建模框架，在图像、视频、音频、语音和生物结构等各个领域都取得了一流的性能。本指南对 FM 进行了全面而完整的回顾，涵盖了其数学基础、设计选择和扩展。通过提供包含相关示例（例如图像和文本生成）的 PyTorch 包，这项工作旨在为有兴趣了解、应用和进一步开发 FM 的新手和经验丰富的研究人员提供资源。
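
A minimal sketch of a conditional Flow Matching training objective in one dimension. It assumes a standard normal source, a linear interpolation path x_t = (1 - t) x_0 + t x_1, and the corresponding target velocity x_1 - x_0. This is one common design choice and not necessarily the one the guide emphasises. It does not use the paper's PyTorch package, and the function name and `t_max` cutoff are illustrative.

```python
import random

def cfm_loss(velocity, data, n_samples=1000, seed=0, t_max=0.999):
    """Monte Carlo estimate of E[(v(x_t, t) - (x_1 - x_0))^2] on a linear path."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x1 = rng.choice(data)           # sample from the data distribution
        x0 = rng.gauss(0.0, 1.0)        # sample from the source distribution
        t = rng.uniform(0.0, t_max)     # stay away from t = 1 for numerical safety
        xt = (1.0 - t) * x0 + t * x1    # point on the conditional path
        target = x1 - x0                # conditional velocity along that path
        total += (velocity(xt, t) - target) ** 2
    return total / n_samples

# If all data sit at a single point c, the exact field is v(x, t) = (c - x) / (1 - t).
c = 2.0
exact_loss = cfm_loss(lambda x, t: (c - x) / (1.0 - t), [c])
zero_loss = cfm_loss(lambda x, t: 0.0, [c])
```

In practice `velocity` would be a neural network trained by minimising this loss. The single-point data set is used only because its exact velocity field is known in closed form, so the loss can be checked against zero.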

Title: Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction

Authors: Dongxu Wei, Zhiqi Li, Peidong Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06273
Pdf URL: https://arxiv.org/pdf/2412.06273
Copy Paste: [[2412.06273]] Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction(https://arxiv.org/abs/2412.06273)
Keywords: generation
Abstract: Prior works employing pixel-based Gaussian representation have demonstrated efficacy in feed-forward sparse-view reconstruction. However, such representation necessitates cross-view overlap for accurate depth estimation, and is challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which limits their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction. Furthermore, we extend our method with diffusion models, pioneering feed-forward multi-modal generation of 3D driving scenes.
摘要：先前的研究采用基于像素的高斯表示，已证明其在前馈稀疏视图重建中的有效性。然而，这种表示需要跨视图重叠才能实现准确的深度估计，而且面临着物体遮挡和截头体的挑战。因此，这些方法需要以场景为中心的数据采集来保持跨视图重叠，并需要完整的场景可见性来避免遮挡和截断，这限制了它们在以场景为中心的重建中的适用性。相比之下，在自动驾驶场景中，更实用的范例是以自我为中心的重建，其特点是跨视图重叠最小，遮挡和截断频繁。因此，基于像素的表示的局限性阻碍了先前研究在该任务中的实用性。有鉴于此，本文对不同的表示进行了深入分析，并引入了具有定制网络设计的全高斯表示来补充它们的优势并减轻它们的缺点。实验表明，我们的方法在以自我为中心的重建方面明显优于最先进的方法 pixelSplat 和 MVSplat，并且在以场景为中心的重建方面取得了与之前研究相当的性能。此外，我们利用扩散模型扩展了我们的方法，开创了 3D 驾驶场景的前馈多模态生成。

Title: Neural Garment Dynamic Super-Resolution

Authors: Meng Zhang, Jun Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06285
Pdf URL: https://arxiv.org/pdf/2412.06285
Copy Paste: [[2412.06285]] Neural Garment Dynamic Super-Resolution(https://arxiv.org/abs/2412.06285)
Keywords: super-resolution
Abstract: Achieving efficient, high-fidelity, high-resolution garment simulation is challenging due to its computational demands. Conversely, low-resolution garment simulation is more accessible and ideal for low-budget devices like smartphones. In this paper, we introduce a lightweight, learning-based method for garment dynamic super-resolution, designed to efficiently enhance high-resolution, high-frequency details in low-resolution garment simulations. Starting with low-resolution garment simulation and underlying body motion, we utilize a mesh-graph-net to compute super-resolution features based on coarse garment dynamics and garment-body interactions. These features are then used by a hyper-net to construct an implicit function of detailed wrinkle residuals for each coarse mesh triangle. Considering the influence of coarse garment shapes on detailed wrinkle performance, we correct the coarse garment shape and predict detailed wrinkle residuals using these implicit functions. Finally, we generate detailed high-resolution garment geometry by applying the detailed wrinkle residuals to the corrected coarse garment. Our method enables roll-out prediction by iteratively using its predictions as input for subsequent frames, producing fine-grained wrinkle details to enhance the low-resolution simulation. Despite training on a small dataset, our network robustly generalizes to different body shapes, motions, and garment types not present in the training data. We demonstrate significant improvements over state-of-the-art alternatives, particularly in enhancing the quality of high-frequency, fine-grained wrinkle details.
摘要：由于计算需求，实现高效、高保真、高分辨率的服装模拟具有挑战性。相反，低分辨率服装模拟更容易获得，并且非常适合智能手机等低预算设备。在本文中，我们介绍了一种轻量级的、基于学习的服装动态超分辨率方法，旨在有效增强低分辨率服装模拟中的高分辨率、高频细节。从低分辨率服装模拟和底层身体运动开始，我们利用网格图网络根据粗糙服装动态和服装与身体的相互作用计算超分辨率特征。然后，超网络使用这些特征为每个粗网格三角形构建详细皱纹残差的隐式函数。考虑到粗糙服装形状对详细皱纹性能的影响，我们使用这些隐式函数校正粗糙服装形状并预测详细皱纹残差。最后，我们通过将详细皱纹残差应用于校正后的粗糙服装来生成详细的高分辨率服装几何形状。我们的方法通过迭代地将其预测用作后续帧的输入来实现滚动预测，从而产生细粒度的皱纹细节以增强低分辨率模拟。尽管在小型数据集上进行训练，但我们的网络可以稳健地推广到训练数据中不存在的不同体形、动作和服装类型。与最先进的替代方案相比，我们展示了显着的改进，特别是在提高高频、细粒度皱纹细节的质量方面。

Title: LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations

Authors: Mingjie Xu, Mengyang Wu, Yuzhi Zhao, Jason Chun Lok Li, Weifeng Ou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06322
Pdf URL: https://arxiv.org/pdf/2412.06322
Copy Paste: [[2412.06322]] LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations(https://arxiv.org/abs/2412.06322)
Keywords: generation
Abstract: Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs' inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that LLaVA-SpaceSGG outperforms other open-vocabulary SGG methods, boosting recall by 8.6% and mean recall by 28.4% compared to the baseline. Our codebase, dataset, and trained models are publicly accessible on GitHub at the following URL: this https URL.
摘要：场景图生成 (SGG) 将视觉场景转换为结构化图形表示，为复杂的视觉任务提供更深入的场景理解。然而，现有的 SGG 模型通常会忽略必要的空间关系，并且在开放词汇环境中难以进行泛化。为了解决这些限制，我们提出了 LLaVA-SpaceSGG，这是一种多模态大型语言模型 (MLLM)，专为开放词汇 SGG 设计，具有增强的空间关系建模。为了训练它，我们收集了 SGG 指令调整数据集，名为 SpaceSGG。该数据集是通过结合公开可用的数据集并使用我们数据构建管道内的开源模型合成数据构建的。它结合了对象位置、对象关系和深度信息，从而产生了三种数据格式：空间 SGG 描述、问答和对话。为了增强 MLLM 固有功能向 SGG 任务的转移，我们引入了一个两阶段训练范式。实验表明，LLaVA-SpaceSGG 的表现优于其他开放词汇 SGG 方法，与基线相比，召回率提高了 8.6%，平均召回率提高了 28.4%。我们的代码库、数据集和经过训练的模型可在 GitHub 上通过以下网址公开访问：此 https URL。

Title: HAIFAI: Human-AI Collaboration for Mental Face Reconstruction

Authors: Florian Strohm, Mihai Bâce, Andreas Bulling
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06323
Pdf URL: https://arxiv.org/pdf/2412.06323
Copy Paste: [[2412.06323]] HAIFAI: Human-AI Collaboration for Mental Face Reconstruction(https://arxiv.org/abs/2412.06323)
Keywords: generative
Abstract: We present HAIFAI, a novel collaborative human-AI system to tackle the challenging task of reconstructing a visual representation of a face that exists only in a person's mind. Users iteratively rank images presented by the AI system based on their resemblance to a mental image. These rankings, in turn, allow the system to extract relevant image features, fuse them into a unified feature vector, and use a generative model to reconstruct the mental image. We also propose an extension called HAIFAI-X that allows users to manually refine and further improve the reconstruction using an easy-to-use slider interface. To avoid the need for tedious human data collection for model training, we introduce a computational user model of human ranking behaviour. For this, we collected a small face ranking dataset through an online crowd-sourcing study containing data from 275 participants. We evaluate HAIFAI and HAIFAI-X in a 12-participant user study and show that HAIFAI outperforms the previous state of the art regarding reconstruction quality, usability, perceived workload, and reconstruction speed. HAIFAI-X achieves even better reconstruction quality at the cost of reduced usability and increased perceived workload and reconstruction time. We further validate the reconstructions in a subsequent face ranking study with 18 participants and show that HAIFAI-X achieves a new state-of-the-art identification rate of 60.6%. These findings represent a significant advancement towards developing new collaborative intelligent systems capable of reliably and effortlessly reconstructing a user's mental image.
摘要：我们推出了 HAIFAI，这是一种新型的协作式人机协作系统，用于解决重建仅存在于人脑中的面部视觉表征这一具有挑战性的任务。用户根据 AI 系统呈现的图像与心理图像的相似性，对其进行迭代排序。这些排序反过来又允许系统提取相关的图像特征，将它们融合为统一的特征向量，并使用生成模型重建心理图像。我们还提出了一个名为 HAIFAI-X 的扩展，它允许用户使用易于使用的滑块界面手动优化和进一步改进重建。为了避免需要进行繁琐的人工数据收集以进行模型训练，我们引入了人类排名行为的计算用户模型。为此，我们通过一项在线众包研究收集了一个小型面部排名数据集，其中包含来自 275 名参与者的数据。我们在一项有 12 名参与者的用户研究中评估了 HAIFAI 和 HAIFAI-X，结果表明 HAIFAI 在重建质量、可用性、感知工作量和重建速度方面均优于之前最先进的技术。HAIFAI-X 实现了更好的重建质量，但代价是可用性降低、感知工作量增加和重建时间增加。我们在随后的一项有 18 名参与者的面部排名研究中进一步验证了重建结果，结果表明 HAIFAI-X 实现了 60.6% 的最新最先进的识别率。这些发现代表着在开发能够可靠且轻松地重建用户心理形象的新型协作智能系统方面取得了重大进展。

Title: Normalizing Flows are Capable Generative Models

Authors: Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, Josh Susskind
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06329
Pdf URL: https://arxiv.org/pdf/2412.06329
Copy Paste: [[2412.06329]] Normalizing Flows are Capable Generative Models(https://arxiv.org/abs/2412.06329)
Keywords: generative
Abstract: Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post-training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at this https URL.
摘要：正则化流 (NF) 是基于似然的连续输入模型。它们在密度估计和生成建模任务上都表现出了良好的效果，但近年来受到的关注相对较少。在这项工作中，我们证明了 NF 比以前认为的更强大。我们提出了 \textit{TarFlow}：一种简单且可扩展的架构，可实现高性能的 NF 模型。TarFlow 可以被认为是基于 Transformer 的 Masked Autoregressive Flows (MAF) 变体：它由图像块上的一堆自回归 Transformer 块组成，在层之间交替自回归方向。TarFlow 易于端到端训练，能够直接建模和生成像素。我们还提出了三种提高样本质量的关键技术：训练期间的高斯噪声增强、训练后去噪程序以及针对类条件和无条件设置的有效指导方法。综合起来，TarFlow 在图像似然估计方面取得了新的先进成果，大大超越了之前的最佳方法，并首次使用独立的 NF 模型生成了与扩散模型质量和多样性相当的样本。我们的代码可在 \href{此 https URL}{此 https URL} 上找到。
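
A small sketch of the affine autoregressive transform that underlies Masked Autoregressive Flows, with the direction reversed between two layers as the abstract describes. The conditioner here is a tiny linear map followed by `tanh`. It is a placeholder assumption standing in for TarFlow's Transformer blocks over image patches, which are not reproduced. All function names and weights are illustrative.

```python
import math

def conditioner(prefix, w_mu, w_alpha):
    """Toy conditioner (assumption): shift and log-scale depend only on earlier elements."""
    mu = sum(w * v for w, v in zip(w_mu, prefix))
    alpha = math.tanh(sum(w * v for w, v in zip(w_alpha, prefix)))
    return mu, alpha

def forward(x, W_mu, W_alpha):
    """z_i = (x_i - mu_i(x_<i)) * exp(-alpha_i(x_<i)); log|det J| = -sum_i alpha_i."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i], W_mu[i], W_alpha[i])
        z.append((x[i] - mu) * math.exp(-alpha))
        log_det -= alpha
    return z, log_det

def inverse(z, W_mu, W_alpha):
    """Sequential inversion: x_i needs the already recovered x_<i."""
    x = []
    for i in range(len(z)):
        mu, alpha = conditioner(x, W_mu[i], W_alpha[i])
        x.append(z[i] * math.exp(alpha) + mu)
    return x

W_mu = [[0.5, -0.3, 0.2]] * 4      # row i uses only its first i weights
W_alpha = [[0.1, 0.4, -0.2]] * 4
x = [0.3, -1.2, 0.7, 2.0]

# Two layers with alternating autoregression direction (reverse between layers).
h, ld1 = forward(x, W_mu, W_alpha)
z, ld2 = forward(h[::-1], W_mu, W_alpha)
total_log_det = ld1 + ld2

# Invert the stack in the opposite order.
h_rec = inverse(z, W_mu, W_alpha)[::-1]
x_rec = inverse(h_rec, W_mu, W_alpha)
```

The forward pass can be evaluated for all positions at once given `x`, while inversion (sampling) is inherently sequential. That asymmetry is a general property of this family of flows.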

Title: UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts

Authors: Zhen Wan, Yue Ma, Chenyang Qi, Zhiheng Liu, Tao Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06340
Pdf URL: https://arxiv.org/pdf/2412.06340
Copy Paste: [[2412.06340]] UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts(https://arxiv.org/abs/2412.06340)
Keywords: generative
Abstract: In this paper, we present UniPaint, a unified generative space-time video inpainting framework that enables spatial-temporal inpainting and interpolation. Different from existing methods that treat video inpainting and video interpolation as two distinct tasks, we leverage a unified inpainting framework to tackle them and observe that these two tasks can mutually enhance synthesis performance. Specifically, we first introduce a plug-and-play space-time video inpainting adapter, which can be employed in various personalized models. The key insight is to propose a Mixture of Experts (MoE) attention to cover various tasks. Then, we design a spatial-temporal masking strategy during the training stage to mutually enhance each other and improve performance. UniPaint produces high-quality and aesthetically pleasing results, achieving the best quantitative results across various tasks and scale setups. The code and checkpoints will be available soon.
摘要：在本文中，我们提出了 UniPaint，这是一个统一的生成时空视频修复框架，可实现时空修复和插值。与将视频修复和视频插值视为两个不同任务的现有方法不同，我们利用统一的修复框架来解决它们，并观察到这两个任务可以相互提高综合性能。具体来说，我们首先引入一个即插即用的时空视频修复适配器，可用于各种个性化模型。关键见解是提出一种混合专家 (MoE) 注意力来涵盖各种任务。然后，我们在训练阶段设计了一种时空掩蔽策略，以相互增强并提高性能。UniPaint 可产生高质量且美观的结果，在各种任务和规模设置中实现最佳定量结果。代码和检查点将很快推出。

Title: Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Authors: Joshua Freeman, Chloe Rippe, Edoardo Debenedetti, Maksym Andriushchenko
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06370
Pdf URL: https://arxiv.org/pdf/2412.06370
Copy Paste: [[2412.06370]] Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit(https://arxiv.org/abs/2412.06370)
Keywords: generative
Abstract: Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 has infringed its copyrights by reproducing articles for use in LLM training and by memorizing the inputs, thereby publicly displaying them in LLM outputs. Our work aims to measure the propensity of OpenAI's LLMs to exhibit verbatim memorization in its outputs relative to other LLMs, specifically focusing on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of the memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be placed on preventing verbatim memorization in very large models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times's copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative AI, law, and policy.
摘要：由于 2023 年 12 月提起的《纽约时报》诉 OpenAI 诉讼，前沿法学硕士中的版权侵权问题最近受到了广泛关注。《纽约时报》声称 GPT-4 通过复制文章用于法学硕士培训并记忆输入，从而在法学硕士输出中公开展示它们，侵犯了其版权。我们的工作旨在衡量 OpenAI 的法学硕士相对于其他法学硕士（特别是新闻文章）在其输出中表现出逐字记忆的倾向。我们发现 GPT 和 Claude 模型都使用拒绝训练和输出过滤器来防止逐字输出记忆的文章。我们应用了一个基本的提示模板来绕过拒绝训练，并表明 OpenAI 模型目前比 Meta、Mistral 和 Anthropic 的模型更不容易引起记忆。我们发现，随着模型规模的增加，尤其是超过 1000 亿个参数，它们表现出显著增强的记忆能力。我们的发现对训练具有实际意义：必须更加注意防止非常大的模型中出现逐字记忆。我们的发现还具有法律意义：在评估 OpenAI 法学硕士的相对记忆能力时，我们探讨了《纽约时报》的版权侵权索赔和 OpenAI 的法律辩护的力度，同时强调了生成式人工智能、法律和政策交叉点的问题。

Title: Exploring the Impact of Synthetic Data on Human Gesture Recognition Tasks Using GANs

Authors: George Kontogiannis, Pantelis Tzamalis, Sotiris Nikoletseas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.06389
Pdf URL: https://arxiv.org/pdf/2412.06389
Copy Paste: [[2412.06389]] Exploring the Impact of Synthetic Data on Human Gesture Recognition Tasks Using GANs(https://arxiv.org/abs/2412.06389)
Keywords: generation, generative
Abstract: In the evolving domain of Human Activity Recognition (HAR) using Internet of Things (IoT) devices, there is an emerging interest in employing Deep Generative Models (DGMs) to address data scarcity, enhance data quality, and improve classification metric scores. Among these types of models, Generative Adversarial Networks (GANs) have arisen as a powerful tool for generating synthetic data that mimic real-world scenarios with high fidelity. However, Human Gesture Recognition (HGR), a subset of HAR, remains largely unexplored, particularly in healthcare applications that use time series data such as allergic gestures. In this paper, we examine and evaluate the performance of two GANs in the generation of synthetic gesture motion data that form part of an open-source benchmark dataset. The data is related to the disease identification domain and healthcare, specifically to allergic rhinitis. We also focus on these AI models' performance in terms of fidelity, diversity, and privacy. Furthermore, we examine whether synthetic data can substitute for real data in training scenarios, and how well models trained on synthetic data generalize to allergic rhinitis gestures. In our work, these gestures are related to 6-axis accelerometer and gyroscope data, serving as multivariate time series instances, and retrieved from smart wearable devices. To the best of our knowledge, this study is the first to explore the feasibility of synthesizing motion gestures for allergic rhinitis from wearable IoT device data using Generative Adversarial Networks (GANs) and to test their impact on the generalization of gesture recognition systems. It is worth noting that, even though our method has been applied to a specific category of gestures, it is designed to generalize and can also be deployed on other motion data in the HGR domain.

Title: Generative Lines Matching Models

Authors: Ori Matityahu, Raanan Fattal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06403
Pdf URL: https://arxiv.org/pdf/2412.06403
Copy Paste: [[2412.06403]] Generative Lines Matching Models(https://arxiv.org/abs/2412.06403)
Keywords: generative
Abstract: In this paper we identify the source of a singularity in the training loss of key denoising models, that causes the denoiser's predictions to collapse towards the mean of the source or target distributions. This degeneracy creates false basins of attraction, distorting the denoising trajectories and ultimately increasing the number of steps required to sample these models. We circumvent this artifact by leveraging the deterministic ODE-based samplers, offered by certain denoising diffusion and score-matching models, which establish a well-defined change-of-variables between the source and target distributions. Given this correspondence, we propose a new probability flow model, the Lines Matching Model (LMM), which matches globally straight lines interpolating the two distributions. We demonstrate that the flow fields produced by the LMM exhibit notable temporal consistency, resulting in trajectories with excellent straightness scores. Beyond its sampling efficiency, the LMM formulation allows us to enhance the fidelity of the generated samples by integrating domain-specific reconstruction and adversarial losses, and by optimizing its training for the sampling procedure used. Overall, the LMM achieves state-of-the-art FID scores with minimal NFEs on established benchmark datasets: 1.57/1.39 (NFE=1/2) on CIFAR-10, 1.47/1.17 on ImageNet 64x64, and 2.68/1.54 on AFHQ 64x64. Finally, we provide a theoretical analysis showing that the use of optimal transport to relate the two distributions suffers from a curse of dimensionality, where the pairing set size (mini-batch) must scale exponentially with the signal dimension.

Title: World-Consistent Data Generation for Vision-and-Language Navigation

Authors: Yu Zhong, Rui Zhang, Zihao Zhang, Shuo Wang, Chuan Fang, Xishan Zhang, Jiaming Guo, Shaohui Peng, Di Huang, Yanyang Yan, Xing Hu, Ping Tan, Qi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06413
Pdf URL: https://arxiv.org/pdf/2412.06413
Copy Paste: [[2412.06413]] World-Consistent Data Generation for Vision-and-Language Navigation(https://arxiv.org/abs/2412.06413)
Keywords: generation
Abstract: Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through photorealistic environments following natural-language instructions. One main obstacle in VLN is data scarcity, which leads to poor generalization performance in unseen environments. Though data augmentation is a promising way to scale up the dataset, how to generate VLN data that is both diverse and world-consistent remains problematic. To cope with this issue, we propose world-consistent data generation (WCGEN), an efficacious data-augmentation framework satisfying both diversity and world-consistency, aimed at enhancing the generalization of agents to novel environments. Roughly, our framework consists of two stages: the trajectory stage, which leverages a point-cloud based technique to ensure spatial coherency among viewpoints, and the viewpoint stage, which adopts a novel angle synthesis method to guarantee spatial and wraparound consistency within the entire observation. By accurately predicting viewpoint changes with 3D knowledge, our approach maintains world-consistency during the generation procedure. Experiments on a wide range of datasets verify the effectiveness of our method, demonstrating that our data augmentation strategy enables agents to achieve new state-of-the-art results on all navigation tasks, and is capable of enhancing the VLN agents' generalization ability to unseen environments.

Title: How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning

Authors: Yuanyuan Wang, Qian Song, Dawood Wasif, Muhammad Shahzad, Christoph Koller, Jonathan Bamber, Xiao Xiang Zhu
Subjects: cs.LG, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.06451
Pdf URL: https://arxiv.org/pdf/2412.06451
Copy Paste: [[2412.06451]] How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning(https://arxiv.org/abs/2412.06451)
Keywords: generation
Abstract: Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via this https URL.

Title: AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis

Authors: Shidan He, Lei Liu, Shen Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06510
Pdf URL: https://arxiv.org/pdf/2412.06510
Copy Paste: [[2412.06510]] AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis(https://arxiv.org/abs/2412.06510)
Keywords: generation
Abstract: Anomaly synthesis is a crucial approach to augment abnormal data for advancing anomaly inspection. Based on the knowledge from the large-scale pre-training, existing text-to-image anomaly synthesis methods predominantly focus on textual information or coarse-aligned visual features to guide the entire generation process. However, these methods often lack sufficient descriptors to capture the complicated characteristics of realistic anomalies (e.g., the fine-grained visual pattern of anomalies), limiting the realism and generalization of the generation process. To this end, we propose a novel anomaly synthesis framework called AnomalyControl to learn cross-modal semantic features as guidance signals, which could encode the generalized anomaly cues from text-image reference prompts and improve the realism of synthesized abnormal samples. Specifically, AnomalyControl adopts a flexible and non-matching prompt pair (i.e., a text-image reference prompt and a targeted text prompt), where a Cross-modal Semantic Modeling (CSM) module is designed to extract cross-modal semantic features from the textual and visual descriptors. Then, an Anomaly-Semantic Enhanced Attention (ASEA) mechanism is formulated to allow CSM to focus on the specific visual patterns of the anomaly, thus enhancing the realism and contextual relevance of the generated anomaly features. Treating cross-modal semantic features as the prior, a Semantic Guided Adapter (SGA) is designed to encode effective guidance signals for the adequate and controllable synthesis process. Extensive experiments indicate that AnomalyControl can achieve state-of-the-art results in anomaly synthesis compared with existing methods while exhibiting superior performance for downstream tasks.

Title: When Dimensionality Reduction Meets Graph (Drawing) Theory: Introducing a Common Framework, Challenges and Opportunities

Authors: Fernando Paulovich, Alessio Arleo, Stef van den Elzen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.06555
Pdf URL: https://arxiv.org/pdf/2412.06555
Copy Paste: [[2412.06555]] When Dimensionality Reduction Meets Graph (Drawing) Theory: Introducing a Common Framework, Challenges and Opportunities(https://arxiv.org/abs/2412.06555)
Keywords: generation
Abstract: In the vast landscape of visualization research, Dimensionality Reduction (DR) and graph analysis are two popular subfields, often essential to most visual data analytics setups. DR aims to create representations to support neighborhood and similarity analysis on complex, large datasets. Graph analysis focuses on identifying the salient topological properties and key actors within networked data, with specialized research on investigating how such features could be presented to the user to ease the comprehension of the underlying structure. Although these two disciplines are typically regarded as disjoint subfields, we argue that both fields share strong similarities and synergies that can potentially benefit both. Therefore, this paper discusses and introduces a unifying framework to help bridge the gap between DR and graph (drawing) theory. Our goal is to use the strongly math-grounded graph theory to improve the overall process of creating DR visual representations. We propose how to break the DR process into well-defined stages, discussing how to match some of the DR state-of-the-art techniques to this framework and presenting ideas on how graph drawing, topology features, and some popular algorithms and strategies used in graph analysis can be employed to improve DR topology extraction, embedding generation, and result validation. We also discuss the challenges and identify opportunities for implementing and using our framework, opening directions for future visualization research.

Title: MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

Authors: Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06614
Pdf URL: https://arxiv.org/pdf/2412.06614
Copy Paste: [[2412.06614]] MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences(https://arxiv.org/abs/2412.06614)
Keywords: generation
Abstract: Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL$\cdot$E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a fairer and more transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and that MVP consistently enhances the alignment of multi-view diffusion models with human preferences.
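
Note (illustrative, not from the paper): the abstract says a reward model is trained on pairwise comparisons but does not state the loss. A common choice for such data is the Bradley-Terry negative log-likelihood, sketched below as an assumption. The function name and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred item wins under a Bradley-Terry model.

    Equals -log(sigmoid(score_preferred - score_rejected)).
    """
    margin = score_preferred - score_rejected
    # softplus(-margin), written so that exp() never overflows for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward scores for two multi-view assets from one annotated pair.
print(pairwise_preference_loss(2.0, 0.5))   # small loss: ranking agrees with the annotator
print(pairwise_preference_loss(0.5, 2.0))   # larger loss: ranking disagrees
```

Minimizing this loss over all annotated pairs pushes the model to score the human-preferred asset higher than the rejected one.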

Title: Copyright-Protected Language Generation via Adaptive Model Fusion

Authors: Javier Abad, Konstantin Donhauser, Francesco Pinto, Fanny Yang
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.06619
Pdf URL: https://arxiv.org/pdf/2412.06619
Copy Paste: [[2412.06619]] Copyright-Protected Language Generation via Adaptive Model Fusion(https://arxiv.org/abs/2412.06619)
Keywords: generation
Abstract: The risk of language models reproducing copyrighted material from their training data has led to the development of various protective measures. Among these, inference-time strategies that impose constraints via post-processing have shown promise in addressing the complexities of copyright regulation. However, they often incur prohibitive computational costs or suffer from performance trade-offs. To overcome these limitations, we introduce Copyright-Protecting Model Fusion (CP-Fuse), a novel approach that combines models trained on disjoint sets of copyrighted material during inference. In particular, CP-Fuse adaptively aggregates the model outputs to minimize the reproduction of copyrighted content, adhering to a crucial balancing property that prevents the regurgitation of memorized data. Through extensive experiments, we show that CP-Fuse significantly reduces the reproduction of protected material without compromising the quality of text and code generation. Moreover, its post-hoc nature allows seamless integration with other protective measures, further enhancing copyright safeguards. Lastly, we show that CP-Fuse is robust against common techniques for extracting training data.

Title: The Narrow Gate: Localized Image-Text Communication in Vision-Language Models

Authors: Alessandro Serra, Francesco Ortu, Emanuele Panizon, Lucrezia Valeriani, Lorenzo Basile, Alessio Ansuini, Diego Doimo, Alberto Cazzaniga
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06646
Pdf URL: https://arxiv.org/pdf/2412.06646
Copy Paste: [[2412.06646]] The Narrow Gate: Localized Image-Text Communication in Vision-Language Models(https://arxiv.org/abs/2412.06646)
Keywords: generation
Abstract: Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models vary in how information is exchanged from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information. We demonstrate that ablating this single token significantly deteriorates performance on image understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.

Title: Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion

Authors: Shuaiting Li, Juncan Deng, Zeyu Wang, Hong Gu, Kedong Xu, Haibin Shen, Kejie Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06661
Pdf URL: https://arxiv.org/pdf/2412.06661
Copy Paste: [[2412.06661]] Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion(https://arxiv.org/abs/2412.06661)
Keywords: generation
Abstract: Text-to-image generation of Stable Diffusion models has achieved notable success due to its remarkable generation ability. However, the repetitive denoising process is computationally intensive during inference, which renders Diffusion models less suitable for real-world applications that require low latency and scalability. Recent studies have employed post-training quantization (PTQ) and quantization-aware training (QAT) methods to compress Diffusion models. Nevertheless, prior research has often neglected to examine the consistency between results generated by quantized models and those from floating-point models. This consistency is crucial in fields such as content creation, design, and edge deployment, as it can significantly enhance both efficiency and system stability for practitioners. To ensure that quantized models generate high-quality and consistent images, we propose an efficient quantization framework for Stable Diffusion models. Our approach features a Serial-to-Parallel calibration pipeline that addresses the consistency of both the calibration and inference processes, as well as ensuring training stability. Based on this pipeline, we further introduce a mix-precision quantization strategy, multi-timestep activation quantization, and time information precalculation techniques to ensure high-fidelity generation in comparison to floating-point models. Through extensive experiments with Stable Diffusion v1-4, v2-1, and XL 1.0, we have demonstrated that our method outperforms the current state-of-the-art techniques when tested on prompts from the COCO validation dataset and the Stable-Diffusion-Prompts dataset. Under W4A8 quantization settings, our approach enhances both distribution similarity and visual similarity by 45%-60%.
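
Note (illustrative, not from the paper): "W4A8" denotes 4-bit weights and 8-bit activations. The sketch below shows plain symmetric uniform quantization at those two bit widths to make the setting concrete. It is a generic textbook quantizer, not the paper's Serial-to-Parallel pipeline or its mixed-precision strategy, and the sample values are made up.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantization: snap to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits, 127 for 8 bits
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)               # all zeros: nothing to quantize
    scale = peak / qmax                   # float step between adjacent grid points
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


weights = [0.7, -0.35, 0.1, 0.0]          # hypothetical layer weights
activations = [1.27, -0.5, 0.01]          # hypothetical activations

w4 = fake_quantize(weights, 4)            # W4: 4-bit weights
a8 = fake_quantize(activations, 8)        # A8: 8-bit activations
print(w4)
print(a8)
```

The 4-bit grid has only 15 levels, so weight rounding error is much coarser than activation error at 8 bits. This is one reason consistency with the floating-point model is hard to preserve at low bit widths.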

Title: ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Authors: Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, Hang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06673
Pdf URL: https://arxiv.org/pdf/2412.06673
Copy Paste: [[2412.06673]] ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance(https://arxiv.org/abs/2412.06673)
Keywords: generation
Abstract: In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation. To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer that incorporates semantic information and a progressive multi-stage training procedure. This approach reduces the dataset size to just 15M for pretraining -- over four times fewer than what is typically needed -- while achieving competitive or even superior performance with existing unified MLLMs, such as Janus. Additionally, to promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme. This scheme supervises the MLLM to self-assess the consistency between text descriptions and self-generated images, facilitating the model to interpret images more accurately and avoid unrealistic and incorrect predictions caused by misalignment in image generation. Based on extensive experiments, our proposed ILLUME stands out and competes with state-of-the-art unified MLLMs and specialized models across various benchmarks for multimodal understanding, generation, and editing.

Title: EMOv2: Pushing 5M Vision Model Frontier

Authors: Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06674
Pdf URL: https://arxiv.org/pdf/2412.06674
Copy Paste: [[2412.06674]] EMOv2: Pushing 5M Vision Model Frontier(https://arxiv.org/abs/2412.06674)
Keywords: generation
Abstract: This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1, significantly surpassing equal-order CNN-/Attention-based models. At the same time, the EMOv2-5M-equipped RetinaNet achieves 41.5 mAP for object detection tasks, surpassing the previous EMO-5M by +2.6. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level. Code is available at this https URL.

Title: Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach

Authors: Weichao Xu, Huaxin Pei, Jingxuan Yang, Yuchen Shi, Yi Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.06684
Pdf URL: https://arxiv.org/pdf/2412.06684
Copy Paste: [[2412.06684]] Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach(https://arxiv.org/abs/2412.06684)
Keywords: generation
Abstract: Recent years have witnessed surprising achievements of decision-making policies across various fields, such as autonomous driving and robotics. Testing for decision-making policies is crucial with the existence of critical scenarios that may threaten their reliability. Numerous research efforts have been dedicated to testing these policies. However, there are still significant challenges, such as low testing efficiency and diversity due to the complexity of the policies and environments under test. Inspired by the remarkable capabilities of large language models (LLMs), in this paper, we propose an LLM-driven online testing framework for efficiently testing decision-making policies. The main idea is to employ an LLM-based test scenario generator to intelligently generate challenging test cases through contemplation and reasoning. Specifically, we first design a "generate-test-feedback" pipeline and apply templated prompt engineering to fully leverage the knowledge and reasoning abilities of LLMs. Then, we introduce a multi-scale scenario generation strategy to address the inherent challenges LLMs face in making fine adjustments, further enhancing testing efficiency. Finally, we evaluate the LLM-driven approach on five widely used benchmarks. The experimental results demonstrate that our method significantly outperforms baseline approaches in uncovering both critical and diverse scenarios.

Title: Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Authors: Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06698
Pdf URL: https://arxiv.org/pdf/2412.06698
Copy Paste: [[2412.06698]] Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy(https://arxiv.org/abs/2412.06698)
Keywords: generation
Abstract: Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee that the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via our elegantly designed process that synchronizes two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of the 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments in image-based object and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate the strong generalization ability to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released on this https URL.
摘要：从单个 RGB 图像创建逼真的 3D 物体和穿着衣服的头像是一个有吸引力但又具有挑战性的问题。由于其不适定的性质，最近的研究利用在大型数据集上预训练的 2D 扩散模型的强大先验。虽然 2D 扩散模型表现出很强的泛化能力，但它们不能保证生成的多视图图像是 3D 一致的。在本文中，我们提出了 Gen-3Diffusion：通过 2D 和 3D 扩散协同实现逼真的图像到 3D 生成。我们利用预先训练的 2D 扩散模型和 3D 扩散模型，通过我们精心设计的流程在训练和采样时同步两个扩散模型。2D 和 3D 扩散模型之间的协同作用带来两个主要优点：1）2D 有助于 3D 泛化：预训练的 2D 模型对未见图像具有很强的泛化能力，为 3D 扩散模型提供强大的形状先验； 2）3D 有助于 2D 实现多视图一致性：3D 扩散模型增强了 2D 多视图采样过程的 3D 一致性，从而实现更准确的多视图生成。我们通过大量基于图像的物体和穿着衣服的虚拟形象生成任务的实验验证了我们的想法。结果表明，我们的方法可以生成具有高保真几何形状和纹理的逼真的 3D 物体和虚拟形象。大量消融也验证了我们的设计选择，并展示了对各种服装和构图形状的强大泛化能力。我们的代码和预训练模型将在此 https URL 上公开发布。

Title: You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Authors: Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06699
Pdf URL: https://arxiv.org/pdf/2412.06699
Copy Paste: [[2412.06699]] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale(https://arxiv.org/abs/2412.06699)
Keywords: generation
Abstract: Recent 3D generation models typically rely on limited-scale 3D `gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: this https URL
摘要：最近的 3D 生成模型通常依靠有限规模的 3D“黄金标签”或 2D 扩散先验来创建 3D 内容。然而，由于缺乏可扩展的学习范例，它们的性能受到受限 3D 先验的上限。在这项工作中，我们提出了 See3D，这是一个针对大规模互联网视频进行训练的视觉条件多视图扩散模型，用于开放世界 3D 创作。该模型旨在通过仅从庞大且快速增长的视频数据中查看视觉内容来获取 3D 知识——你看到了，你就得到了。为了实现这一目标，我们首先使用提议的数据管理管道来扩大训练数据，该管道会自动过滤掉源视频中的多视图不一致和观察不足。这会产生一个高质量、丰富多样、大规模的多视图图像数据集，称为 WebVi3D，包含来自 16M 视频剪辑的 320M 帧。然而，在没有明确 3D 几何或相机姿势注释的情况下从视频中学习通用 3D 先验并非易事，而为网络规模的视频注释姿势的成本过高。为了消除对姿势条件的需求，我们引入了一种创新的视觉条件 - 通过在蒙版视频数据中添加时间相关噪声来生成纯 2D 感应视觉信号。最后，我们通过将 See3D 集成到基于扭曲的管道中以进行高保真 3D 生成，引入了一种新颖的视觉条件 3D 生成框架。我们在单次和稀疏重建基准上进行的数值和视觉比较表明，在经济高效且可扩展的视频数据上训练的 See3D 实现了显着的零样本和开放世界生成能力，明显优于在昂贵且受限的 3D 数据集上训练的模型。请参阅我们的项目页面：此 https URL

Title: Parkinson's Disease Diagnosis Through Deep Learning: A Novel LSTM-Based Approach for Freezing of Gait Detection

Authors: Aqib Nazir Mir, Iqra Nissar, Mumtaz Ahmed, Sarfaraz Masood, Danish Raza Rizvi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06709
Pdf URL: https://arxiv.org/pdf/2412.06709
Copy Paste: [[2412.06709]] Parkinson's Disease Diagnosis Through Deep Learning: A Novel LSTM-Based Approach for Freezing of Gait Detection(https://arxiv.org/abs/2412.06709)
Keywords: generative
Abstract: Deep learning holds tremendous potential in healthcare for uncovering hidden patterns within extensive clinical datasets, aiding in the diagnosis of various diseases. Parkinson's disease (PD) is a neurodegenerative condition characterized by the deterioration of brain function. In the initial stages of PD, automatic diagnosis poses a challenge due to the similarity in behavior between individuals with PD and those who are healthy. Our objective is to propose an effective model that can aid in the early detection of Parkinson's disease. We employed the VGRF gait signal dataset sourced from Physionet for distinguishing between healthy individuals and those diagnosed with Parkinson's disease. This paper introduces a novel deep learning architecture based on the LSTM network for automatically detecting freezing of gait episodes in Parkinson's disease patients. In contrast to conventional machine learning algorithms, this method eliminates manual feature engineering and proficiently captures prolonged temporal dependencies in gait patterns, thereby improving the diagnosis of Parkinson's disease. The LSTM network resolves the issue of vanishing gradients by employing memory blocks in place of self-connected hidden units, allowing for optimal information assimilation. To prevent overfitting, dropout and L2 regularization techniques have been employed. Additionally, the stochastic gradient-based optimizer Adam is used for the optimization process. The results indicate that our proposed approach surpasses current state-of-the-art models in FOG episode detection, achieving an accuracy of 97.71%, sensitivity of 99%, precision of 98%, and specificity of 96%. This demonstrates its potential as a superior classification method for Parkinson's disease detection.
摘要：深度学习在医疗保健领域具有巨大潜力，可以发现大量临床数据集中的隐藏模式，从而帮助诊断各种疾病。帕金森病 (PD) 是一种神经退行性疾病，其特征是大脑功能退化。在 PD 的初期，由于 PD 患者和健康人的行为相似，自动诊断面临挑战。我们的目标是提出一个有效的模型，帮助早期发现帕金森病。我们使用来自 Physionet 的 VGRF 步态信号数据集来区分健康人和被诊断患有帕金森病的人。本文介绍了一种基于 LSTM 网络的新型深度学习架构，用于自动检测帕金森病患者的步态冻结发作。与传统的机器学习算法相比，这种方法无需手动特征工程，可以熟练地捕捉步态模式中的长时间时间依赖性，从而改善帕金森病的诊断。 LSTM 网络通过使用记忆块代替自连接隐藏单元来解决梯度消失问题，从而实现最佳信息吸收。为了防止过度拟合，我们采用了 dropout 和 L2 正则化技术。此外，优化过程中还使用了基于随机梯度的优化器 Adam。结果表明，我们提出的方法在 FOG 发作检测方面超越了当前最先进的模型，准确率为 97.71%，灵敏度为 99%，精确度为 98%，特异性为 96%。这表明它有可能成为帕金森病检测的卓越分类方法。
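
The four figures reported in the abstract (accuracy, sensitivity, precision, specificity) all follow from a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,    # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),    # recall on the positive (FOG) class
        "precision": tp / (tp + fp),      # share of predicted positives that are real
        "specificity": tn / (tn + fp),    # recall on the negative class
    }

# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

Sensitivity and specificity are reported separately because a detector can score well on one while missing many cases on the other.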

Title: ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet

Authors: Andrei-Robert Alexandrescu, Razvan-Gabriel Petec, Alexandru Manole, Laura-Silvia Diosan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06742
Pdf URL: https://arxiv.org/pdf/2412.06742
Copy Paste: [[2412.06742]] ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet(https://arxiv.org/abs/2412.06742)
Keywords: generation
Abstract: Deep Learning became a ubiquitous paradigm due to its extraordinary effectiveness and applicability in numerous domains. However, the approach suffers from the high demand for data required to achieve the potential of this type of model. An ever-increasing sub-field of Artificial Intelligence, Image Synthesis, aims to address this limitation through the design of intelligent models capable of creating original and realistic images, an endeavour that could drastically reduce the need for real data. The Stable Diffusion generation paradigm recently propelled state-of-the-art approaches to exceed all previous benchmarks. In this work, we propose the ContRail framework based on the novel Stable Diffusion model ControlNet, which we empower through a multi-modal conditioning method. We experiment with the task of synthetic railway image generation, where we improve the performance in rail-specific tasks, such as rail semantic segmentation, by enriching the dataset with realistic synthetic images.
摘要：深度学习因其在众多领域的非凡有效性和适用性而成为一种无处不在的范式。然而，该方法的缺点是需要大量数据才能发挥这种模型的潜力。人工智能中一个不断发展的子领域——图像合成，旨在通过设计能够创建原始逼真图像的智能模型来解决这一限制，从而大大减少对真实数据的需求。稳定扩散生成范式最近推动了最先进的方法超越所有以前的基准。在这项工作中，我们提出了基于新型稳定扩散模型 ControlNet 的 ContRail 框架，并通过多模态调节方法为其提供支持。我们尝试了合成铁路图像生成任务，通过使用逼真的合成图像丰富数据集，我们提高了特定于铁路的任务（例如铁路语义分割）的性能。

Title: Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

Authors: Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.06748
Pdf URL: https://arxiv.org/pdf/2412.06748
Copy Paste: [[2412.06748]] Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models(https://arxiv.org/abs/2412.06748)
Keywords: generation
Abstract: A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries which require information past the model's knowledge horizon. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity for refusing queries of various categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's desired preference over refusal rates. To address these challenges, we propose refusal tokens, one such token for each refusal category or a single refusal token, which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category during inference to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without the need for any further fine-tuning, but only by selectively intervening during generation.
摘要：构建安全可靠的语言模型的一个关键要素是使模型能够适当地拒绝遵循某些指令或回答某些问题。我们可能希望模型针对各种类别的用户查询输出拒绝消息，例如，不适定问题、实施非法行为的指令或需要超出模型知识范围的信息的查询。设计拒绝回答此类问题的模型很复杂，因为个人可能希望他们的模型对拒绝各种类别的查询表现出不同的敏感度，而不同的用户可能希望不同的拒绝率。当前的默认方法涉及使用来自每个类别的不同比例的拒绝消息训练多个模型以实现所需的拒绝率，这在计算上很昂贵，并且可能需要训练一个新模型来适应每个用户对拒绝率的期望偏好。为了应对这些挑战，我们提出了拒绝标记，每个拒绝类别一个这样的标记或一个拒绝标记，这些标记在训练期间被添加到模型的响应中。然后，我们展示了如何在推理过程中增加或减少生成每个类别的拒绝标记的概率，以控制模型的拒绝行为。拒绝标记可以控制单个模型的拒绝率，而无需进一步微调，而只需在生成过程中有选择地进行干预即可。
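
The abstract says the probability of generating the refusal token can be raised or lowered at inference time. One simple way to picture this is a bias added to that token's logit before the softmax. This is an assumed mechanism for illustration, not necessarily the paper's exact procedure, and the token names `[refuse]` and `[respond]` are placeholders.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a {token: logit} mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def steer_first_token(logits, refusal_token, bias):
    """Return first-token probabilities after adding `bias` to the refusal token's logit."""
    adjusted = dict(logits)          # copy so the caller's logits are untouched
    adjusted[refusal_token] += bias
    return softmax(adjusted)

# Hypothetical first-position logits.
logits = {"[refuse]": 0.0, "[respond]": 1.0}
base = steer_first_token(logits, "[refuse]", 0.0)
more = steer_first_token(logits, "[refuse]", 2.0)   # refuse more often
less = steer_first_token(logits, "[refuse]", -2.0)  # refuse less often
```

Because only the first generated token is touched, the same trained model can serve users with different refusal preferences by changing `bias` alone.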

Title: InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention

Authors: Howard Zhang, Yuval Alaluf, Sizhuo Ma, Achuta Kadambi, Jian Wang, Kfir Aberman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06753
Pdf URL: https://arxiv.org/pdf/2412.06753
Copy Paste: [[2412.06753]] InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention(https://arxiv.org/abs/2412.06753)
Keywords: restoration
Abstract: Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.
摘要：人脸图像恢复旨在增强退化的面部图像，同时解决诸如多种退化类型、实时处理需求以及最重要的身份特定特征的保存等挑战。现有方法通常存在处理时间慢和恢复效果不佳的问题，尤其是在严重退化的情况下，无法准确重建更精细的身份细节。为了解决这些问题，我们引入了 InstantRestore，这是一个新颖的框架，它利用单步图像扩散模型和注意力共享机制来实现快速和个性化的人脸恢复。此外，InstantRestore 还采用了一种新颖的地标注意力损失，对齐关键面部地标以细化注意力图，增强身份保存。在推理时，给定退化的输入和一小组（约 4 个）参考图像，InstantRestore 通过网络执行一次前向传递以实现近乎实时的性能。与以前依赖完整扩散过程或每个身份模型调整的方法不同，InstantRestore 提供了适用于大规模应用的可扩展解决方案。大量实验表明，InstantRestore 在质量和速度上优于现有方法，使其成为身份保存面部恢复的理想选择。

Title: Ranking-aware adapter for text-driven image ordering with CLIP

Authors: Wei-Hsiang Yu, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06760
Pdf URL: https://arxiv.org/pdf/2412.06760
Copy Paste: [[2412.06760]] Ranking-aware adapter for text-driven image ordering with CLIP(https://arxiv.org/abs/2412.06760)
Keywords: quality assessment
Abstract: Recent advances in vision-language models (VLMs) have made significant progress in downstream tasks that require quantitative concepts such as facial age estimation and image quality assessment, enabling VLMs to explore applications like image ranking and retrieval. However, existing studies typically focus on the reasoning based on a single image and heavily depend on text prompting, limiting their ability to learn comprehensive understanding from multiple images. To address this, we propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task and introduces a lightweight adapter to augment CLIP for text-guided image ranking. Specifically, our approach incorporates learnable prompts to adapt to new instructions for ranking purposes and an auxiliary branch with ranking-aware attention, leveraging text-conditioned visual differences for additional supervision in image ranking. Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks and achieves competitive results compared to state-of-the-art models designed for specific tasks like facial age estimation and image quality assessment. Overall, our approach primarily focuses on ranking images with a single instruction, which provides a natural and generalized way of learning from visual differences across images, bypassing the need for extensive text prompts tailored to individual tasks. Code is available: this https URL.
摘要：视觉语言模型 (VLM) 的最新进展在需要定量概念（例如面部年龄估计和图像质量评估）的下游任务中取得了重大进展，使 VLM 能够探索图像排名和检索等应用。然而，现有的研究通常侧重于基于单个图像的推理，并且严重依赖文本提示，限制了它们从多幅图像中学习全面理解的能力。为了解决这个问题，我们提出了一种有效而高效的方法，将 CLIP 模型重新定义为学习排名任务，并引入一个轻量级适配器来增强 CLIP 以进行文本引导的图像排名。具体来说，我们的方法结合了可学习的提示以适应排名目的的新指令，以及具有排名感知注意力的辅助分支，利用文本条件的视觉差异在图像排名中进行额外监督。我们的排名感知适配器在各种任务上的表现始终优于经过微调的 CLIP，并且与为面部年龄估计和图像质量评估等特定任务设计的最先进的模型相比，取得了具有竞争力的结果。总体而言，我们的方法主要侧重于使用单一指令对图像进行排序，这提供了一种从图像之间的视觉差异中学习的自然而通用的方法，无需针对单个任务定制大量文本提示。代码可用：此 https URL。

Title: Visual Lexicon: Rich Image Features in Language Space

Authors: XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06774
Pdf URL: https://arxiv.org/pdf/2412.06774
Copy Paste: [[2412.06774]] Visual Lexicon: Rich Image Features in Language Space(https://arxiv.org/abs/2412.06774)
Keywords: generation
Abstract: We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings--even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.
摘要：我们提出了一种新颖的视觉语言——Visual Lexicon，它将丰富的图像信息编码到词汇标记的文本空间中，同时保留通常难以用自然语言传达的复杂视觉细节。与优先考虑高级语义（例如 CLIP）或像素级重建（例如 VAE）的传统方法不同，ViLex 可以同时捕获丰富的语义内容和精细的视觉细节，从而实现高质量的图像生成和全面的视觉场景理解。通过自监督学习管道，ViLex 使用冻结的文本到图像 (T2I) 扩散模型生成针对重建输入图像进行优化的标记，从而保留高保真语义级重建所需的详细信息。作为语言空间中的图像嵌入，ViLex 标记利用自然语言的组合性，允许它们独立用作“文本标记”或与自然语言标记结合使用，以使用视觉和文本输入提示预训练的 T2I 模型，从而反映我们与视觉语言模型 (VLM) 交互的方式。实验表明，与文本嵌入相比，ViLex 在图像重建方面实现了更高的保真度——即使使用单个 ViLex 标记也是如此。此外，ViLex 成功地以零样本、无监督的方式执行了各种 DreamBooth 任务，而无需微调 T2I 模型。此外，ViLex 可充当强大的视觉编码器，相对于强大的 SigLIP 基线，它在 15 个基准测试中持续提高视觉语言模型的性能。

Title: Diverse Score Distillation

Authors: Yanbo Xu, Jayanth Srinivasa, Gaowen Liu, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06780
Pdf URL: https://arxiv.org/pdf/2412.06780
Copy Paste: [[2412.06780]] Diverse Score Distillation(https://arxiv.org/abs/2412.06780)
Keywords: generation
Abstract: Score distillation of 2D diffusion models has proven to be a powerful mechanism to guide 3D optimization, for example enabling text-based 3D generation or single-view reconstruction. A common limitation of existing score distillation formulations, however, is that the outputs of the (mode-seeking) optimization are limited in diversity despite the underlying diffusion model being capable of generating diverse samples. In this work, inspired by the sampling process in denoising diffusion, we propose a score formulation that guides the optimization to follow generation paths defined by random initial seeds, thus ensuring diversity. We then present an approximation to adopt this formulation for scenarios where the optimization may not precisely follow the generation paths (e.g. a 3D representation whose renderings evolve in a co-dependent manner). We showcase the applications of our `Diverse Score Distillation' (DSD) formulation across tasks such as 2D optimization, text-based 3D inference, and single-view reconstruction. We also empirically validate DSD against prior score distillation formulations and show that it significantly improves sample diversity while preserving fidelity.
摘要：二维扩散模型的分数蒸馏已被证明是一种指导三维优化的强大机制，例如实现基于文本的三维生成或单视图重建。然而，现有分数蒸馏公式的一个常见限制是，尽管底层扩散模型能够生成多样化的样本，但（模式搜索）优化的输出在多样性方面受到限制。在这项工作中，受去噪扩散中的采样过程的启发，我们提出了一个分数公式，该公式指导优化遵循由随机初始种子定义的生成路径，从而确保多样性。然后，我们提出了一种近似方法，以在优化可能不精确遵循生成路径的场景中采用此公式（例如，渲染以相互依赖的方式演变的三维表示）。我们展示了我们的“多样化分数蒸馏”（DSD）公式在二维优化、基于文本的三维推理和单视图重建等任务中的应用。我们还根据之前的分数蒸馏公式对 DSD 进行了实证验证，并表明它在保持保真度的同时显着提高了样本多样性。

Title: Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Authors: Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.06781
Pdf URL: https://arxiv.org/pdf/2412.06781
Copy Paste: [[2412.06781]] Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation(https://arxiv.org/abs/2412.06781)
Keywords: generative
Abstract: Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.
摘要：全球视觉地理定位可预测图像在地球上的拍摄位置。由于图像在定位精度方面各不相同，因此这项任务本身就具有相当大的模糊性。然而，现有的方法是确定性的，忽略了这一点。在本文中，我们旨在缩小传统地理定位和现代生成方法之间的差距。我们提出了第一种基于扩散和黎曼流匹配的生成地理定位方法，其中去噪过程直接在地球表面进行。我们的模型在三个视觉地理定位基准上实现了最先进的性能：OpenStreetView-5M、YFCC-100M 和 iNat21。此外，我们引入了概率视觉地理定位任务，其中模型预测所有可能位置的概率分布，而不是单个点。我们为这项任务引入了新的指标和基线，展示了我们基于扩散的方法的优势。代码和模型将提供。
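
Operating "directly on the Earth's surface" means intermediate points of the generative path must stay on the sphere. A common building block for flow matching on a sphere is geodesic (great-circle) interpolation between a noise point and a target location; the sketch below shows only that interpolation, under the assumption of a unit sphere, and is not the authors' model.

```python
import math

def latlon_to_unit(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def geodesic_interpolate(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1 (unit vectors)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)            # angle between the two points
    if omega < 1e-12:
        return x0                     # points coincide
    if math.pi - omega < 1e-12:
        raise ValueError("geodesic is not unique for antipodal points")
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return tuple(w0 * a + w1 * b for a, b in zip(x0, x1))

start = latlon_to_unit(0.0, 0.0)
end = latlon_to_unit(0.0, 90.0)
mid = geodesic_interpolate(start, end, 0.5)   # halfway along the equator
```

Linear interpolation in 3D would cut through the Earth's interior; the great-circle form keeps every intermediate point at unit norm.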

Title: Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation

Authors: Ruihan Gao, Kangle Deng, Gengshan Yang, Wenzhen Yuan, Jun-Yan Zhu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.06785
Pdf URL: https://arxiv.org/pdf/2412.06785
Copy Paste: [[2412.06785]] Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation(https://arxiv.org/abs/2412.06785)
Keywords: generation
Abstract: 3D generation methods have shown visually compelling results powered by diffusion image priors. However, they often fail to produce realistic geometric details, resulting in overly smooth surfaces or geometric details inaccurately baked in albedo maps. To address this, we introduce a new method that incorporates touch as an additional modality to improve the geometric details of generated 3D assets. We design a lightweight 3D texture field to synthesize visual and tactile textures, guided by 2D diffusion model priors on both visual and tactile domains. We condition the visual texture generation on high-resolution tactile normals and guide the patch-based tactile texture refinement with a customized TextureDreambooth. We further present a multi-part generation pipeline that enables us to synthesize different textures across various regions. To our knowledge, we are the first to leverage high-resolution tactile sensing to enhance geometric details for 3D generation tasks. We evaluate our method in both text-to-3D and image-to-3D settings. Our experiments demonstrate that our method provides customized and realistic fine geometric textures while maintaining accurate alignment between two modalities of vision and touch.
摘要：3D 生成方法已显示出由扩散图像先验驱动的视觉上引人注目的结果。然而，它们往往无法产生逼真的几何细节，导致表面过于光滑或几何细节在反照率图中烘焙不准确。为了解决这个问题，我们引入了一种新方法，该方法结合了触觉作为一种额外的方式，以改善生成的 3D 资产的几何细节。我们设计了一个轻量级的 3D 纹理场来合成视觉和触觉纹理，由视觉和触觉域上的 2D 扩散模型先验引导。我们根据高分辨率触觉法线对视觉纹理生成进行条件处理，并使用定制的 TextureDreambooth 指导基于块的触觉纹理细化。我们进一步介绍了一个多部分生成管道，使我们能够在各个区域合成不同的纹理。据我们所知，我们是第一个利用高分辨率触觉感应来增强 3D 生成任务的几何细节的人。我们在文本到 3D 和图像到 3D 设置中评估我们的方法。我们的实验表明，我们的方法提供了定制的、逼真的精细几何纹理，同时保持了视觉和触觉两种模式之间的精确对齐。

Title: Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

Authors: M. Hamza Mughal, Rishabh Dabral, Merel C.J. Scholman, Vera Demberg, Christian Theobalt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.06786
Pdf URL: https://arxiv.org/pdf/2412.06786
Copy Paste: [[2412.06786]] Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis(https://arxiv.org/abs/2412.06786)
Keywords: generation
Abstract: Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for the existing neural systems that can generate rhythmic beat gestures, but struggle to produce semantically meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we then inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time without any need for training. Further, we propose a control paradigm for guidance that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.
摘要：非语言交流通常包括语义丰富的手势，有助于传达话语的含义。对于现有的神经系统来说，产生这种语义的同语手势一直是一个重大挑战，因为现有的神经系统可以产生有节奏的节拍手势，但很难产生语义上有意义的手势。因此，我们提出了 RAG-Gesture，这是一种基于扩散的手势生成方法，利用检索增强生成 (RAG) 来产生看起来自然且语义丰富的手势。我们的神经显式手势生成方法旨在产生基于可解释语言知识的语义手势。我们通过使用显式领域知识从同语手势数据库中检索示例动作来实现这一点。检索后，我们随后在推理时使用 DDIM 反转和检索指导将这些语义示例手势注入我们基于扩散的手势生成管道中，而无需任何训练。此外，我们提出了一种指导控制范例，允许用户调节每次检索插入对生成序列的影响量。我们的比较评估证明了我们的方法相对于最近的手势生成方法的有效性。建议读者在我们的项目页面上探索结果。

Title: [MASK] is All You Need

Authors: Vincent Tao Hu, Björn Ommer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.06787
Pdf URL: https://arxiv.org/pdf/2412.06787
Copy Paste: [[2412.06787]] [MASK] is All You Need(https://arxiv.org/abs/2412.06787)
Keywords: generative
Abstract: In generative models, two paradigms have gained traction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design space across two types of models including timestep-independence, noise schedule, temperature, guidance strength, etc., in a scalable manner. Second, we re-cast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling by only training once to model the joint distribution. All aforementioned explorations lead to our framework named Discrete Interpolants, which enables us to achieve state-of-the-art or competitive performance compared to previous discrete-state based methods in various benchmarks, like ImageNet256, MS COCO, and video dataset FaceForensics. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks.
摘要：在生成模型中，两种范式在各种应用中都获得了关注：基于下一组预测的掩蔽生成模型和基于下一噪声预测的非自回归模型，例如扩散模型。在这项工作中，我们建议使用离散状态模型来连接它们并探索它们在视觉领域的可扩展性。首先，我们以可扩展的方式在统一的设计空间中对两种类型的模型（包括时间步长独立性、噪声时间表、温度、指导强度等）进行逐步分析。其次，我们将典型的判别任务（例如图像分割）重塑为离散状态模型上从 [MASK] 标记中揭示掩蔽的过程。这使我们能够执行各种采样过程，包括仅通过一次训练来建模联合分布的灵活条件采样。所有上述探索都促成了我们名为“离散插值”的框架，与之前基于离散状态的方法相比，该框架使我们能够在各种基准测试（如 ImageNet256、MS COCO 和视频数据集 FaceForensics）中实现最先进或最具竞争力的性能。总之，通过在离散状态模型中利用 [MASK]，我们可以连接掩蔽生成和非自回归扩散模型，以及生成和判别任务。