2024-12-05

Title: DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America

Authors: Daniel Seal, Rossella Arcucci, Salva Rühling-Cachay, César Quilodrán-Casas
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02723
Pdf URL: https://arxiv.org/pdf/2412.02723
Copy Paste: [[2412.02723]] DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America(https://arxiv.org/abs/2412.02723)
Keywords: generative
Abstract: Climate change is increasing the frequency of extreme precipitation events, making weather disasters such as flooding and landslides more likely. The ability to accurately nowcast precipitation is therefore becoming more critical for safeguarding society by providing immediate, accurate information to decision makers. Motivated by the recent success of generative models at precipitation nowcasting, this paper: extends the DYffusion framework to this task and evaluates its performance at forecasting IMERG satellite precipitation data up to a 4-hour horizon; modifies the DYffusion framework to improve its ability to model rainfall data; and introduces a novel loss function that combines MSE, MAE and the LPIPS perceptual score. In a quantitative evaluation of forecasts up to a 4-hour horizon, the modified DYffusion framework trained with the novel loss outperforms four competitor models. It has the highest CSI scores for weak, moderate, and heavy rain thresholds and retains an LPIPS score $<$ 0.2 for the entire roll-out, degrading the least as lead-time increases. The proposed nowcasting model demonstrates visually stable and sharp forecasts up to a 2-hour horizon on a heavy rain case study. Code is available at this https URL.
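Example (illustrative sketch, not from the paper): the abstract reports CSI scores at weak, moderate, and heavy rain thresholds. CSI (Critical Success Index) is the standard ratio hits / (hits + misses + false alarms), computed after thresholding forecast and observation. The function name `csi`, the sample rain rates, and the thresholds below are hypothetical; the paper's actual threshold values are not given in the abstract.

```python
import math


def csi(forecast, observed, threshold):
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event = f >= threshold  # forecast says "rain at or above threshold"
        o_event = o >= threshold  # observation says the same
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    total = hits + misses + false_alarms
    # With no events in either field the score is undefined.
    return hits / total if total else math.nan


# Hypothetical rain rates in mm/h for five pixels (not data from the paper).
forecast = [0.0, 1.2, 3.0, 0.4, 6.0]
observed = [0.0, 0.8, 2.5, 1.5, 7.0]
score_light = csi(forecast, observed, threshold=1.0)
score_heavy = csi(forecast, observed, threshold=5.0)
```

Correct negatives (pixels where neither field exceeds the threshold) do not enter the score, which is why CSI is commonly used for rare events such as heavy rain.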

Title: WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks

Authors: Rajat Shinde, Christopher E. Phillips, Kumar Ankur, Aman Gupta, Simon Pfreundschuh, Sujit Roy, Sheyenne Kirkland, Vishal Gaur, Amy Lin, Aditi Sheshadri, Udaysankar Nair, Manil Maskey, Rahul Ramachandran
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02780
Pdf URL: https://arxiv.org/pdf/2412.02780
Copy Paste: [[2412.02780]] WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks(https://arxiv.org/abs/2412.02780)
Keywords: generation
Abstract: High-quality machine learning (ML)-ready datasets play a foundational role in developing new artificial intelligence (AI) models or fine-tuning existing models for scientific applications such as weather and climate analysis. Unfortunately, despite the growing development of new deep learning models for weather and climate, there is a scarcity of curated, pre-processed ML-ready datasets. Curating such high-quality datasets for developing new models is challenging, particularly because the modality of the input data varies significantly for different downstream tasks addressing different atmospheric scales (spatial and temporal). Here we introduce WxC-Bench (Weather and Climate Bench), a multi-modal dataset designed to support the development of generalizable AI models for downstream use-cases in weather and climate research. WxC-Bench is designed as a dataset of datasets for developing ML models for a complex weather and climate system, framing selected downstream tasks as machine learning problems. WxC-Bench encompasses several atmospheric processes from meso-$\beta$ (20 - 200 km) scale to synoptic scales (2500 km), such as aviation turbulence, hurricane intensity and track monitoring, weather analog search, gravity wave parameterization, and natural language report generation. We provide a comprehensive description of the dataset and also present a technical validation for baseline analysis. The dataset and code to prepare the ML-ready data have been made publicly available on Hugging Face -- this https URL

Title: Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation

Authors: Raphael Ruschel, Md Awsafur Rahman, Hardik Prajapati, Suya You, B. S. Manjuanth
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02808
Pdf URL: https://arxiv.org/pdf/2412.02808
Copy Paste: [[2412.02808]] Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation(https://arxiv.org/abs/2412.02808)
Keywords: generation
Abstract: Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets, temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.

Title: FLAME 3 Dataset: Unleashing the Power of Radiometric Thermal UAV Imagery for Wildfire Management

Authors: Bryce Hopkins, Leo ONeill, Michael Marinaccio, Eric Rowell, Russell Parsons, Sarah Flanary, Irtija Nazim, Carl Seielstad, Fatemeh Afghah
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02831
Pdf URL: https://arxiv.org/pdf/2412.02831
Copy Paste: [[2412.02831]] FLAME 3 Dataset: Unleashing the Power of Radiometric Thermal UAV Imagery for Wildfire Management(https://arxiv.org/abs/2412.02831)
Keywords: generation
Abstract: The increasing accessibility of radiometric thermal imaging sensors for unmanned aerial vehicles (UAVs) offers significant potential for advancing AI-driven aerial wildfire management. Radiometric imaging provides per-pixel temperature estimates, a valuable improvement over non-radiometric data that requires irradiance measurements to be converted into visible images using RGB color palettes. Despite its benefits, this technology has been underutilized largely due to a lack of available data for researchers. This study addresses this gap by introducing methods for collecting and processing synchronized visual spectrum and radiometric thermal imagery using UAVs at prescribed fires. The included imagery processing pipeline drastically simplifies and partially automates each step from data collection to neural network input. Further, we present the FLAME 3 dataset, the first comprehensive collection of side-by-side visual spectrum and radiometric thermal imagery of wildland fires. Building on our previous FLAME 1 and FLAME 2 datasets, FLAME 3 includes radiometric thermal Tag Image File Format (TIFFs) and nadir thermal plots, providing a new data type and collection method. This dataset aims to spur a new generation of machine learning models utilizing radiometric thermal imagery, potentially trivializing tasks such as aerial wildfire detection, segmentation, and assessment. A single-burn subset of FLAME 3 for computer vision applications is available on Kaggle with the full 6 burn set available to readers upon request.

Title: GUESS: Generative Uncertainty Ensemble for Self Supervision

Authors: Salman Mohamadi, Gianfranco Doretto, Donald A. Adjeroh
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.02896
Pdf URL: https://arxiv.org/pdf/2412.02896
Copy Paste: [[2412.02896]] GUESS: Generative Uncertainty Ensemble for Self Supervision(https://arxiv.org/abs/2412.02896)
Keywords: generative
Abstract: Self-supervised learning (SSL) frameworks consist of a pretext task and a loss function that aim to learn useful general features from unlabeled data. The basic idea of most SSL baselines revolves around enforcing invariance to a variety of data augmentations via the loss function. However, one main issue is that inattentive or deterministic enforcement of invariance to any kind of data augmentation is generally not only inefficient but also potentially detrimental to performance on downstream tasks. In this work, we investigate the issue from the viewpoint of uncertainty in invariance representation. Uncertainty representation is fairly under-explored in the design of SSL architectures as well as loss functions. We incorporate uncertainty representation in both the loss function and the architecture design, aiming for more data-dependent invariance enforcement. The former is represented in the form of data-derived uncertainty in the SSL loss function, resulting in a generative-discriminative loss function. The latter is achieved by feeding slightly different distorted versions of samples to the ensemble, aiming to learn better and more robust representations. Specifically, building upon recent methods that use hard and soft whitening (a.k.a. redundancy reduction), we introduce a new approach, GUESS, a pseudo-whitening framework composed of controlled uncertainty injection, a new architecture, and a new loss function. We include detailed results and ablation analysis establishing GUESS as a new baseline.

Title: Panoptic Diffusion Models: co-generation of images and segmentation maps

Authors: Yinghan Long, Kaushik Roy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02929
Pdf URL: https://arxiv.org/pdf/2412.02929
Copy Paste: [[2412.02929]] Panoptic Diffusion Models: co-generation of images and segmentation maps(https://arxiv.org/abs/2412.02929)
Keywords: generation
Abstract: Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.

Title: Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution

Authors: Jiahua Xiao, Jiawei Zhang, Dongqing Zou, Xiaodan Zhang, Jimmy Ren, Xing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02960
Pdf URL: https://arxiv.org/pdf/2412.02960
Copy Paste: [[2412.02960]] Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution(https://arxiv.org/abs/2412.02960)
Keywords: restoration, super-resolution
Abstract: Real-world image super-resolution (Real-ISR) has achieved a remarkable leap by leveraging large-scale text-to-image models, enabling realistic image restoration from given recognition textual prompts. However, these methods sometimes fail to recognize some salient objects, resulting in inaccurate semantic restoration in these regions. Additionally, the same region may have a strong response to more than one prompt and it will lead to semantic ambiguity for image super-resolution. To alleviate the above two issues, in this paper, we propose to consider semantic segmentation as an additional control condition into diffusion-based image super-resolution. Compared to textual prompt conditions, semantic segmentation enables a more comprehensive perception of salient objects within an image by assigning class labels to each pixel. It also mitigates the risks of semantic ambiguities by explicitly allocating objects to their respective spatial regions. In practice, inspired by the fact that image super-resolution and segmentation can benefit each other, we propose SegSR which introduces a dual-diffusion framework to facilitate interaction between the image super-resolution and segmentation diffusion models. Specifically, we develop a Dual-Modality Bridge module to enable updated information flow between these two diffusion models, achieving mutual benefit during the reverse diffusion process. Extensive experiments show that SegSR can generate realistic images while preserving semantic structures more effectively.

Title: Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

Authors: XiuYu Zhang, Zening Luo, Michelle E. Lu
Subjects: cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2412.02962
Pdf URL: https://arxiv.org/pdf/2412.02962
Copy Paste: [[2412.02962]] Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference(https://arxiv.org/abs/2412.02962)
Keywords: generation
Abstract: Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases. The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations and thus have become a significant bottleneck. This limitation is more salient for applications that are interactive in nature or require small latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step. PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices. As a result, PCPP decreases the communication cost by around $70\%$ compared to DistriFusion (the state of the art implementation of PP) and achieves $2.36\sim 8.02\times$ inference speed-up using $4\sim 8$ GPUs compared to $2.32\sim 6.71\times$ achieved by DistriFusion depending on the computing device configuration and resolution of generation at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.
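Example (illustrative sketch, not the authors' implementation): the abstract contrasts patch parallelism, where each device conditions on the entire previous-step image, with PCPP, where each device conditions only on parts of the neighboring patches. A toy way to see why this reduces communication is to count the image rows each device must receive per diffusion step. The row-wise split, the `halo` width, and both function names are assumptions made for this sketch; the count is not meant to reproduce the roughly 70% figure reported in the abstract.

```python
def rows_received_full(total_rows, num_devices):
    """Full conditioning: every device receives all rows it does not own."""
    patch = total_rows // num_devices  # assumes an even split
    return num_devices * (total_rows - patch)


def rows_received_partial(total_rows, num_devices, halo):
    """Partial conditioning (assumed form): each device receives only
    `halo` rows from each directly adjacent patch."""
    patch = total_rows // num_devices
    halo = min(halo, patch)  # cannot receive more than a neighbor owns
    # Each of the (num_devices - 1) internal boundaries is crossed in
    # both directions, once per neighbor.
    return 2 * (num_devices - 1) * halo


# Hypothetical configuration: 64 latent rows split across 4 devices.
full = rows_received_full(64, 4)
partial = rows_received_partial(64, 4, halo=4)
reduction = 1 - partial / full
```

In this toy setting the saving grows with the number of devices, since full conditioning scales with the whole image while the assumed halo exchange scales only with the number of patch boundaries.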

Title: MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

Authors: Arnold Caleb Asiimwe, Dídac Surís, Pranav Rajpurkar, Carl Vondrick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02971
Pdf URL: https://arxiv.org/pdf/2412.02971
Copy Paste: [[2412.02971]] MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting(https://arxiv.org/abs/2412.02971)
Keywords: generation
Abstract: In medical reporting, the accuracy of radiological reports, whether generated by humans or machine learning algorithms, is critical. We tackle a new task in this paper: image-conditioned autocorrection of inaccuracies within these reports. Using the MIMIC-CXR dataset, we first intentionally introduce a diverse range of errors into reports. Subsequently, we propose a two-stage framework capable of pinpointing these errors and then making corrections, simulating an autocorrection process. This method aims to address the shortcomings of existing automated medical reporting systems, like factual errors and incorrect conclusions, enhancing report reliability in vital healthcare applications. Importantly, our approach could serve as a guardrail, ensuring the accuracy and trustworthiness of automated report generation. Experiments on established datasets and state of the art report generation models validate this method's potential in correcting medical reporting errors.

Title: Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Authors: Alex Havrilla, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, Pawan Sasanka Ammanamanchi, Elliot Meyerson
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.02980
Pdf URL: https://arxiv.org/pdf/2412.02980
Copy Paste: [[2412.02980]] Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models(https://arxiv.org/abs/2412.02980)
Keywords: generation
Abstract: Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity trade-offs in training data and the downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, often there exist trade-offs between model output quality and output diversity which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.

Title: EchoONE: Segmenting Multiple echocardiography Planes in One Model

Authors: Jiongtong Hu, Wei Zhuo, Jun Cheng, Yingying Liu, Wufeng Xue, Dong Ni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02993
Pdf URL: https://arxiv.org/pdf/2412.02993
Copy Paste: [[2412.02993]] EchoONE: Segmenting Multiple echocardiography Planes in One Model(https://arxiv.org/abs/2412.02993)
Keywords: generation
Abstract: In the clinical practice of echocardiography examinations, multiple planes containing heart structures from different views are usually required for the screening, diagnosis, and treatment of cardiac disease. AI models for echocardiography have to be tailored to each specific plane because of the dramatic structural differences, resulting in repeated development and extra complexity. An effective solution to this multi-plane segmentation (MPS) problem is in high demand for medical images, yet it has not been well investigated. In this paper, we propose a novel solution, EchoONE, for this problem, with a SAM-based segmentation architecture, a prior-composable mask learning (PC-Mask) module for semantic-aware dense prompt generation, and a learnable CNN branch with a simple yet effective local feature fusion and adaption (LFFA) module for SAM adaptation. We extensively evaluated our method on multiple internal and external echocardiography datasets and achieved consistently state-of-the-art performance on multi-source datasets with different heart planes. This is the first time that the MPS problem has been solved in one model for echocardiography data. The code will be available at this https URL.

Title: CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D Design Datasets

Authors: XiuYu Zhang, Xiaolei Ye, Jui-Che Chang, Yue Fang
Subjects: cs.CV, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2412.02996
Pdf URL: https://arxiv.org/pdf/2412.02996
Copy Paste: [[2412.02996]] CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D Design Datasets(https://arxiv.org/abs/2412.02996)
Keywords: generative
Abstract: Three-dimensional (3D) objects have wide applications. Despite the growing interest in 3D modeling in academia and industry, designing and/or creating 3D objects from scratch remains time-consuming and challenging. With the development of generative artificial intelligence (AI), designers have discovered a new way to create images for ideation. However, generative AIs are less useful for creating 3D objects of satisfying quality. To allow 3D designers to access a wide range of 3D objects for creative activities based on their specific demands, we propose a machine learning (ML) enhanced framework, CLAS, named after its four steps of capture, label, associate, and search, to enable fully automatic retrieval of 3D objects based on user specifications, leveraging existing datasets of 3D objects. CLAS provides an effective and efficient method for any person or organization to benefit from their existing but unused 3D datasets. In addition, CLAS may also be used to produce high-quality 3D object synthesis datasets for training and evaluating 3D generative models. As a proof of concept, we created and showcased a search system with a web user interface (UI) for retrieving 6,778 3D objects of chairs in the ShapeNet dataset powered by CLAS. In a closed-set retrieval setting, our retrieval method achieves a mean reciprocal rank (MRR) of 0.58, top-1 accuracy of 42.27%, and top-10 accuracy of 89.64%.
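Example (illustrative sketch, not from the paper): the abstract reports mean reciprocal rank (MRR) and top-k accuracy for retrieval. Both follow directly from the rank at which the correct object appears for each query. The function names and the sample ranks below are hypothetical and are not the paper's data.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the
    correct item in each query's result list."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_k_accuracy(ranks, k):
    """Fraction of queries whose correct item appears in the first k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct chair for four text queries.
ranks = [1, 3, 2, 12]
mrr = mean_reciprocal_rank(ranks)
top1 = top_k_accuracy(ranks, 1)
top10 = top_k_accuracy(ranks, 10)
```

MRR rewards placing the correct item near the top even when it is not first, which is why it can sit between the top-1 and top-10 accuracies, as it does in the figures quoted in the abstract.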

Title: AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

Authors: Shouwei Ruan, Hanqin Liu, Yao Huang, Xiaoqi Wang, Caixin Kang, Hang Su, Yinpeng Dong, Xingxing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03002
Pdf URL: https://arxiv.org/pdf/2412.03002
Copy Paste: [[2412.03002]] AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?(https://arxiv.org/abs/2412.03002)
Keywords: generative
Abstract: Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework that generates physically reproducible adversarial 3D transformation (Adv-3DT) samples from single-view images. AdvDreamer integrates advanced generative techniques with two key innovations and aims to characterize the worst-case distributions of 3D variations from natural images. To ensure adversarial effectiveness and method generality, we introduce an Inverse Semantic Probability Objective that executes adversarial optimization on fundamental vision-text alignment spaces, which can be generalizable across different VLM architectures and downstream tasks. To mitigate the distribution discrepancy between generated and real-world samples while maintaining physical reproducibility, we design a Naturalness Reward Model that provides regularization feedback during adversarial optimization, preventing convergence towards hallucinated and unnatural elements. Leveraging AdvDreamer, we establish MM3DTBench, the first VQA dataset for benchmarking VLMs' 3D variations robustness. Extensive evaluations on representative VLMs with diverse architectures highlight that 3D variations in the real world may pose severe threats to model performance across various tasks.

Title: Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations

Authors: Yu Feng, Shunsi Zhang, Jian Shu, Hanfeng Zhao, Guoliang Pang, Chi Zhang, Hao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03011
Pdf URL: https://arxiv.org/pdf/2412.03011
Copy Paste: [[2412.03011]] Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations(https://arxiv.org/abs/2412.03011)
Keywords: restoration, generation
Abstract: Generating multi-view human images from a single view is a complex and significant challenge. Although recent advancements in multi-view object generation have shown impressive results with diffusion models, novel view synthesis for humans remains constrained by the limited availability of 3D human datasets. Consequently, many existing models struggle to produce realistic human body shapes or capture fine-grained facial details accurately. To address these issues, we propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis. Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation, aiming to extend the 2D knowledge of the single-view model to a multi-view diffusion model. Additionally, to enhance the model's detail restoration capability, we integrate transferred multimodal facial features into our trained human diffusion model. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.

Title: Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

Authors: Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03017
Pdf URL: https://arxiv.org/pdf/2412.03017
Copy Paste: [[2412.03017]] Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach(https://arxiv.org/abs/2412.03017)
Keywords: super-resolution
Abstract: Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process and struggle to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences for SR results, so there is a need for an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained Stable Diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the $\ell_2$-loss for pixel-level regression, and the other is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, PiSA-SR can offer flexible SR results according to user preference without re-training. Code and models can be found at this https URL.
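The idea of two LoRA modules with separate inference-time scales can be illustrated with a small sketch. This is not the PiSA-SR implementation: the additive merge rule, the function names, and the toy matrices are all assumptions made for illustration, and the paper's actual guidance rule may differ.

```python
# Hypothetical sketch: merging two LoRA updates with adjustable scales.
# The additive rule and every name below are assumptions, not PiSA-SR code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w0, pixel_lora, semantic_lora, pixel_scale=1.0, semantic_scale=1.0):
    """Return w0 + pixel_scale * (B_p @ A_p) + semantic_scale * (B_s @ A_s)."""
    delta_pixel = matmul(*pixel_lora)        # (B, A) pair -> low-rank update B @ A
    delta_semantic = matmul(*semantic_lora)
    return [
        [w + pixel_scale * dp + semantic_scale * ds
         for w, dp, ds in zip(w_row, p_row, s_row)]
        for w_row, p_row, s_row in zip(w0, delta_pixel, delta_semantic)
    ]

# Toy 2x2 base weight and two rank-1 updates stored as (B, A) pairs.
w0 = [[1.0, 0.0], [0.0, 1.0]]
pixel_lora = ([[1.0], [0.0]], [[0.5, 0.5]])       # B is 2x1, A is 1x2
semantic_lora = ([[0.0], [1.0]], [[1.0, -1.0]])

# Semantic scale 0 keeps only the pixel-level update; 0.5 blends both.
fidelity_weights = merge_lora(w0, pixel_lora, semantic_lora, 1.0, 0.0)
balanced_weights = merge_lora(w0, pixel_lora, semantic_lora, 1.0, 0.5)
```

Because the scales are applied only when the weights are merged, changing them requires no re-training, which is the property the abstract highlights.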

Title: UTSD: Unified Time Series Diffusion Model

Authors: Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03068
Pdf URL: https://arxiv.org/pdf/2412.03068
Copy Paste: [[2412.03068]] UTSD: Unified Time Series Diffusion Model(https://arxiv.org/abs/2412.03068)
Keywords: generation
Abstract: Transformer-based architectures have achieved unprecedented success in time series analysis. However, when faced with cross-domain modeling, existing studies rely on statistical priors, because prompt engineering fails under the large distribution shift between domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful probability distribution modeling ability of diffusion. Unlike autoregressive models, which capture the conditional probability of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly through conditional sampling. The proposed UTSD contains three pivotal designs: (1) a condition network that captures the multi-scale fluctuation patterns from the observation sequence, which are used as context representations to guide the denoising network in generating the prediction sequence; (2) an adapter-based fine-tuning strategy, in which the multi-domain universal representation learned in the pretraining stage is used for downstream tasks in target domains; (3) a diffusion and denoising process on the actual sequence space, which, combined with improved classifier-free guidance as the conditional generation strategy, greatly improves the stability and accuracy of the downstream task. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. The empirical results validate the potential of UTSD as a time series foundational model.
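For readers unfamiliar with classifier-free guidance, the sketch below shows one common parameterization of the standard technique, in which the unconditional noise estimate is pushed toward the conditional one. It does not reproduce the "improved" variant the abstract refers to; the function name and the toy values are assumptions.

```python
# Standard classifier-free guidance in one common parameterization.
# This is NOT UTSD's improved variant; names and values are illustrative.

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine noise estimates: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise estimates for a length-3 prediction sequence.
eps_uncond = [0.0, 1.0, 2.0]   # denoiser output without the condition
eps_cond = [1.0, 1.0, 0.0]     # denoiser output given the observed history

guided = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```

A scale of 1 recovers the conditional estimate, a scale of 0 recovers the unconditional one, and larger scales extrapolate further in the direction indicated by the condition.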

Title: TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Authors: Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03069
Pdf URL: https://arxiv.org/pdf/2412.03069
Copy Paste: [[2412.03069]] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation(https://arxiv.org/abs/2412.03069)
Keywords: generation
Abstract: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.
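A dual codebook with shared indices can be pictured as two lookup tables addressed by the same integer. The sketch below is a minimal illustration under an assumed selection rule (a weighted sum of squared distances in the two feature spaces); the function names, the `pixel_weight` parameter, and the toy codebooks are hypothetical and not taken from the TokenFlow code.

```python
# Hypothetical sketch of a shared-index lookup over two codebooks.
# The weighted-sum selection rule and all names are assumptions.

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def shared_index(semantic_feat, pixel_feat, semantic_codebook, pixel_codebook,
                 pixel_weight=1.0):
    """Pick the single index whose paired entries best match both features."""
    costs = [
        squared_distance(semantic_feat, s) + pixel_weight * squared_distance(pixel_feat, p)
        for s, p in zip(semantic_codebook, pixel_codebook)
    ]
    return min(range(len(costs)), key=costs.__getitem__)

# Entry i of each codebook is addressed by the same index i.
semantic_codebook = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
pixel_codebook = [[0.0], [5.0], [1.0]]

index = shared_index([1.0, 0.9], [1.2], semantic_codebook, pixel_codebook)
semantic_code, pixel_code = semantic_codebook[index], pixel_codebook[index]
```

One index thus yields both a semantic code for understanding and a pixel-level code for generation. In this toy example, setting `pixel_weight=0.0` changes the chosen entry, which shows how the two feature spaces jointly determine the token.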

Title: Mimir: Improving Video Diffusion Models for Precise Text Understanding

Authors: Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03085
Pdf URL: https://arxiv.org/pdf/2412.03085
Copy Paste: [[2412.03085]] Mimir: Improving Video Diffusion Models for Precise Text Understanding(https://arxiv.org/abs/2412.03085)
Keywords: generation
Abstract: Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offer three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: this https URL

Title: MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction

Authors: Gangjian Zhang, Nanjie Yao, Shunsi Zhang, Hanfeng Zhao, Guoliang Pang, Jian Shu, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03103
Pdf URL: https://arxiv.org/pdf/2412.03103
Copy Paste: [[2412.03103]] MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction(https://arxiv.org/abs/2412.03103)
Keywords: generative
Abstract: This paper investigates the task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we effectively integrate the projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations to improve joint depth estimation during training, and refine coarse human wrinkles with a procedure resembling the denoising process of a diffusion model. Extensive quantitative and qualitative experiments on two out-of-distribution test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods.

Title: Few-Shot Learning with Adaptive Weight Masking in Conditional GANs

Authors: Jiacheng Hu, Zhen Qi, Jianjun Wei, Jiajing Chen, Runyuan Bao, Xinyu Qiu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03105
Pdf URL: https://arxiv.org/pdf/2412.03105
Copy Paste: [[2412.03105]] Few-Shot Learning with Adaptive Weight Masking in Conditional GANs(https://arxiv.org/abs/2412.03105)
Keywords: generative
Abstract: Deep learning has revolutionized various fields, yet its efficacy is hindered by overfitting and the requirement of extensive annotated data, particularly in few-shot learning scenarios where limited samples are available. This paper introduces a novel approach to few-shot learning by employing a Residual Weight Masking Conditional Generative Adversarial Network (RWM-CGAN) for data augmentation. The proposed model integrates residual units within the generator to enhance network depth and sample quality, coupled with a weight mask regularization technique in the discriminator to improve feature learning from small-sample categories. This method addresses the core issues of robustness and generalization in few-shot learning by providing a controlled and clear augmentation of the sample space. Extensive experiments demonstrate that RWM-CGAN not only expands the sample space effectively but also enriches the diversity and quality of generated samples, leading to significant improvements in detection and classification accuracy on public datasets. The paper contributes to the advancement of few-shot learning by offering a practical solution to the challenges posed by data scarcity and the need for rapid generalization to new tasks or categories.

Title: Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

Authors: Yijia Guo, Wenkai Huang, Yang Li, Gaolei Li, Hang Zhang, Liwen Hu, Jianhua Li, Tiejun Huang, Lei Ma
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03121
Pdf URL: https://arxiv.org/pdf/2412.03121
Copy Paste: [[2412.03121]] Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting(https://arxiv.org/abs/2412.03121)
Keywords: generation
Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe WaterGS, the first 3DGS watermarking framework that embeds 3D content in 3DGS itself without modifying any attributes of the vanilla 3DGS. To achieve this, we take a close look at spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives' opacity and the hidden Gaussian primitives' opacity. Extensive experiments indicate that WaterGS significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3× faster rendering speed, while ensuring security, robustness, and user experience. Code and data will be released at this https URL.

Title: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis

Authors: Siyoon Jin, Jisu Nam, Jiyoung Kim, Dahyun Chung, Yeong-Seok Kim, Joonhyung Park, Heonjeong Chu, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03150
Pdf URL: https://arxiv.org/pdf/2412.03150
Copy Paste: [[2412.03150]] Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis(https://arxiv.org/abs/2412.03150)
Keywords: generation
Abstract: Exemplar-based semantic image synthesis aims to generate images aligned with given semantic content while preserving the appearance of an exemplar image. Conventional structure-guidance models, such as ControlNet, are limited in that they cannot directly utilize exemplar images as input, relying instead solely on text prompts to control appearance. Recent tuning-free approaches address this limitation by transferring local appearance from the exemplar image to the synthesized image through implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, these methods face challenges when applied to content-rich scenes with significant geometric deformations, such as driving scenes. In this paper, we propose the Appearance Matching Adapter (AM-Adapter), a learnable framework that enhances cross-image matching within augmented self-attention by incorporating semantic information from segmentation maps. To effectively disentangle generation and matching processes, we adopt a stage-wise training approach. Initially, we train the structure-guidance and generation networks, followed by training the AM-Adapter while keeping the other networks frozen. During inference, we introduce an automated exemplar retrieval method to efficiently select exemplar image-segmentation pairs. Despite utilizing a limited number of learnable parameters, our method achieves state-of-the-art performance, excelling in both semantic alignment preservation and local appearance fidelity. Extensive ablation studies further validate our design choices. Code and pre-trained weights will be publicly available at this https URL.

Title: PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

Authors: Qihan Huang, Long Chan, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03177
Pdf URL: https://arxiv.org/pdf/2412.03177
Copy Paste: [[2412.03177]] PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation(https://arxiv.org/abs/2412.03177)
Keywords: generation
Abstract: Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images at test time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO, which estimates the quality of image patches within each generated image and trains the model accordingly. To this end, PatchDPO first leverages the pre-trained vision model with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experimental results demonstrate that PatchDPO significantly improves the performance of multiple pre-trained personalized generation models, and achieves state-of-the-art performance on both single-object and multi-object personalized image generation. Our code is available at this https URL.
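The weighted training step can be pictured as a per-patch loss whose sign depends on the estimated patch quality. The sketch below is only an illustration of that idea: the signed weighting `quality - threshold`, the `threshold` parameter, and the toy numbers are assumptions, not the loss defined in the PatchDPO paper.

```python
# Hypothetical sketch of a quality-weighted per-patch loss.
# The signed weight (quality - threshold) is an assumed rule for illustration.

def patch_weighted_loss(patch_errors, patch_quality, threshold=0.5):
    """Average per-patch error, weighted by how far quality is from threshold.

    patch_errors: distance between the model output and each generated patch.
    patch_quality: estimated quality of each generated patch, in [0, 1].
    Positive weights pull the model toward high-quality patches; negative
    weights push it away from low-quality ones.
    """
    weights = [q - threshold for q in patch_quality]
    return sum(w * e for w, e in zip(weights, patch_errors)) / len(patch_errors)

# Four patches: two high quality, one low quality, one neutral.
patch_errors = [0.5, 0.25, 0.125, 0.25]
patch_quality = [1.0, 0.0, 0.5, 1.0]

loss = patch_weighted_loss(patch_errors, patch_quality)
```

Under this assumed rule, a patch whose quality equals the threshold contributes nothing to the loss, so only clearly good or clearly bad patches drive the update.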

Title: Parametric Enhancement of PerceptNet: A Human-Inspired Approach for Image Quality Assessment

Authors: Jorge Vila-Tomás, Pablo Hernández-Cámara, Valero Laparra, Jesús Malo
Subjects: cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2412.03210
Pdf URL: https://arxiv.org/pdf/2412.03210
Copy Paste: [[2412.03210]] Parametric Enhancement of PerceptNet: A Human-Inspired Approach for Image Quality Assessment(https://arxiv.org/abs/2412.03210)
Keywords: quality assessment
Abstract: While deep learning models can learn human-like features at earlier levels, which suggests their utility in modeling human vision, few attempts exist to incorporate these features by design. Current approaches mostly optimize all parameters blindly, only constraining minor architectural aspects. This paper demonstrates how parametrizing neural network layers enables more biologically-plausible operations while reducing trainable parameters and improving interpretability. We constrain operations to functional forms present in human vision, optimizing only these functions' parameters rather than all convolutional tensor elements independently. We present two parametric model versions: one with hand-chosen biologically plausible parameters, and another fitted to human perception experimental data. We compare these with a non-parametric version. All models achieve comparable state-of-the-art results, with parametric versions showing orders of magnitude parameter reduction for minimal performance loss. The parametric models demonstrate improved interpretability and training behavior. Notably, the model fitted to human perception, despite biological initialization, converges to biologically incorrect results. This raises scientific questions and highlights the need for diverse evaluation methods to measure models' humanness, rather than assuming task performance correlates with human-like behavior.

Title: Semi-Supervised Transfer Boosting (SS-TrBoosting)

Authors: Lingfei Deng, Changming Zhao, Zhenbang Du, Kun Xia, Dongrui Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03212
Pdf URL: https://arxiv.org/pdf/2412.03212
Copy Paste: [[2412.03212]] Semi-Supervised Transfer Boosting (SS-TrBoosting)(https://arxiv.org/abs/2412.03212)
Keywords: generation
Abstract: Semi-supervised domain adaptation (SSDA) aims at training a high-performance model for a target domain using few labeled target data, many unlabeled target data, and plenty of auxiliary data from a source domain. Previous works in SSDA mainly focused on learning transferable representations across domains. However, it is difficult to find a feature space where the source and target domains share the same conditional probability distribution. Additionally, there is no flexible and effective strategy extending existing unsupervised domain adaptation (UDA) approaches to SSDA settings. In order to solve the above two challenges, we propose a novel fine-tuning framework, semi-supervised transfer boosting (SS-TrBoosting). Given a well-trained deep learning-based UDA or SSDA model, we use it as the initial model, generate additional base learners by boosting, and then use all of them as an ensemble. More specifically, half of the base learners are generated by supervised domain adaptation, and half by semi-supervised learning. Furthermore, for more efficient data transmission and better data privacy protection, we propose a source data generation approach to extend SS-TrBoosting to semi-supervised source-free domain adaptation (SS-SFDA). Extensive experiments showed that SS-TrBoosting can be applied to a variety of existing UDA, SSDA and SFDA approaches to further improve their performance.

Title: Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Authors: Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.03220
Pdf URL: https://arxiv.org/pdf/2412.03220
Copy Paste: [[2412.03220]] Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges(https://arxiv.org/abs/2412.03220)
Keywords: generation
Abstract: Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural networks, often encompassing dozens of neural network layers and containing billions to trillions of parameters. They are typically trained on vast datasets, utilizing architectures based on transformer blocks. Present-day LLMs are multi-functional, capable of performing a range of tasks from text generation and language translation to question answering, as well as code generation and analysis. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities, including images, audio, and video. This enhancement empowers MLLMs with capabilities like video editing, image comprehension, and captioning for visual content. This survey provides a comprehensive overview of the recent advancements in LLMs. We begin by tracing the evolution of LLMs and subsequently delve into the advent and nuances of MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical features, strengths, and limitations. Additionally, we present a comparative analysis of these models and discuss their challenges, potential limitations, and prospects for future development.

Title: MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers

Authors: Xiaohe Ma, Valentin Deschaintre, Miloš Hašan, Fujun Luan, Kun Zhou, Hongzhi Wu, Yiwei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03225
Pdf URL: https://arxiv.org/pdf/2412.03225
Copy Paste: [[2412.03225]] MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers(https://arxiv.org/abs/2412.03225)
Keywords: generation
Abstract: High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.

Title: Task-driven Image Fusion with Learnable Fusion Loss

Authors: Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Yichen Wu, Lilun Deng, Yukun Cui, Tao Feng, Shuang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03240
Pdf URL: https://arxiv.org/pdf/2412.03240
Copy Paste: [[2412.03240]] Task-driven Image Fusion with Learnable Fusion Loss(https://arxiv.org/abs/2412.03240)
Keywords: generation
Abstract: Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual characteristics compared to any single source, often enhancing downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the loss of downstream tasks in a meta-learning manner. The learning objective is to minimize the task loss of the fused images, once the fusion module has been optimized by the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion's training relies solely on the loss of downstream tasks, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion's performance in both fusion and task-related applications, including four public fusion datasets, semantic segmentation, and object detection. The code will be released.

Title: DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

Authors: Qingdong He, Jinlong Peng, Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Yong Liu, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03255
Pdf URL: https://arxiv.org/pdf/2412.03255
Copy Paste: [[2412.03255]] DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation(https://arxiv.org/abs/2412.03255)
Keywords: generation
Abstract: To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a fixed number of conditions, which does not fully address the complexity of multiple conditions and their potential conflicts. This underscores the need for innovative approaches to manage multiple conditions effectively for more reliable and detailed image synthesis. To address this issue, we propose a novel framework, DynamicControl, which supports dynamic combinations of diverse control signals, allowing adaptive selection of different numbers and types of conditions. Our approach begins with a double-cycle controller that generates an initial real score sorting for all input conditions by leveraging pre-trained conditional generation models and discriminative models. This controller evaluates the similarity between extracted conditions and input conditions, as well as the pixel-level similarity with the source image. Then, we integrate a Multimodal Large Language Model (MLLM) to build an efficient condition evaluator. This evaluator optimizes the ordering of conditions based on the double-cycle controller's score ranking. Our method jointly optimizes MLLMs and diffusion models, utilizing MLLMs' reasoning capabilities to facilitate multi-condition text-to-image (T2I) tasks. The final sorted conditions are fed into a parallel multi-control adapter, which learns feature maps from dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over generated images. Through both quantitative and qualitative comparisons, DynamicControl demonstrates its superiority over existing methods in terms of controllability, generation quality and composability under various conditional controls.

Title: GERD: Geometric event response data generation

Authors: Jens Egholm Pedersen, Dimitris Korakovounis, Jörg Conradt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03259
Pdf URL: https://arxiv.org/pdf/2412.03259
Copy Paste: [[2412.03259]] GERD: Geometric event response data generation(https://arxiv.org/abs/2412.03259)
Keywords: generation
Abstract: Event-based vision sensors are appealing because of their time resolution, higher dynamic range, and low power consumption. They also provide data that is fundamentally different from that of conventional frame-based cameras: events are sparse, discrete, and require integration in time. Unlike conventional models grounded in established geometric and physical principles, event-based models lack comparable foundations. We introduce a method to generate event-based data under controlled transformations. Specifically, we subject a prototypical object to transformations that change over time to produce carefully curated event videos. We hope this work simplifies studies for geometric approaches in event-based vision. GERD is available at this https URL.

Title: RFSR: Improving ISR Diffusion Models via Reward Feedback Learning

Authors: Xiaopeng Sun, Qinwei Lin, Yu Gao, Yujie Zhong, Chengjian Feng, Dengjie Li, Zheng Zhao, Jie Hu, Lin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03268
Pdf URL: https://arxiv.org/pdf/2412.03268
Copy Paste: [[2412.03268]] RFSR: Improving ISR Diffusion Models via Reward Feedback Learning(https://arxiv.org/abs/2412.03268)
Keywords: super-resolution, generative
Abstract: Generative diffusion models (DM) have been extensively utilized in image super-resolution (ISR). Most of the existing methods adopt the denoising loss from DDPMs for model optimization. We posit that introducing reward feedback learning to finetune the existing models can further improve the quality of the generated images. In this paper, we propose a timestep-aware training strategy with reward feedback learning. Specifically, in the initial denoising stages of ISR diffusion, we apply low-frequency constraints to super-resolution (SR) images to maintain structural stability. In the later denoising stages, we use reward feedback learning to improve the perceptual and aesthetic quality of the SR images. In addition, we incorporate Gram-KL regularization to alleviate stylization caused by reward hacking. Our method can be integrated into any diffusion-based ISR model in a plug-and-play manner. Experiments show that ISR diffusion models, when fine-tuned with our method, significantly improve the perceptual and aesthetic quality of SR images, achieving excellent subjective results. Code: this https URL

Title: Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis

Authors: Tao Jun Lin, Wenqing Wang, Yujiao Shi, Akhil Perincherry, Ankit Vora, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03315
Pdf URL: https://arxiv.org/pdf/2412.03315
Copy Paste: [[2412.03315]] Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis(https://arxiv.org/abs/2412.03315)
Keywords: generation
Abstract: This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa. We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively. Unlike previous works that typically focus on one-to-one generation, producing a single output image from a single input image, our approach acknowledges the inherent one-to-many nature of the problem. This recognition stems from the challenges posed by differences in illumination, weather conditions, and occlusions between the two views. To effectively model this uncertainty, we leverage recent advancements in diffusion models. Specifically, we exploit random Gaussian noise to represent the diverse possibilities learnt from the target view data. We introduce a Geometry-guided Cross-view Condition (GCC) strategy to establish explicit geometric correspondences between satellite and street-view features. This enables us to resolve the geometry ambiguity introduced by camera pose between image pairs, boosting the performance of cross-view image synthesis. Through extensive quantitative and qualitative analyses on three benchmark cross-view datasets, we demonstrate the superiority of our proposed geometry-guided cross-view condition over baseline methods, including recent state-of-the-art approaches in cross-view image synthesis. Our method generates images of higher quality, fidelity, and diversity than other state-of-the-art approaches.

Title: DIVE: Taming DINO for Subject-Driven Video Editing

Authors: Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03347
Pdf URL: https://arxiv.org/pdf/2412.03347
Copy Paste: [[2412.03347]] DIVE: Taming DINO for Subject-Driven Video Editing(https://arxiv.org/abs/2412.03347)
Keywords: generation
Abstract: Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject's identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. Project page: this https URL

Title: Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification

Authors: Alexandre Fournier-Montgieux, Michael Soumm, Adrian Popescu, Bertrand Luvison, Hervé Le Borgne
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03349
Pdf URL: https://arxiv.org/pdf/2412.03349
Copy Paste: [[2412.03349]] Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification(https://arxiv.org/abs/2412.03349)
Keywords: generation, generative
Abstract: Face recognition and verification are two computer vision tasks whose performances have advanced with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive nature of face data and biases in real-world training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems remain. Using the existing DCFace SOTA framework, we introduce a new controlled generation pipeline that improves fairness. Through classical fairness metrics and a proposed in-depth statistical analysis based on logit models and ANOVA, we show that our generation pipeline improves fairness more than other bias mitigation approaches while slightly improving raw performance.

Title: TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

Authors: Qinwei Lin, Xiaopeng Sun, Yu Gao, Yujie Zhong, Dengjie Li, Zheng Zhao, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03355
Pdf URL: https://arxiv.org/pdf/2412.03355
Copy Paste: [[2412.03355]] TASR: Timestep-Aware Diffusion Model for Image Super-Resolution(https://arxiv.org/abs/2412.03355)
Keywords: super-resolution, generation
Abstract: Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: this https URL

Title: Mapping using Transformers for Volumes -- Network for Super-Resolution with Long-Range Interactions

Authors: August Leander Høeg, Sophia W. Bardenfleth, Hans Martin Kjer, Tim B. Dyrby, Vedrana Andersen Dahl, Anders Dahl
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03379
Pdf URL: https://arxiv.org/pdf/2412.03379
Copy Paste: [[2412.03379]] Mapping using Transformers for Volumes -- Network for Super-Resolution with Long-Range Interactions(https://arxiv.org/abs/2412.03379)
Keywords: super-resolution
Abstract: Until now, it has been difficult for volumetric super-resolution to utilize the recent advances in transformer-based models seen in 2D super-resolution. The memory required for self-attention in 3D volumes limits the receptive field. Therefore, long-range interactions are not used in 3D to the extent done in 2D and the strength of transformers is not realized. We propose a multi-scale transformer-based model based on hierarchical attention blocks combined with carrier tokens at multiple scales to overcome this. Here information from larger regions at coarse resolution is sequentially carried on to finer-resolution regions to predict the super-resolved image. Using transformer layers at each resolution, our coarse-to-fine modeling limits the number of tokens at each scale and enables attention over larger regions than what has previously been possible. We experimentally compare our method, MTVNet, against state-of-the-art volumetric super-resolution models on five 3D datasets demonstrating the advantage of an increased receptive field. This advantage is especially pronounced for images that are larger than what is seen in popularly used 3D datasets. Our code is available at this https URL

Title: Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment

Authors: Feng He, Chao Zhang, Zhixue Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03400
Pdf URL: https://arxiv.org/pdf/2412.03400
Copy Paste: [[2412.03400]] Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment(https://arxiv.org/abs/2412.03400)
Keywords: generation
Abstract: Implicit assumptions and priors are often necessary in text-to-image generation tasks, especially when textual prompts lack sufficient context. However, these assumptions can sometimes reflect outdated concepts, inaccuracies, or societal bias embedded in the training data. We present Embedding-only Editing (Embedit), a method designed to efficiently adjust implicit assumptions and priors in the model without affecting its interpretation of unrelated objects or overall performance. Given a "source" prompt (e.g., "rose") that elicits an implicit assumption (e.g., rose is red) and a "destination" prompt that specifies the desired attribute (e.g., "blue rose"), Embedit fine-tunes only the word token embedding (WTE) of the target object ("rose") to optimize the last hidden state of the text encoder in Stable Diffusion, a SOTA text-to-image model. This targeted adjustment prevents unintended effects on other objects in the model's knowledge base, as the WTEs for unrelated objects and the model weights remain unchanged. Consequently, when a prompt does not contain the edited object, all representations and model outputs are identical to those of the original, unedited model. Our method is highly efficient, modifying only 768 parameters for Stable Diffusion 1.4 and 2048 for XL in a single edit, matching the WTE dimension of each respective model. This minimal scope, combined with rapid execution, makes Embedit highly practical for real-world applications. Additionally, changes are easily reversible by restoring the original WTE layers. Our experimental results demonstrate that Embedit consistently outperforms previous methods across various models, tasks, and editing scenarios (both single and sequential multiple edits), achieving at least a 6.01% improvement (from 87.17% to 93.18%).

Title: Skel3D: Skeleton Guided Novel View Synthesis

Authors: Aron Fóthi, Bence Fazekas, Natabara Máté Gyöngyössy, Kristian Fenech
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03407
Pdf URL: https://arxiv.org/pdf/2412.03407
Copy Paste: [[2412.03407]] Skel3D: Skeleton Guided Novel View Synthesis(https://arxiv.org/abs/2412.03407)
Keywords: generative
Abstract: In this paper, we present an approach for monocular open-set novel view synthesis (NVS) that leverages object skeletons to guide the underlying diffusion model. Building upon a baseline that utilizes a pre-trained 2D image generator, our method takes advantage of the Objaverse dataset, which includes animated objects with bone structures. By introducing a skeleton guide layer following the existing ray conditioning normalization (RCN) layer, our approach enhances pose accuracy and multi-view consistency. The skeleton guide layer provides detailed structural information for the generative model, improving the quality of synthesized views. Experimental results demonstrate that our skeleton-guided method significantly enhances consistency and accuracy across diverse object categories within the Objaverse dataset. Our method outperforms existing state-of-the-art NVS techniques both quantitatively and qualitatively, without relying on explicit 3D representations.

Title: PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

Authors: Ao Wang, Hui Chen, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03409
Pdf URL: https://arxiv.org/pdf/2412.03409
Copy Paste: [[2412.03409]] PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation(https://arxiv.org/abs/2412.03409)
Keywords: generation
Abstract: Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next-token prediction. This results in significant contextual information loss for certain layers, leading to a notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating the generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with others. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at this https URL.

Title: SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

Authors: Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo
Subjects: cs.CV, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2412.03430
Pdf URL: https://arxiv.org/pdf/2412.03430
Copy Paste: [[2412.03430]] SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model(https://arxiv.org/abs/2412.03430)
Keywords: generation, generative
Abstract: Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the effectiveness of existing talking face video generation models when applied to singing. We observe that the differences between singing and talking audio manifest in terms of frequency and amplitude. To address this, we have designed a multi-scale spectral module to help the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that aids the model in learning the human behaviors associated with singing audio. These two modules are integrated into the diffusion model to enhance singing video generation performance, resulting in our proposed model, SINGER. Furthermore, the lack of high-quality real-world singing face videos has hindered the development of the singing video generation community. To address this gap, we have collected an in-the-wild audio-visual singing dataset to facilitate research in this area. Our experiments demonstrate that SINGER is capable of generating vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.

Title: Pre-trained Multiple Latent Variable Generative Models are good defenders against Adversarial Attacks

Authors: Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03453
Pdf URL: https://arxiv.org/pdf/2412.03453
Copy Paste: [[2412.03453]] Pre-trained Multiple Latent Variable Generative Models are good defenders against Adversarial Attacks(https://arxiv.org/abs/2412.03453)
Keywords: generative
Abstract: Attackers can deliberately perturb classifiers' input with subtle noise, altering final predictions. Among proposed countermeasures, adversarial purification employs generative networks to preprocess input images, filtering out adversarial noise. In this study, we propose specific generators, termed Multiple Latent Variable Generative Models (MLVGMs), for adversarial purification. These models possess multiple latent variables that naturally disentangle coarse from fine features. Taking advantage of these properties, we autoencode images to maintain class-relevant information, while discarding and re-sampling any detail, including adversarial noise. The procedure is completely training-free, exploring the generalization abilities of pre-trained MLVGMs on the adversarial purification downstream task. Despite the lack of large models trained on billions of samples, we show that smaller MLVGMs are already competitive with traditional methods, and can be used as foundation models. Official code released at this https URL.

Title: Urban4D: Semantic-Guided 4D Gaussian Splatting for Urban Scene Reconstruction

Authors: Ziwen Li, Jiaxin Huang, Runnan Chen, Yunlong Che, Yandong Guo, Tongliang Liu, Fakhri Karray, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03473
Pdf URL: https://arxiv.org/pdf/2412.03473
Copy Paste: [[2412.03473]] Urban4D: Semantic-Guided 4D Gaussian Splatting for Urban Scene Reconstruction(https://arxiv.org/abs/2412.03473)
Keywords: generation
Abstract: Reconstructing dynamic urban scenes presents significant challenges due to their intrinsic geometric structures and spatiotemporal dynamics. Existing methods that attempt to model dynamic urban scenes without leveraging priors on potentially moving regions often produce suboptimal results. Meanwhile, approaches based on manual 3D annotations yield improved reconstruction quality but are impractical due to labor-intensive labeling. In this paper, we revisit the potential of 2D semantic maps for classifying dynamic and static Gaussians and integrating spatial and temporal dimensions for urban scene representation. We introduce Urban4D, a novel framework that employs a semantic-guided decomposition strategy inspired by advances in deep 2D semantic map generation. Our approach distinguishes potentially dynamic objects through reliable semantic Gaussians. To explicitly model dynamic objects, we propose an intuitive and effective 4D Gaussian splatting (4DGS) representation that aggregates temporal information through learnable time embeddings for each Gaussian, predicting their deformations at desired timestamps using a multilayer perceptron (MLP). For more accurate static reconstruction, we also design a k-nearest neighbor (KNN)-based consistency regularization to handle the ground surface due to its low-texture characteristic. Extensive experiments on real-world datasets demonstrate that Urban4D not only achieves comparable or better quality than previous state-of-the-art methods but also effectively captures dynamic objects while maintaining high visual fidelity for static elements.

Title: Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective

Authors: Neta Shaul, Itai Gat, Marton Havasi, Daniel Severo, Anuroop Sriram, Peter Holderrieth, Brian Karrer, Yaron Lipman, Ricky T. Q. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03487
Pdf URL: https://arxiv.org/pdf/2412.03487
Copy Paste: [[2412.03487]] Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective(https://arxiv.org/abs/2412.03487)
Keywords: generation, generative
Abstract: The design space of discrete-space diffusion or flow generative models is significantly less well understood than that of their continuous-space counterparts, with many works focusing only on a simple masked construction. In this work, we aim to take a holistic approach to the construction of discrete generative models based on continuous-time Markov chains, and for the first time, allow the use of arbitrary discrete probability paths, or colloquially, corruption processes. Through the lens of optimizing the symmetric kinetic energy, we propose velocity formulas that can be applied to any given probability path, completely decoupling the probability and velocity, and giving the user the freedom to specify any desirable probability path based on expert knowledge specific to the data domain. Furthermore, we find that a special construction of mixture probability paths optimizes the symmetric kinetic energy for the discrete case. We empirically validate the usefulness of this new design space across multiple modalities: text generation, inorganic material generation, and image generation. We find that we can outperform the mask construction even in text with kinetic-optimal mixture paths, while we can make use of domain-specific constructions of the probability path over the visual domain.
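Summary: The design space of discrete-space diffusion and flow generative models is far less well understood than that of their continuous-space counterparts, and many works consider only a simple masked construction. In this work, we take a holistic approach to constructing discrete generative models based on continuous-time Markov chains and, for the first time, allow arbitrary discrete probability paths, colloquially known as corruption processes. Through the lens of optimizing the symmetric kinetic energy, we propose velocity formulas that apply to any given probability path. This fully decouples probability from velocity and lets the user specify any desirable probability path based on expert knowledge of the data domain. We also find that a special construction of mixture probability paths optimizes the symmetric kinetic energy in the discrete case. We empirically validate the usefulness of this new design space across several modalities: text generation, inorganic material generation, and image generation. With kinetic-optimal mixture paths we can outperform the mask construction even on text, and in the visual domain we can exploit domain-specific constructions of the probability path.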
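
The "simple masked construction" that the abstract uses as its baseline is commonly written as a per-token mixture path: at time t each position holds its data token with probability kappa(t) and a mask token otherwise. The sketch below only illustrates that baseline corruption process, not the paper's kinetic-optimal velocities. The linear scheduler `kappa`, the `MASK` id, and the function name are assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical id for the mask token


def kappa(t):
    """Assumed linear scheduler with kappa(0) = 0 and kappa(1) = 1."""
    return t


def sample_mixture_path(x1, t, rng):
    """Draw x_t from the masked mixture path.

    Each position independently keeps its data token with probability
    kappa(t) and is replaced by MASK otherwise.
    """
    return [tok if rng.random() < kappa(t) else MASK for tok in x1]


rng = random.Random(0)
x1 = [5, 3, 8, 1, 9, 2]                       # a toy "clean" token sequence
x_start = sample_mixture_path(x1, 0.0, rng)   # t = 0: every token is masked
x_mid = sample_mixture_path(x1, 0.5, rng)     # t = 0.5: a random subset is masked
x_end = sample_mixture_path(x1, 1.0, rng)     # t = 1: the data is recovered
```

Swapping `kappa` or the replacement distribution changes the probability path, which is the degree of freedom the abstract says the proposed velocity formulas leave to the user.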

Title: Distillation of Diffusion Features for Semantic Correspondence

Authors: Frank Fundel, Johannes Schusterbauer, Vincent Tao Hu, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03512
Pdf URL: https://arxiv.org/pdf/2412.03512
Copy Paste: [[2412.03512]] Distillation of Diffusion Features for Semantic Correspondence(https://arxiv.org/abs/2412.03512)
Keywords: generative
Abstract: Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge by proposing a more computationally efficient approach. We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.
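Summary: Semantic correspondence, the task of determining relationships between different parts of images, underpins applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, with promising results. Building on this, current state-of-the-art methods combine multiple large models, which leads to high computational demands and low efficiency. In this work, we address this challenge with a more computationally efficient approach. We propose a novel knowledge distillation technique to overcome the efficiency problem. We show how to take two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at lower computational cost. We further show that incorporating 3D data improves performance without human-annotated correspondences. Overall, our empirical results show that the distilled model with 3D data augmentation outperforms current state-of-the-art methods while significantly reducing computational load and improving practicality for real-world applications such as semantic video correspondence. Our code and weights are publicly available on our project page.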

Title: NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

Authors: Lingen Li, Zhaoyang Zhang, Yaowei Li, Jiale Xu, Xiaoyu Li, Wenbo Hu, Weihao Cheng, Jinwei Gu, Tianfan Xue, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03517
Pdf URL: https://arxiv.org/pdf/2412.03517
Copy Paste: [[2412.03517]] NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images(https://arxiv.org/abs/2412.03517)
Keywords: generative
Abstract: Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems.
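Summary: Recent advances in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable because of insufficient overlap or occlusion between views. In this paper, we propose NVComposer, a novel approach that needs no explicit external alignment. NVComposer lets the generative model implicitly infer spatial and geometric relationships among multiple conditional views through two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. Extensive experiments show that NVComposer achieves state-of-the-art performance on generative multi-view NVS tasks, removing the reliance on external alignment and thereby improving model accessibility. Synthesis quality improves substantially as the number of unposed input views increases, highlighting the method's potential for more flexible and accessible generative NVS systems.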

Title: Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Authors: Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03520
Pdf URL: https://arxiv.org/pdf/2412.03520
Copy Paste: [[2412.03520]] Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention(https://arxiv.org/abs/2412.03520)
Keywords: generation
Abstract: Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at this https URL.
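Summary: Generating multi-view videos for autonomous driving training has recently attracted much attention; the challenge is to achieve both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms to the spatial, temporal, and view dimensions. However, these methods often struggle to stay consistent across dimensions, particularly for fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed to synthesize high-quality multi-view driving videos. CogDriving uses a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored to CogDriving, the Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet and enables precise control over Bird's-Eye-View layouts. To improve the generation of object instances that are crucial for autonomous driving, we propose a re-weighted learning objective that dynamically adjusts the learning weights of object instances during training. CogDriving performs strongly on the nuScenes validation set, achieving an FVD score of 37.8, which highlights its ability to generate realistic driving videos. The project can be found at this https URL.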
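
The abstract's point about decoupled attention can be made concrete with a toy reachability check. The sketch assumes each token is labeled (view, frame, position) and that a decoupled layer attends along exactly one of those axes, so two tokens interact directly only when they agree on the other two coordinates. The function names and token layout are illustrative and do not come from the CogDriving code.

```python
def decoupled_can_attend(a, b):
    """True if some single-axis attention layer links tokens a and b.

    Attention along one axis only mixes tokens that share the other two
    coordinates, so at least two of the three coordinates must match.
    """
    shared = sum(1 for x, y in zip(a, b) if x == y)
    return shared >= 2


def holistic_can_attend(a, b):
    """Holistic attention places every (view, frame, position) token in one
    sequence, so any pair can interact within a single layer."""
    return True


# A fast-moving car seen in view 0 at frame 0, then in view 1 at frame 1,
# at a different image position.
car_early = (0, 0, 3)
car_late = (1, 1, 7)
same_frame_neighbor = (0, 0, 4)
```

Under these assumptions `car_early` and `car_late` differ in all three coordinates, so no single decoupled layer connects them directly, whereas a holistic layer does.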

Title: NODE-AdvGAN: Improving the transferability and perceptual similarity of adversarial examples by dynamic-system-driven adversarial generative model

Authors: Xinheng Xie, Yue Wu, Cuiyu He
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03539
Pdf URL: https://arxiv.org/pdf/2412.03539
Copy Paste: [[2412.03539]] NODE-AdvGAN: Improving the transferability and perceptual similarity of adversarial examples by dynamic-system-driven adversarial generative model(https://arxiv.org/abs/2412.03539)
Keywords: generation, generative
Abstract: Understanding adversarial examples is crucial for improving the model's robustness, as they introduce imperceptible perturbations that deceive models. Effective adversarial examples, therefore, offer the potential to train more robust models by removing their singularities. We propose NODE-AdvGAN, a novel approach that treats adversarial generation as a continuous process and employs a Neural Ordinary Differential Equation (NODE) for simulating the dynamics of the generator. By mimicking the iterative nature of traditional gradient-based methods, NODE-AdvGAN generates smoother and more precise perturbations that preserve high perceptual similarity when added to benign images. We also propose a new training strategy, NODE-AdvGAN-T, which enhances transferability in black-box attacks by effectively tuning noise parameters during training. Experiments demonstrate that NODE-AdvGAN and NODE-AdvGAN-T generate more effective adversarial examples that achieve higher attack success rates while preserving better perceptual quality than traditional GAN-based methods.
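Summary: Understanding adversarial examples is crucial for improving model robustness, because they introduce imperceptible perturbations that deceive models. Effective adversarial examples can therefore help train more robust models by removing their singularities. We propose NODE-AdvGAN, a novel approach that treats adversarial generation as a continuous process and uses a Neural Ordinary Differential Equation (NODE) to simulate the dynamics of the generator. By mimicking the iterative nature of traditional gradient-based methods, NODE-AdvGAN generates smoother and more precise perturbations that preserve high perceptual similarity when added to benign images. We also propose a new training strategy, NODE-AdvGAN-T, which improves transferability in black-box attacks by effectively tuning noise parameters during training. Experiments show that, compared with traditional GAN-based methods, NODE-AdvGAN and NODE-AdvGAN-T generate more effective adversarial examples, with higher attack success rates and better perceptual quality.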
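
Treating generation as a continuous process means the output is obtained by integrating an ODE dz/dt = f(z, t) rather than by a single forward pass. The sketch below shows the idea with fixed-step Euler integration. The vector field `toy_field` is a stand-in for a learned network, and the final clipping to an L-infinity budget `eps` is an assumption borrowed from common adversarial-attack practice; neither is taken from the paper.

```python
def toy_field(z, t):
    """Stand-in for a learned vector field f(z, t): pulls each value toward 1.0."""
    return [1.0 - zi for zi in z]


def euler_integrate(f, z0, t0=0.0, t1=1.0, steps=10):
    """Integrate dz/dt = f(z, t) from t0 to t1 with fixed-step Euler updates."""
    z = list(z0)
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        dz = f(z, t)
        z = [zi + h * di for zi, di in zip(z, dz)]
        t += h
    return z


def perturb(image, eps=0.1, steps=10):
    """Evolve the image under the ODE, then clip the change to [-eps, eps]."""
    z = euler_integrate(toy_field, image, steps=steps)
    return [xi + max(-eps, min(eps, zi - xi)) for xi, zi in zip(image, z)]


image = [0.2, 0.5, 0.95]   # toy pixel values
adv = perturb(image)
```

Each Euler step plays the role of one small update, which is the sense in which the abstract compares the approach to iterative gradient-based attacks.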

Title: Imagine360: Immersive 360 Video Generation from Perspective Anchor

Authors: Jing Tan, Shuai Yang, Tong Wu, Jingwen He, Yuwei Guo, Ziwei Liu, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03552
Pdf URL: https://arxiv.org/pdf/2412.03552
Copy Paste: [[2412.03552]] Imagine360: Immersive 360 Video Generation from Perspective Anchor(https://arxiv.org/abs/2412.03552)
Keywords: generation
Abstract: $360^\circ$ videos offer a hyper-immersive experience that allows the viewers to explore a dynamic scene from full 360 degrees. To achieve more user-friendly and personalized content creation in $360^\circ$ video format, we seek to lift standard perspective videos into $360^\circ$ equirectangular videos. To this end, we introduce Imagine360, the first perspective-to-$360^\circ$ video generation framework that creates high-quality $360^\circ$ videos with rich and diverse motion patterns from video anchors. Imagine360 learns fine-grained spherical visual and motion patterns from limited $360^\circ$ video data with several key designs. 1) First, we adopt a dual-branch design, with a perspective and a panorama video denoising branch, to provide local and global constraints for $360^\circ$ video generation, with the motion module and spatial LoRA layers fine-tuned on extended web $360^\circ$ videos. 2) Additionally, an antipodal mask is devised to capture long-range motion dependencies, enhancing the reversed camera motion between antipodal pixels across hemispheres. 3) To handle diverse perspective video inputs, we propose elevation-aware designs that adapt to varying video masking due to changing elevations across frames. Extensive experiments show that Imagine360 achieves superior graphics quality and motion coherence among state-of-the-art $360^\circ$ video generation methods. We believe Imagine360 holds promise for advancing personalized, immersive $360^\circ$ video creation.
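Summary: $360^\circ$ videos offer a hyper-immersive experience in which viewers can explore a dynamic scene from a full 360 degrees. To make content creation in $360^\circ$ video format more user-friendly and personalized, we seek to lift standard perspective videos into $360^\circ$ equirectangular videos. To this end, we introduce Imagine360, the first perspective-to-$360^\circ$ video generation framework, which creates high-quality $360^\circ$ videos with rich and diverse motion patterns from video anchors. Imagine360 learns fine-grained spherical visual and motion patterns from limited $360^\circ$ video data through several key designs. 1) First, we adopt a dual-branch design, with a perspective and a panorama video denoising branch, to provide local and global constraints for $360^\circ$ video generation, and fine-tune the motion module and spatial LoRA layers on extended web $360^\circ$ videos. 2) In addition, an antipodal mask is devised to capture long-range motion dependencies, enhancing the reversed camera motion between antipodal pixels across hemispheres. 3) To handle diverse perspective video inputs, we propose elevation-aware designs that adapt to the varying video masking caused by elevation changes across frames. Extensive experiments show that Imagine360 achieves superior graphics quality and motion coherence among state-of-the-art $360^\circ$ video generation methods. We believe Imagine360 holds promise for advancing personalized, immersive $360^\circ$ video creation.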

Title: PaliGemma 2: A Family of Versatile VLMs for Transfer

Authors: Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03555
Pdf URL: https://arxiv.org/pdf/2412.03555
Copy Paste: [[2412.03555]] PaliGemma 2: A Family of Versatile VLMs for Transfer(https://arxiv.org/abs/2412.03555)
Keywords: generation
Abstract: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
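Summary: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder, which PaliGemma also used, with the whole range of Gemma 2 models, from the 2B model up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to give them broad knowledge for transfer via fine-tuning. The resulting family of base models, covering different model sizes and resolutions, lets us investigate factors that affect transfer performance (such as learning rate) and analyze the interplay between task type, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma, including OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.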

Title: MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Authors: Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03558
Pdf URL: https://arxiv.org/pdf/2412.03558
Copy Paste: [[2412.03558]] MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation(https://arxiv.org/abs/2412.03558)
Keywords: generation
Abstract: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
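Summary: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval, or recent approaches that generate objects one by one in multiple stages, MIDI extends pre-trained image-to-3D object generation models into multi-instance diffusion models, which can generate multiple 3D instances simultaneously with accurate spatial relationships and high generalizability. At its core, MIDI uses a novel multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, without complex multi-step procedures. The method takes partial object images and global scene context as inputs and models object completion directly during 3D generation. During training, we supervise the interactions between 3D instances with a limited amount of scene-level data while incorporating single-object data for regularization, which preserves the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.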

Title: FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

Authors: Lue Fan, Hao Zhang, Qitai Wang, Hongsheng Li, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03566
Pdf URL: https://arxiv.org/pdf/2412.03566
Copy Paste: [[2412.03566]] FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes(https://arxiv.org/abs/2412.03566)
Keywords: generation, generative
Abstract: We propose FreeSim, a camera simulation method for autonomous driving. FreeSim emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. In such viewpoints, previous methods have unacceptable degradation because the training data of these viewpoints is unavailable. To address such data scarcity, we first propose a generative enhancement model with a matched data construction strategy. The resulting model can generate high-quality images in a viewpoint slightly deviated from the recorded trajectories, conditioned on the degraded rendering of this viewpoint. We then propose a progressive reconstruction strategy, which progressively adds generated images of unrecorded views into the reconstruction process, starting from slightly off-trajectory viewpoints and moving progressively farther away. With this progressive generation-reconstruction pipeline, FreeSim supports high-quality off-trajectory view synthesis under large deviations of more than 3 meters.
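Summary: We propose FreeSim, a camera simulation method for autonomous driving. FreeSim emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. At such viewpoints, previous methods degrade unacceptably because no training data is available for them. To address this data scarcity, we first propose a generative enhancement model with a matched data construction strategy. The resulting model can generate high-quality images at a viewpoint that deviates slightly from the recorded trajectories, conditioned on the degraded rendering of that viewpoint. We then propose a progressive reconstruction strategy that gradually adds generated images of unrecorded views to the reconstruction process, starting from slightly off-trajectory viewpoints and moving progressively farther away. With this progressive generation-reconstruction pipeline, FreeSim supports high-quality off-trajectory view synthesis under large deviations of more than 3 meters.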

Title: Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis

Authors: Qitao Zhao, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03570
Pdf URL: https://arxiv.org/pdf/2412.03570
Copy Paste: [[2412.03570]] Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis(https://arxiv.org/abs/2412.03570)
Keywords: generative
Abstract: Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks -- accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical framework of analysis by synthesis casts this inference as a joint optimization seeking to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) with gradient-descent-based pose refinement of initial pose estimates. However, given a sparse set of observed views, the observations may not provide sufficient direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easily corrected and can further degrade the inferred 3D. To allow robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts this analysis-by-synthesis approach by: a) including novel-view-synthesis-based generative priors in conjunction with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and using a discrete search with a continuous optimization-based strategy to correct them. We validate our framework across real-world and synthetic datasets in combination with several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the base systems' pose accuracy while yielding high-quality 3D reconstructions that outperform the results from current multi-view reconstruction baselines.
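Summary: Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks: accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical analysis-by-synthesis framework casts this inference as a joint optimization that seeks to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) while refining initial pose estimates by gradient descent. However, given a sparse set of observed views, the observations may not provide enough direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easy to correct and can further degrade the inferred 3D. To enable robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts the analysis-by-synthesis approach by: a) combining novel-view-synthesis-based generative priors with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and correcting them with a discrete search combined with a continuous optimization-based strategy. We validate our framework on real-world and synthetic datasets, using several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the pose accuracy of the base systems while yielding high-quality 3D reconstructions that outperform current multi-view reconstruction baselines.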

Title: Style3D: Attention-guided Multi-view Style Transfer for 3D Object Generation

Authors: Bingjie Song, Xin Huang, Ruting Xie, Xue Wang, Qing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03571
Pdf URL: https://arxiv.org/pdf/2412.03571
Copy Paste: [[2412.03571]] Style3D: Attention-guided Multi-view Style Transfer for 3D Object Generation(https://arxiv.org/abs/2412.03571)
Keywords: generation
Abstract: We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. Unlike most previous methods that require case- or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. We introduce MultiFusion Attention, an attention-guided technique to achieve multi-view stylization from the content-style pair. Specifically, the query features from the content image preserve geometric consistency across multiple views, while the key and value features from the style image are used to guide the stylistic transfer. This dual-feature alignment ensures that spatial coherence and stylistic fidelity are maintained across multi-view images. Finally, a large 3D reconstruction model is introduced to generate coherent stylized 3D objects. By establishing an interplay between structural and stylistic features across multiple views, our approach enables a holistic 3D stylization process. Extensive experiments demonstrate that Style3D offers a more flexible and scalable solution for generating style-consistent 3D assets, surpassing existing methods in both computational efficiency and visual quality.
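Summary: We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. Unlike most previous methods, which require case-specific or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. We introduce MultiFusion Attention, an attention-guided technique that achieves multi-view stylization from the content-style pair. Specifically, the query features from the content image preserve geometric consistency across views, while the key and value features from the style image guide the style transfer. This dual-feature alignment ensures that spatial coherence and stylistic fidelity are maintained across the multi-view images. Finally, a large 3D reconstruction model is introduced to generate coherent stylized 3D objects. By establishing an interplay between structural and stylistic features across multiple views, our approach enables a holistic 3D stylization process. Extensive experiments show that Style3D offers a more flexible and scalable solution for generating style-consistent 3D assets and surpasses existing methods in both computational efficiency and visual quality.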
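
The query/key/value split described in the abstract is the standard cross-attention pattern: queries come from one source (the content image) and keys and values from another (the style image). The sketch below is a plain scaled dot-product cross-attention over lists of feature vectors. It does not reproduce MultiFusion Attention itself, and the toy features are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one source and
    keys/values from another. Returns one output vector per query."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


content_q = [[10.0, 0.0], [0.0, 10.0]]   # toy content-image queries
style_k = [[1.0, 0.0], [0.0, 1.0]]       # toy style-image keys
style_v = [[1.0, 0.0], [0.0, 1.0]]       # toy style-image values
stylized = cross_attention(content_q, style_k, style_v)
```

The output keeps one vector per content query, so the spatial layout follows the content image, while every output vector is a weighted mix of style values.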

Title: Navigation World Models

Authors: Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.03572
Pdf URL: https://arxiv.org/pdf/2412.03572
Copy Paste: [[2412.03572]] Navigation World Models(https://arxiv.org/abs/2412.03572)
Keywords: generation
Abstract: Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
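Summary: Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce the Navigation World Model (NWM), a controllable video generation model that predicts future visual observations from past observations and navigation actions. To capture complex environment dynamics, NWM uses a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they reach the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness at planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM uses its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.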
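
Planning by simulating candidate trajectories and ranking them can be sketched with a toy stand-in for the world model. Here the state is a 2D position rather than a predicted video frame, the score is the distance from the final simulated state to the goal, and constraints are a predicate that filters candidates before scoring. All names and the scoring rule are assumptions for illustration, not the NWM implementation.

```python
import math


def toy_world_model(state, action):
    """Stand-in for a learned world model: predicts the next state."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(state, actions):
    """Simulate a sequence of actions and return the final predicted state."""
    for a in actions:
        state = toy_world_model(state, a)
    return state


def rank_trajectories(start, goal, candidates, allowed=lambda actions: True):
    """Rank candidate action sequences by how close their simulated end
    state is to the goal, skipping any that violate the constraint."""
    scored = []
    for actions in candidates:
        if not allowed(actions):
            continue
        final = rollout(start, actions)
        scored.append((math.dist(final, goal), actions))
    scored.sort(key=lambda pair: pair[0])
    return [actions for _, actions in scored]


start, goal = (0, 0), (2, 1)
candidates = [
    [(1, 0), (1, 0)],     # ends at (2, 0)
    [(1, 0), (1, 1)],     # ends at (2, 1), the goal
    [(0, -1), (0, -1)],   # ends at (0, -2)
]
ranked = rank_trajectories(start, goal, candidates)

# A planning-time constraint: forbid any vertical motion.
no_vertical = lambda actions: all(dy == 0 for _, dy in actions)
ranked_constrained = rank_trajectories(start, goal, candidates, allowed=no_vertical)
```

Because the constraint is applied at planning time, changing it needs no retraining, which is the contrast the abstract draws with fixed supervised policies.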