2024-12-12

Title: Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

Authors: Dongjie Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.07797
Pdf URL: https://arxiv.org/pdf/2412.07797
Copy Paste: [[2412.07797]] Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation(https://arxiv.org/abs/2412.07797)
Keywords: generation
Abstract: In the field of text-to-motion generation, BERT-type masked models (MoMask, MMM) currently produce higher-quality outputs than GPT-type autoregressive models (T2M-GPT). However, these BERT-type models often lack the streaming output capability required for applications in video game and multimedia environments, a feature inherent to GPT-type models. Additionally, they demonstrate weaker performance in out-of-distribution generation. To surpass the quality of BERT-type models while leveraging a GPT-type structure, without adding extra refinement models that complicate scaling data, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes continuous motion sequences with high precision; 2) Hierarchical Causal Transformer, responsible for generating the base motion sequences in an autoregressive manner while simultaneously inferring residuals across different layers. Experimental results demonstrate that Mogo can generate continuous and cyclic motion sequences up to 260 frames (13 seconds), surpassing the 196-frame (10-second) length limit of existing datasets like HumanML3D. On the HumanML3D test set, Mogo achieves an FID score of 0.079, outperforming the GPT-type model T2M-GPT (FID = 0.116), AttT2M (FID = 0.112), and the BERT-type model MMM (FID = 0.080). Furthermore, our model achieves the best quantitative performance in out-of-distribution generation.
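The residual vector quantization idea named in the abstract can be illustrated with a minimal sketch. This is not the Mogo implementation: the codebooks here are hypothetical random vectors (a trained RVQ-VAE would learn them), and the names `rvq_encode` and `rvq_decode` are illustrative only. Each layer quantizes whatever the previous layers left unexplained, so the sum of the selected codes plus the final residual recovers the input.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared L2 distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Sum the selected code vectors from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

rng = random.Random(0)
dim, layers, size = 4, 3, 8
# Hypothetical codebooks with a scale that halves per layer (assumption, not from the paper).
codebooks = [
    [[rng.gauss(0.0, 0.5 ** layer) for _ in range(dim)] for _ in range(size)]
    for layer in range(layers)
]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
codes, residual = rvq_encode(x, codebooks)
recon = rvq_decode(codes, codebooks)
print("codes per layer:", codes)
```

In this sketch the first code index plays the role of the base token and the later indices play the role of the per-layer residual tokens that the abstract says are inferred alongside it.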

Title: Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor

Authors: Jiali Chen, Xusen Hei, Yuqi Xue, Yuancheng Wei, Jiayuan Xie, Yi Cai, Qing Li
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.07801
Pdf URL: https://arxiv.org/pdf/2412.07801
Copy Paste: [[2412.07801]] Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor(https://arxiv.org/abs/2412.07801)
Keywords: generation
Abstract: Large multimodal models (LMMs) have shown remarkable performance in the visual commonsense reasoning (VCR) task, which aims to answer a multiple-choice question based on visual commonsense within an image. However, the ability of LMMs to correct potential visual commonsense errors in the distractors remains under-explored. Drawing inspiration from how a human teacher crafts challenging distractors to test students' comprehension of concepts or skills and helps them identify and correct errors on the way to the answer, we present the first study in which LMMs simulate this error correction process. To this end, we employ GPT-4 as a "teacher" to collect the explainable feedback dataset VCR-DF for error correction, which serves as a benchmark to evaluate the ability of LMMs to identify misconceptions and clarify the reasons behind errors in VCR distractors on the way to the final answers. In addition, we propose an LMM-based Pedagogical Expert Instructed Feedback Generation (PEIFG) model that incorporates learnable expert prompts and multimodal instruction as guidance for feedback generation. Experimental results show that our PEIFG significantly outperforms existing LMMs. We believe that our benchmark provides a new direction for evaluating the capabilities of LMMs.

Title: Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Authors: Myeongseob Ko, Henry Li, Zhun Wang, Jonathan Patsenker, Jiachen T. Wang, Qinbin Li, Ming Jin, Dawn Song, Ruoxi Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07808
Pdf URL: https://arxiv.org/pdf/2412.07808
Copy Paste: [[2412.07808]] Boosting Alignment for Post-Unlearning Text-to-Image Generative Models(https://arxiv.org/abs/2412.07808)
Keywords: generation, generative
Abstract: Large-scale generative models have shown impressive image-generation capabilities, propelled by massive data. However, this often inadvertently leads to the generation of harmful or inappropriate content and raises copyright concerns. Driven by these concerns, machine unlearning has become crucial to effectively purge undesirable knowledge from models. While existing literature has studied various unlearning techniques, these often suffer from either poor unlearning quality or degradation in text-image alignment after unlearning, due to the competitive nature of these objectives. To address these challenges, we propose a framework that seeks an optimal model update at each unlearning iteration, ensuring monotonic improvement on both objectives. We further derive the characterization of such an update. In addition, we design procedures to strategically diversify the unlearning and remaining datasets to boost performance improvement. Our evaluation demonstrates that our method effectively removes target classes from recent diffusion-based generative models and concepts from stable diffusion models while maintaining close alignment with the models' original trained states, thus outperforming state-of-the-art baselines. Our code will be made available at this https URL.

Title: Pix2Poly: A Sequence Prediction Method for End-to-end Polygonal Building Footprint Extraction from Remote Sensing Imagery

Authors: Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.07899
Pdf URL: https://arxiv.org/pdf/2412.07899
Copy Paste: [[2412.07899]] Pix2Poly: A Sequence Prediction Method for End-to-end Polygonal Building Footprint Extraction from Remote Sensing Imagery(https://arxiv.org/abs/2412.07899)
Keywords: generative
Abstract: Extraction of building footprint polygons from remotely sensed data is essential for several urban understanding tasks such as reconstruction, navigation, and mapping. Despite significant progress in the area, extracting accurate polygonal building footprints remains an open problem. In this paper, we introduce Pix2Poly, an attention-based end-to-end trainable and differentiable deep neural network capable of directly generating explicit high-quality building footprints in a ring graph format. Pix2Poly employs a generative encoder-decoder transformer to produce a sequence of graph vertex tokens whose connectivity information is learned by an optimal matching network. Compared to previous graph learning methods, ours is a truly end-to-end trainable approach that extracts high-quality building footprints and road networks without requiring complicated, computationally intensive raster loss functions and intricate training pipelines. Upon evaluating Pix2Poly on several complex and challenging datasets, we report that Pix2Poly outperforms state-of-the-art methods in several vector shape quality metrics while being an entirely explicit method. Our code is available at this https URL.

Title: Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Authors: Can Yaras, Siyi Chen, Peng Wang, Qing Qu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.07909
Pdf URL: https://arxiv.org/pdf/2412.07909
Copy Paste: [[2412.07909]] Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning(https://arxiv.org/abs/2412.07909)
Keywords: generative
Abstract: Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of the modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are validated through experiments on practical CLIP models. These findings provide principled guidance for mitigating the modality gap, including strategies such as appropriate temperature scheduling and modality swapping. Additionally, we demonstrate that closing the modality gap leads to improved performance on tasks such as image-text retrieval.
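To make the two quantities in this abstract concrete, here is a small sketch under stated assumptions. It measures the modality gap as the distance between the centroids of L2-normalized image and text embeddings, which is one common convention and not necessarily the definition used in the paper, and it computes a symmetric contrastive (CLIP-style) loss with an explicit temperature. The function names and the toy embeddings are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def modality_gap(img, txt):
    """Distance between centroids of normalized embeddings (assumed gap measure)."""
    a = centroid([normalize(v) for v in img])
    b = centroid([normalize(v) for v in txt])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cross_entropy_diag(rows):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(rows):
        m = max(row)  # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[k]
    return total / len(rows)

def clip_loss(img, txt, temperature):
    """Symmetric contrastive loss over cosine similarities divided by temperature."""
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy embeddings: the two modalities sit in different regions of the plane.
img = [[1.0, 0.1], [0.9, 0.2]]
txt = [[0.1, 1.0], [0.2, 0.9]]
print("gap:", modality_gap(img, txt))
print("loss at temperature 0.1:", clip_loss(img, txt, 0.1))
```

Varying `temperature` in `clip_loss` is the knob that the temperature-scheduling strategy mentioned in the abstract would adjust during training.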

Title: Non-Normal Diffusion Models

Authors: Henry Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07935
Pdf URL: https://arxiv.org/pdf/2412.07935
Copy Paste: [[2412.07935]] Non-Normal Diffusion Models(https://arxiv.org/abs/2412.07935)
Keywords: generative
Abstract: Diffusion models generate samples by incrementally reversing a process that turns data into noise. We show that when the step size goes to zero, the reversed process is invariant to the distribution of these increments. This reveals a previously unconsidered parameter in the design of diffusion models: the distribution of the diffusion step $\Delta x_k := x_{k} - x_{k + 1}$. This parameter is implicitly set by default to be normally distributed in most diffusion models. By lifting this assumption, we generalize the framework for designing diffusion models and establish an expanded class of diffusion processes with greater flexibility in the choice of loss function used during training. We demonstrate the effectiveness of these models on density estimation and generative modeling tasks on standard image datasets, and show that different choices of the distribution of $\Delta x_k$ result in qualitatively different generated samples.
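A rough intuition for why the increment distribution can be treated as a free parameter comes from the central limit theorem. The sketch below is not the paper's method: it only builds random walks from Gaussian, Laplace, and uniform increments (all hypothetical choices, each scaled to zero mean and the same per-step variance) and compares the variance of the endpoints as a stand-in for the small-step limit.

```python
import math
import random

def gaussian(rng):
    """Unit-variance normal increment."""
    return rng.gauss(0.0, 1.0)

def laplace(rng):
    """Unit-variance Laplace increment, as a difference of two exponentials."""
    b = 1.0 / math.sqrt(2.0)  # Laplace scale b has variance 2 * b**2
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def uniform(rng):
    """Unit-variance uniform increment on [-sqrt(3), sqrt(3)]."""
    return rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))

def endpoint_variance(sample, steps, walks, seed=0):
    """Empirical variance of walk endpoints built from `steps` scaled increments."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 / steps)  # each step contributes variance 1 / steps
    ends = [sum(scale * sample(rng) for _ in range(steps)) for _ in range(walks)]
    mean = sum(ends) / walks
    return sum((e - mean) ** 2 for e in ends) / walks

variances = {
    name: endpoint_variance(fn, steps=100, walks=2000)
    for name, fn in [("gaussian", gaussian), ("laplace", laplace), ("uniform", uniform)]
}
for name, var in variances.items():
    print(f"{name:8s} endpoint variance: {var:.3f}")
```

All three walks end with variance close to 1, even though their individual steps look very different; the abstract's claim concerns the reversed process and is stronger than this forward-walk illustration.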

Title: Phase-aware Training Schedule Simplifies Learning in Flow-Based Generative Models

Authors: Santiago Aranguri, Francesco Insulla
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.07972
Pdf URL: https://arxiv.org/pdf/2412.07972
Copy Paste: [[2412.07972]] Phase-aware Training Schedule Simplifies Learning in Flow-Based Generative Models(https://arxiv.org/abs/2412.07972)
Keywords: generative
Abstract: We analyze the training of a two-layer autoencoder used to parameterize a flow-based generative model for sampling from a high-dimensional Gaussian mixture. Previous work shows that the phase where the relative probability between the modes is learned disappears as the dimension goes to infinity without an appropriate time schedule. We introduce a time dilation that solves this problem. This enables us to characterize the learned velocity field, finding a first phase where the probability of each mode is learned and a second phase where the variance of each mode is learned. We find that the autoencoder representing the velocity field learns to simplify by estimating only the parameters relevant to each phase. Turning to real data, we propose a method that, for a given feature, finds intervals of time where training improves accuracy the most on that feature. Since practitioners take a uniform distribution over training times, our method enables more efficient training. We provide preliminary experiments validating this approach.

Title: MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents

Authors: Yun Xing, Nhat Chung, Jie Zhang, Yue Cao, Ivor Tsang, Yang Liu, Lei Ma, Qing Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08014
Pdf URL: https://arxiv.org/pdf/2412.08014
Copy Paste: [[2412.08014]] MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents(https://arxiv.org/abs/2412.08014)
Keywords: generation, generative
Abstract: Physical adversarial attacks in driving scenarios can expose critical vulnerabilities in visual perception models. However, developing such attacks remains challenging due to diverse real-world backgrounds and the requirement for maintaining visual naturality. Building upon this challenge, we reformulate physical adversarial attacks as a one-shot patch-generation problem. Our approach generates adversarial patches through a deep generative model that considers the specific scene context, enabling direct physical deployment in matching environments. The primary challenge lies in simultaneously achieving two objectives: generating adversarial patches that effectively mislead object detection systems while determining contextually appropriate placement within the scene. We propose MAGIC (Mastering Physical Adversarial Generation In Context), a novel framework powered by multi-modal LLM agents to address these challenges. MAGIC automatically understands scene context and orchestrates adversarial patch generation through the synergistic interaction of language and vision capabilities. MAGIC orchestrates three specialized LLM agents: The adv-patch generation agent (GAgent) masters the creation of deceptive patches through strategic prompt engineering for text-to-image models. The adv-patch deployment agent (DAgent) ensures contextual coherence by determining optimal placement strategies based on scene understanding. The self-examination agent (EAgent) completes this trilogy by providing critical oversight and iterative refinement of both processes. We validate our method at both the digital and physical levels, i.e., on nuImage and manually captured real scenes, where both statistical and visual results show that MAGIC is powerful and effective for attacking widely used object detection systems.

Title: NeRF-NQA: No-Reference Quality Assessment for Scenes Generated by NeRF and Neural View Synthesis Methods

Authors: Qiang Qu, Hanxue Liang, Xiaoming Chen, Yuk Ying Chung, Yiran Shen
Subjects: cs.CV, cs.AI, cs.HC, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08029
Pdf URL: https://arxiv.org/pdf/2412.08029
Copy Paste: [[2412.08029]] NeRF-NQA: No-Reference Quality Assessment for Scenes Generated by NeRF and Neural View Synthesis Methods(https://arxiv.org/abs/2412.08029)
Keywords: quality assessment
Abstract: Neural View Synthesis (NVS) has demonstrated efficacy in generating high-fidelity dense viewpoint videos using an image set with sparse views. However, existing quality assessment methods like PSNR, SSIM, and LPIPS are not tailored for the scenes with dense viewpoints synthesized by NVS and NeRF variants; thus, they often fall short in capturing the perceptual quality, including spatial and angular aspects of NVS-synthesized scenes. Furthermore, the lack of dense ground truth views makes full-reference quality assessment on NVS-synthesized scenes challenging. For instance, datasets such as LLFF provide only sparse images, insufficient for complete full-reference assessments. To address the issues above, we propose NeRF-NQA, the first no-reference quality assessment method for densely-observed scenes synthesized from the NVS and NeRF variants. NeRF-NQA employs a joint quality assessment strategy, integrating both viewwise and pointwise approaches, to evaluate the quality of NVS-generated scenes. The viewwise approach assesses the spatial quality of each individual synthesized view and the overall inter-view consistency, while the pointwise approach focuses on the angular qualities of scene surface points and their compound inter-point quality. Extensive evaluations are conducted to compare NeRF-NQA with 23 mainstream visual quality assessment methods (from the fields of image, video, and light-field assessment). The results demonstrate that NeRF-NQA significantly outperforms the existing assessment methods and shows substantial superiority in assessing NVS-synthesized scenes without references. An implementation of this paper is available at this https URL.

Title: DynamicPAE: Generating Scene-Aware Physical Adversarial Examples in Real-Time

Authors: Jin Hu, Xianglong Liu, Jiakai Wang, Junkai Zhang, Xianqi Yang, Haotong Qin, Yuqing Ma, Ke Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08053
Pdf URL: https://arxiv.org/pdf/2412.08053
Copy Paste: [[2412.08053]] DynamicPAE: Generating Scene-Aware Physical Adversarial Examples in Real-Time(https://arxiv.org/abs/2412.08053)
Keywords: generation, generative
Abstract: Physical adversarial examples (PAEs) are regarded as "whistle-blowers" of real-world risks in deep-learning applications. However, current PAE generation studies show limited adaptive attacking ability to diverse and varying scenes. The key challenges in generating dynamic PAEs are exploring their patterns under noisy gradient feedback and adapting the attack to agnostic scenario natures. To address the problems, we present DynamicPAE, the first generative framework that enables scene-aware real-time physical attacks beyond static attacks. Specifically, to train the dynamic PAE generator under noisy gradient feedback, we introduce the residual-driven sample trajectory guidance technique, which redefines the training task to break the limited feedback information restriction that leads to the degeneracy problem. Intuitively, it allows the gradient feedback to be passed to the generator through a low-noise auxiliary task, thereby guiding the optimization away from degenerate solutions and facilitating a more comprehensive and stable exploration of feasible PAEs. To adapt the generator to agnostic scenario natures, we introduce the context-aligned scene expectation simulation process, consisting of the conditional-uncertainty-aligned data module and the skewness-aligned objective re-weighting module. The former enhances robustness in the context of incomplete observation by employing a conditional probabilistic model for domain randomization, while the latter facilitates consistent stealth control across different attack targets by automatically reweighting losses based on the skewness indicator. Extensive digital and physical evaluations demonstrate the superior attack performance of DynamicPAE, attaining a 1.95 $\times$ boost (65.55% average AP drop under attack) on representative object detectors (e.g., Yolo-v8) over state-of-the-art static PAE generating methods.

Title: Federated In-Context LLM Agent Learning

Authors: Panlong Wu, Kangshuo Li, Junbao Nan, Fangxin Wang
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.08054
Pdf URL: https://arxiv.org/pdf/2412.08054
Copy Paste: [[2412.08054]] Federated In-Context LLM Agent Learning(https://arxiv.org/abs/2412.08054)
Keywords: generation
Abstract: Large Language Models (LLMs) have revolutionized intelligent services by enabling logical reasoning, tool use, and interaction with external systems as agents. The advancement of LLMs is frequently hindered by the scarcity of high-quality data, much of which is inherently sensitive. Federated learning (FL) offers a potential solution by facilitating the collaborative training of distributed LLMs while safeguarding private data. However, FL frameworks face significant bandwidth and computational demands, along with challenges from heterogeneous data distributions. The emerging in-context learning capability of LLMs offers a promising approach by aggregating natural language rather than bulky model parameters. Yet, this method risks privacy leakage, as it necessitates the collection and presentation of data samples from various clients during aggregation. In this paper, we propose a novel privacy-preserving Federated In-Context LLM Agent Learning (FICAL) algorithm, which, to the best of our knowledge, is the first work to unleash the power of in-context learning to train diverse LLM agents through FL. In our design, knowledge compendiums generated by a novel LLM-enhanced Knowledge Compendiums Generation (KCG) module are transmitted between clients and the server, instead of the model parameters transmitted in previous FL methods. In addition, we design a Retrieval Augmented Generation (RAG) based Tool Learning and Utilizing (TLU) module and incorporate the aggregated global knowledge compendium as a teacher to teach LLM agents the usage of tools. We conducted extensive experiments, and the results show that FICAL has competitive performance compared to other SOTA baselines with a significant communication cost decrease of $\mathbf{3.33\times10^5}$ times.

Title: Statistical Downscaling via High-Dimensional Distribution Matching with Generative Models

Authors: Zhong Yi Wan, Ignacio Lopez-Gomez, Robert Carver, Tapio Schneider, John Anderson, Fei Sha, Leonardo Zepeda-Núñez
Subjects: cs.LG, math.NA, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.08079
Pdf URL: https://arxiv.org/pdf/2412.08079
Copy Paste: [[2412.08079]] Statistical Downscaling via High-Dimensional Distribution Matching with Generative Models(https://arxiv.org/abs/2412.08079)
Keywords: super-resolution, generative
Abstract: Statistical downscaling is a technique used in climate modeling to increase the resolution of climate simulations. High-resolution climate information is essential for various high-impact applications, including natural hazard risk assessment. However, simulating climate at high resolution is intractable. Thus, climate simulations are often conducted at a coarse scale and then downscaled to the desired resolution. Existing downscaling techniques are either simulation-based methods with high computational costs, or statistical approaches with limitations in accuracy or application specificity. We introduce Generative Bias Correction and Super-Resolution (GenBCSR), a two-stage probabilistic framework for statistical downscaling that overcomes the limitations of previous methods. GenBCSR employs two transformations to match high-dimensional distributions at different resolutions: (i) the first stage, bias correction, aligns the distributions at coarse scale, (ii) the second stage, statistical super-resolution, lifts the corrected coarse distribution by introducing fine-grained details. Each stage is instantiated by a state-of-the-art generative model, resulting in an efficient and effective computational pipeline for the well-studied distribution matching problem. By framing the downscaling problem as distribution matching, GenBCSR relaxes the constraints of supervised learning, which requires samples to be aligned. Despite not requiring such correspondence, we show that GenBCSR surpasses standard approaches in predictive accuracy of critical impact variables, particularly in predicting the tails (99th percentile) of composite indexes composed of interacting variables, achieving up to a 4- to 5-fold error reduction.

Title: Generative Zoo

Authors: Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Silvia Zuffi, Peter Kulits
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08101
Pdf URL: https://arxiv.org/pdf/2412.08101
Copy Paste: [[2412.08101]] Generative Zoo(https://arxiv.org/abs/2412.08101)
Keywords: generation, generative
Abstract: The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters. To demonstrate the scalability of our approach, we introduce GenZoo, a synthetic dataset containing one million images of distinct subjects. We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world animal pose and shape estimation benchmark, despite being trained solely on synthetic data.

Title: Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models

Authors: Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, Hassan Sajjad
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08111
Pdf URL: https://arxiv.org/pdf/2412.08111
Copy Paste: [[2412.08111]] Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models(https://arxiv.org/abs/2412.08111)
Keywords: generation
Abstract: Vision-language models (VLMs) serve as foundation models for multi-modal applications such as image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding, though the underlying reasons for these limitations remain unclear. In this work, we aim to address this gap by analyzing the syntactic information, one of the fundamental linguistic properties, encoded by the text encoders of VLMs. We perform a thorough analysis comparing VLMs with different objective functions, parameter sizes, and training data sizes, and with uni-modal language models (ULMs), in their ability to encode syntactic knowledge. Our findings suggest that ULM text encoders acquire syntactic information more effectively than those in VLMs. The syntactic information learned by VLM text encoders is shaped primarily by the pre-training objective, which plays a more crucial role than other factors such as model architecture, model size, or the volume of pre-training data. Models exhibit different layer-wise trends: CLIP performance drops across layers, whereas in other models the middle layers are rich in syntactic knowledge.

Title: Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Authors: Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08125
Pdf URL: https://arxiv.org/pdf/2412.08125
Copy Paste: [[2412.08125]] Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models(https://arxiv.org/abs/2412.08125)
Keywords: generation
Abstract: Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.

Title: DiffRaman: A Conditional Latent Denoising Diffusion Probabilistic Model for Bacterial Raman Spectroscopy Identification Under Limited Data Conditions

Authors: Haiming Yao, Wei Luo, Ang Gao, Tao Zhou, Xue Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08131
Pdf URL: https://arxiv.org/pdf/2412.08131
Copy Paste: [[2412.08131]] DiffRaman: A Conditional Latent Denoising Diffusion Probabilistic Model for Bacterial Raman Spectroscopy Identification Under Limited Data Conditions(https://arxiv.org/abs/2412.08131)
Keywords: generation, generative
Abstract: Raman spectroscopy has attracted significant attention in various biochemical detection fields, especially in the rapid identification of pathogenic bacteria. The integration of this technology with deep learning to facilitate automated bacterial Raman spectroscopy diagnosis has emerged as a key focus in recent research. However, the diagnostic performance of existing deep learning methods depends largely on a sufficiently large dataset, and in scenarios where Raman spectroscopy data are scarce, the data are inadequate to fully optimize the numerous parameters of deep neural networks. To address these challenges, this paper proposes a data generation method utilizing deep generative models to expand the data volume and enhance the recognition accuracy of bacterial Raman spectra. Specifically, we introduce DiffRaman, a conditional latent denoising diffusion probabilistic model for Raman spectra generation. Experimental results demonstrate that synthetic bacterial Raman spectra generated by DiffRaman can effectively emulate real experimental spectra, thereby enhancing the performance of diagnostic models, especially under conditions of limited data. Furthermore, compared to existing generative models, the proposed DiffRaman offers improvements in both generation quality and computational efficiency. Our DiffRaman approach provides a well-suited solution for automated bacterial Raman spectroscopy diagnosis in data-scarce scenarios, offering new insights into alleviating the labor of spectroscopic measurements and enhancing rare bacteria identification.

Title: AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image Inpainting

Authors: Zihao Han, Baoquan Zhang, Lisai Zhang, Shanshan Feng, Kenghong Lin, Guotao Liang, Yunming Ye, Xiaochen Qi, Guangming Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08149
Pdf URL: https://arxiv.org/pdf/2412.08149
Copy Paste: [[2412.08149]] AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image Inpainting(https://arxiv.org/abs/2412.08149)
Keywords: restoration, generation
Abstract: Image inpainting is an important image generation task, which aims to restore a corrupted image from its partially visible area. Recently, diffusion Schrödinger bridge methods have effectively tackled this task by modeling the translation between corrupted and target images as a diffusion Schrödinger bridge process along a noising schedule path. Although these methods have shown superior performance, in this paper, we find that 1) existing methods suffer from a schedule-restoration mismatch, i.e., there is usually a large discrepancy between the theoretical schedule and the practical restoration process, which theoretically results in the schedule not being fully leveraged for restoring images; and 2) the key cause of this issue is that the restoration processes of different pixels are actually asynchronous, but existing methods apply a synchronous noise schedule to them, i.e., all pixels share the same noise schedule. To this end, we propose a schedule-Asynchronous Diffusion Schrödinger Bridge (AsyncDSB) for image inpainting. Our insight is to preferentially schedule pixels with high frequency (i.e., large gradients) and then those with low frequency (i.e., small gradients). Based on this insight, given a corrupted image, we first train a network to predict its gradient map in the corrupted area. Then, we regard the predicted image gradient as a prior and design a simple yet effective pixel-asynchronous noise schedule strategy to enhance the diffusion Schrödinger bridge. Thanks to the asynchronous schedule at the pixel level, the temporal interdependence of the restoration process between pixels can be fully characterized for high-quality image inpainting. Experiments on real-world datasets show that our AsyncDSB achieves superior performance, especially on FID, with around 3% - 14% improvement over state-of-the-art baseline methods.

Title: Analyzing and Improving Model Collapse in Rectified Flow Models

Authors: Huminhao Zhu, Fangyikang Wang, Tianyu Ding, Qing Qu, Zhihui Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08175
Pdf URL: https://arxiv.org/pdf/2412.08175
Copy Paste: [[2412.08175]] Analyzing and Improving Model Collapse in Rectified Flow Models(https://arxiv.org/abs/2412.08175)
Keywords: generation, generative
Abstract: Generative models aim to produce synthetic data indistinguishable from real distributions, but iterative training on self-generated data can lead to model collapse (MC), where performance degrades over time. In this work, we provide the first theoretical analysis of MC in Rectified Flow by framing it within the context of Denoising Autoencoders (DAEs). We show that when DAE models are trained on recursively generated synthetic data with small noise variance, they suffer from MC with progressively diminishing generation quality. To address this MC issue, we propose methods that strategically incorporate real data into the training process, even when direct noise-image pairs are unavailable. Our proposed techniques, including Reverse Collapse-Avoiding (RCA) Reflow and Online Collapse-Avoiding Reflow (OCAR), effectively prevent MC while maintaining the efficiency benefits of Rectified Flow. Extensive experiments on standard image datasets demonstrate that our methods not only mitigate MC but also improve sampling efficiency, leading to higher-quality image generation with fewer sampling steps.

Title: GN-FR: Generalizable Neural Radiance Fields for Flare Removal

Authors: Gopi Raju Matta, Rahul Siddartha, Rongali Simhachala Venkata Girish, Sumit Sharma, Kaushik Mitra
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08200
Pdf URL: https://arxiv.org/pdf/2412.08200
Copy Paste: [[2412.08200]] GN-FR: Generalizable Neural Radiance Fields for Flare Removal(https://arxiv.org/abs/2412.08200)
Keywords: generation
Abstract: Flare, an optical phenomenon resulting from unwanted scattering and reflections within a lens system, presents a significant challenge in imaging. The diverse patterns of flares, such as halos, streaks, color bleeding, and haze, complicate the flare removal process. Existing traditional and learning-based methods have exhibited limited efficacy due to their reliance on single-image approaches, where flare removal is highly ill-posed. We address this by framing flare removal as a multi-view image problem, taking advantage of the view-dependent nature of flare artifacts. This approach leverages information from neighboring views to recover details obscured by flare in individual images. Our proposed framework, GN-FR (Generalizable Neural Radiance Fields for Flare Removal), can render flare-free views from a sparse set of input images affected by lens flare and generalizes across different scenes in an unsupervised manner. GN-FR incorporates several modules within the Generalizable NeRF Transformer (GNT) framework: Flare-occupancy Mask Generation (FMG), View Sampler (VS), and Point Sampler (PS). To overcome the impracticality of capturing both flare-corrupted and flare-free data, we introduce a masking loss function that utilizes mask information in an unsupervised setting. Additionally, we present a 3D multi-view flare dataset, comprising 17 real flare scenes with 782 images, 80 real flare patterns, and their corresponding annotated flare-occupancy masks. To our knowledge, this is the first work to address flare removal within a Neural Radiance Fields (NeRF) framework.

Title: Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

Authors: Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08221
Pdf URL: https://arxiv.org/pdf/2412.08221
Copy Paste: [[2412.08221]] Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming(https://arxiv.org/abs/2412.08221)
Keywords: generation
Abstract: DALL-E and Sora have gained attention by producing implausible images, such as "astronauts riding a horse in space." Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages 'scene graph programming', a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Using these structured representations, Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models through standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle with balancing dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. We demonstrate the effectiveness of Generate Any Scene by conducting three practical applications leveraging captions generated by Generate Any Scene: 1) a self-improving framework where models iteratively enhance their performance using generated data, 2) a distillation process to transfer specific strengths from proprietary models to open-source counterparts, and 3) improvements in content moderation by identifying and generating challenging synthetic data.

Title: Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations

Authors: Nikil Roashan Selvam, Amil Merchant, Stefano Ermon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08292
Pdf URL: https://arxiv.org/pdf/2412.08292
Copy Paste: [[2412.08292]] Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations(https://arxiv.org/abs/2412.08292)
Keywords: generation
Abstract: In diffusion models, samples are generated through an iterative refinement process, requiring hundreds of sequential model evaluations. Several recent methods have introduced approximations (fewer discretization steps or distillation) to gain speed at the cost of sample quality. In contrast, we introduce Self-Refining Diffusion Samplers (SRDS) that retain sample quality and can improve latency at the cost of additional parallel compute. We take inspiration from the Parareal algorithm, a popular numerical method for parallel-in-time integration of differential equations. In SRDS, a quick but rough estimate of a sample is first created and then iteratively refined in parallel through Parareal iterations. SRDS is not only guaranteed to accurately solve the ODE and converge to the serial solution but also benefits from parallelization across the diffusion trajectory, enabling batched inference and pipelining. As we demonstrate for pre-trained diffusion models, the early convergence of this refinement procedure drastically reduces the number of steps required to produce a sample, speeding up generation, for instance, by up to 1.7x on a 25-step StableDiffusion-v2 benchmark and up to 4.3x on longer trajectories.
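The Parareal scheme that the SRDS abstract builds on can be sketched on a toy problem. The following is a minimal illustration only, assuming a scalar ODE dy/dt = -y in place of a diffusion sampler, explicit Euler for both propagators, and hypothetical function names (`fine`, `coarse`, `parareal`, `serial_fine`); none of this is taken from the paper's implementation.

```python
def fine(y, t0, t1, substeps=100):
    """Accurate but slow propagator: many explicit Euler steps of dy/dt = -y."""
    h = (t1 - t0) / substeps
    for _ in range(substeps):
        y = y + h * (-y)
    return y


def coarse(y, t0, t1):
    """Cheap propagator: a single explicit Euler step of dy/dt = -y."""
    return y + (t1 - t0) * (-y)


def parareal(y0, t_end, n_intervals, n_iters):
    """Return the trajectory at the interval boundaries after n_iters refinements."""
    ts = [t_end * i / n_intervals for i in range(n_intervals + 1)]
    # Quick, rough initial trajectory from the coarse propagator.
    u = [y0]
    for n in range(n_intervals):
        u.append(coarse(u[n], ts[n], ts[n + 1]))
    for _ in range(n_iters):
        # These evaluations depend only on the previous iterate,
        # so the expensive fine solves could run in parallel.
        f_old = [fine(u[n], ts[n], ts[n + 1]) for n in range(n_intervals)]
        g_old = [coarse(u[n], ts[n], ts[n + 1]) for n in range(n_intervals)]
        # Sequential sweep: coarse prediction plus fine-minus-coarse correction.
        new = [y0]
        for n in range(n_intervals):
            new.append(coarse(new[n], ts[n], ts[n + 1]) + f_old[n] - g_old[n])
        u = new
    return u


def serial_fine(y0, t_end, n_intervals):
    """Reference trajectory: the fine propagator applied interval by interval."""
    ts = [t_end * i / n_intervals for i in range(n_intervals + 1)]
    u = [y0]
    for n in range(n_intervals):
        u.append(fine(u[n], ts[n], ts[n + 1]))
    return u
```

With as many iterations as intervals, the Parareal trajectory matches the serial fine solution up to rounding; the practical benefit comes from stopping much earlier, once the iterates stop changing.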

Title: Video Summarization using Denoising Diffusion Probabilistic Model

Authors: Zirui Shang, Yubo Zhu, Hongxi Li, Shuo Yang, Xinxiao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08357
Pdf URL: https://arxiv.org/pdf/2412.08357
Copy Paste: [[2412.08357]] Video Summarization using Denoising Diffusion Probabilistic Model(https://arxiv.org/abs/2412.08357)
Keywords: generative
Abstract: Video summarization aims to eliminate visual redundancy while retaining key parts of video to construct concise and comprehensive synopses. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to annotation inconsistency caused by the inherent subjectivity of different annotators when annotating the same video. In this paper, we introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction, and generates summaries by iterative denoising. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability. Moreover, to facilitate training DDPM with limited data, we employ an unsupervised video summarization model to implement the earlier denoising process. Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.
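For reference, the noise-prediction training that the abstract mentions can be written in the standard DDPM form. This is the generic formulation, not necessarily the exact variant used in the paper; here $x_0$ is assumed to stand for the quantity being generated (for example, frame-level importance scores), and how the model is conditioned on video features is not specified in the abstract.

```latex
% Forward (noising) process with variance schedule \beta_1, \dots, \beta_T
\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s), \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)

% Simplified noise-prediction objective for the denoiser \epsilon_\theta
L_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right]
```

Sampling then starts from Gaussian noise $x_T$ and applies the learned denoiser step by step down to $x_0$, which corresponds to the "iterative denoising" in the abstract.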

Title: Adversarial Purification by Consistency-aware Latent Space Optimization on Data Manifolds

Authors: Shuhai Zhang, Jiahao Yang, Hui Luo, Jie Chen, Li Wang, Feng Liu, Bo Han, Mingkui Tan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08394
Pdf URL: https://arxiv.org/pdf/2412.08394
Copy Paste: [[2412.08394]] Adversarial Purification by Consistency-aware Latent Space Optimization on Data Manifolds(https://arxiv.org/abs/2412.08394)
Keywords: restoration, generative
Abstract: Deep neural networks (DNNs) are vulnerable to adversarial samples crafted by adding imperceptible perturbations to clean data, potentially leading to incorrect and dangerous predictions. Adversarial purification has been an effective means to improve DNN robustness by removing these perturbations before feeding the data into the model. However, it faces significant challenges in preserving key structural and semantic information of data, as the imperceptible nature of adversarial perturbations makes it hard to avoid over-correcting, which can destroy important information and degrade model performance. In this paper, we break away from traditional adversarial purification methods by focusing on the clean data manifold. To this end, we reveal that samples generated by a well-trained generative model are close to clean ones but far from adversarial ones. Leveraging this insight, we propose Consistency Model-based Adversarial Purification (CMAP), which optimizes vectors within the latent space of a pre-trained consistency model to generate samples for restoring clean data. Specifically, 1) we propose a Perceptual consistency restoration mechanism that minimizes the discrepancy between generated samples and input samples in both pixel and perceptual spaces. 2) To maintain the optimized latent vectors within the valid data manifold, we introduce a Latent distribution consistency constraint strategy to align generated samples with the clean data distribution. 3) We also apply a Latent vector consistency prediction scheme via an ensemble approach to enhance prediction reliability. CMAP fundamentally addresses adversarial perturbations at their source, providing a robust purification. Extensive experiments on CIFAR-10 and ImageNet-100 show that our CMAP significantly enhances robustness against strong adversarial attacks while preserving high natural accuracy.

Title: Pysical Informed Driving World Model

Authors: Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08410
Pdf URL: https://arxiv.org/pdf/2412.08410
Copy Paste: [[2412.08410]] Pysical Informed Driving World Model(https://arxiv.org/abs/2412.08410)
Keywords: generation
Abstract: Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks like 3D object detection, segmentation and trajectory prediction. While world models provide a cost-effective solution for generating realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles, such as relative and absolute motion, spatial relationships such as occlusion and spatial consistency, and temporal consistency. To address these, we propose DrivePhysica, an innovative model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in physical principles, we achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks.

Title: Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views

Authors: Songchun Zhang, Chunhui Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08412
Pdf URL: https://arxiv.org/pdf/2412.08412
Copy Paste: [[2412.08412]] Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views(https://arxiv.org/abs/2412.08412)
Keywords: generative
Abstract: Inferring 3D structures from sparse, unposed observations is challenging due to the unconstrained nature of the problem. Recent methods propose to predict implicit representations directly from unposed inputs in a data-driven manner, achieving promising results. However, these methods do not utilize geometric priors and cannot hallucinate the appearance of unseen regions, thus making it challenging to reconstruct fine geometric and textural details. To tackle this challenge, our key idea is to reformulate this ill-posed problem as conditional novel view synthesis, aiming to generate complete observations from limited input views to facilitate reconstruction. With complete observations, the poses of the input views can be easily recovered and further used to optimize the reconstructed object. To this end, we propose a novel pipeline, Pragmatist. First, we generate a complete observation of the object via a multiview conditional diffusion model. Then, we use a feed-forward large reconstruction model to obtain the reconstructed mesh. To further improve the reconstruction quality, we recover the poses of input views by inverting the obtained 3D representations and further optimize the texture using detailed input views. Unlike previous approaches, our pipeline improves reconstruction by efficiently leveraging unposed inputs and generative priors, circumventing the direct resolution of highly ill-posed problems. Extensive experiments show that our approach achieves promising performance in several benchmarks.

Title: Federated Learning for Traffic Flow Prediction with Synthetic Data Augmentation

Authors: Fermin Orozco, Pedro Porto Buarque de Gusmão, Hongkai Wen, Johan Wahlström, Man Luo
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2412.08460
Pdf URL: https://arxiv.org/pdf/2412.08460
Copy Paste: [[2412.08460]] Federated Learning for Traffic Flow Prediction with Synthetic Data Augmentation(https://arxiv.org/abs/2412.08460)
Keywords: generation
Abstract: Deep-learning based traffic prediction models require vast amounts of data to learn embedded spatial and temporal dependencies. The inherent privacy and commercial sensitivity of such data has encouraged a shift towards decentralised data-driven methods, such as Federated Learning (FL). Under a traditional Machine Learning paradigm, traffic flow prediction models can capture spatial and temporal relationships within centralised data. In reality, traffic data is likely distributed across separate data silos owned by multiple stakeholders. In this work, a cross-silo FL setting is motivated to facilitate stakeholder collaboration for optimal traffic flow prediction applications. This work introduces an FL framework, referred to as FedTPS, to generate synthetic data to augment each client's local dataset by training a diffusion-based trajectory generation model through FL. The proposed framework is evaluated on a large-scale real world ride-sharing dataset using various FL methods and Traffic Flow Prediction models, including a novel prediction model we introduce, which leverages Temporal and Graph Attention mechanisms to learn the Spatio-Temporal dependencies embedded within regional traffic flow data. Experimental results show that FedTPS outperforms multiple other FL baselines with respect to global model performance.
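As background for the cross-silo FL setting, the sketch below shows sample-size-weighted parameter averaging (FedAvg), one widely used aggregation rule. The abstract does not say which aggregation rule FedTPS relies on, so this is an assumption for illustration; the function name `fedavg` and the flat-list parameter representation are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical silos: the second holds three times as much data,
# so its parameters dominate the global average.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)  # [2.5, 3.5]
```

In a full round, each silo would train locally on its own (possibly synthetically augmented) data, send its parameters to the server, and receive the averaged parameters back; raw traffic data never leaves the silo.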

Title: CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis

Authors: Mu Zhang, Yunfan Liu, Yue Liu, Hongtian Yu, Qixiang Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08464
Pdf URL: https://arxiv.org/pdf/2412.08464
Copy Paste: [[2412.08464]] CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis(https://arxiv.org/abs/2412.08464)
Keywords: generation
Abstract: Accurately depicting real-world landscapes in remote sensing (RS) images requires precise alignment between objects and their environment. However, most existing synthesis methods for natural images prioritize foreground control, often reducing the background to plain textures. This neglects the interaction between foreground and background, which can lead to incoherence in RS scenarios. In this paper, we introduce CC-Diff, a Diffusion Model-based approach for RS image generation with enhanced Context Coherence. To capture spatial interdependence, we propose a sequential pipeline where background generation is conditioned on synthesized foreground instances. Distinct learnable queries are also employed to model both the complex background texture and its semantic relation to the foreground. Extensive experiments demonstrate that CC-Diff outperforms state-of-the-art methods in visual fidelity, semantic accuracy, and positional precision, excelling in both RS and natural image domains. CC-Diff also shows strong trainability, improving detection accuracy by 2.04 mAP on DOTA and 2.25 mAP on the COCO benchmark.

Title: Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08467
Pdf URL: https://arxiv.org/pdf/2412.08467
Copy Paste: [[2412.08467]] Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel(https://arxiv.org/abs/2412.08467)
Keywords: generation
Abstract: Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial data pool for training a base navigator, followed by applying the trained navigator to filter the data pool. This leads to higher-fidelity data to train a better generator, which can, in turn, produce higher-quality data for training the next-round navigator. Such a flywheel establishes a data self-refining process, yielding a continuously improved and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art methods by a large margin in all cases.

Title: InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models

Authors: Min Hou, Yueying Wu, Chang Xu, Yu-Hao Huang, Chenxi Bai, Le Wu, Jiang Bian
Subjects: cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08480
Pdf URL: https://arxiv.org/pdf/2412.08480
Copy Paste: [[2412.08480]] InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models(https://arxiv.org/abs/2412.08480)
Keywords: generation, generative
Abstract: As one of the most successful generative models, diffusion models have demonstrated remarkable efficacy in synthesizing high-quality images. These models learn the underlying high-dimensional data distribution in an unsupervised manner. Despite their success, diffusion models are highly data-driven and prone to inheriting the imbalances and biases present in real-world data. Some studies have attempted to address these issues by designing text prompts for known biases or using bias labels to construct unbiased data. While these methods have shown improved results, real-world scenarios often contain various unknown biases, and obtaining bias labels is particularly challenging. In this paper, we emphasize the necessity of mitigating bias in pre-trained diffusion models without relying on auxiliary bias annotations. To tackle this problem, we propose a framework, InvDiff, which aims to learn invariant semantic information for diffusion guidance. Specifically, we propose identifying underlying biases in the training data and designing a novel debiasing training objective. Then, we employ a lightweight trainable module that automatically preserves invariant semantic information and uses it to guide the diffusion model's sampling process toward unbiased outcomes simultaneously. Notably, we only need to learn a small number of parameters in the lightweight learnable module without altering the pre-trained diffusion model. Furthermore, we provide a theoretical guarantee that the implementation of InvDiff is equivalent to reducing the error upper bound of generalization. Extensive experimental results on three publicly available benchmarks demonstrate that InvDiff effectively reduces biases while maintaining the quality of image generation. Our code is available at this https URL.

Title: Learning Flow Fields in Attention for Controllable Person Image Generation

Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08486
Pdf URL: https://arxiv.org/pdf/2412.08486
Copy Paste: [[2412.08486]] Learning Flow Fields in Attention for Controllable Person Image Generation(https://arxiv.org/abs/2412.08486)
Keywords: generation
Abstract: Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we thereby propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.

Title: StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Authors: Mingkun Lei, Xue Song, Beier Zhu, Hao Wang, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08503
Pdf URL: https://arxiv.org/pdf/2412.08503
Copy Paste: [[2412.08503]] StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements(https://arxiv.org/abs/2412.08503)
Keywords: generation
Abstract: Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly with overfitting to reference styles, limiting stylistic control, and misaligning with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.

Title: Watermarking Training Data of Music Generation Models

Authors: Pascal Epple, Igor Shilov, Bozhidar Stevanovski, Yves-Alexandre de Montjoye
Subjects: cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.08549
Pdf URL: https://arxiv.org/pdf/2412.08549
Copy Paste: [[2412.08549]] Watermarking Training Data of Music Generation Models(https://arxiv.org/abs/2412.08549)
Keywords: generation, generative
Abstract: Generative Artificial Intelligence (Gen-AI) models are increasingly used to produce content across domains, including text, images, and audio. While these models represent a major technical breakthrough, they gain their generative capabilities from being trained on enormous amounts of human-generated content, which often includes copyrighted material. In this work, we investigate whether audio watermarking techniques can be used to detect an unauthorized usage of content to train a music generation model. We compare outputs generated by a model trained on watermarked data to a model trained on non-watermarked data. We study factors that impact the model's generation behaviour: the watermarking technique, the proportion of watermarked samples in the training set, and the robustness of the watermarking technique against the model's tokenizer. Our results show that audio watermarking techniques, including some that are imperceptible to humans, can lead to noticeable shifts in the model's outputs. We also study the robustness of a state-of-the-art watermarking technique to removal techniques.

Title: Can We Generate Visual Programs Without Prompting LLMs?

Authors: Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08564
Pdf URL: https://arxiv.org/pdf/2412.08564
Copy Paste: [[2412.08564]] Can We Generate Visual Programs Without Prompting LLMs?(https://arxiv.org/abs/2412.08564)
Keywords: generation
Abstract: Visual programming prompts LLMs (large language models) to generate executable code for visual tasks like visual question answering (VQA). Prompt-based methods are difficult to improve while also being unreliable and costly in both time and money. Our goal is to develop an efficient visual programming system without 1) using prompt-based LLMs at inference time and 2) a large set of program and answer annotations. We develop a synthetic data augmentation approach and alternative program generation method based on decoupling programs into higher-level skills called templates and the corresponding arguments. Our results show that with data augmentation, prompt-free smaller LLMs ($\approx$ 1B parameters) are competitive with state-of-the-art models with the added benefit of much faster inference.

Title: GenPlan: Generative sequence models as adaptive planners

Authors: Akash Karthikeyan, Yash Vardhan Pant
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08565
Pdf URL: https://arxiv.org/pdf/2412.08565
Copy Paste: [[2412.08565]] GenPlan: Generative sequence models as adaptive planners(https://arxiv.org/abs/2412.08565)
Keywords: generative
Abstract: Offline reinforcement learning has shown tremendous success in behavioral planning by learning from previously collected demonstrations. However, decision-making in multitask missions still presents significant challenges. For instance, a mission might require an agent to explore an unknown environment, discover goals, and navigate to them, even if it involves interacting with obstacles along the way. Such behavioral planning problems are difficult to solve due to: a) agents failing to adapt beyond the single task learned through their reward function, and b) the inability to generalize to new environments not covered in the training demonstrations, e.g., environments where all doors were unlocked in the demonstrations. Consequently, state-of-the-art decision making methods are limited to missions where the required tasks are well-represented in the training demonstrations and can be solved within a short (temporal) planning horizon. To address this, we propose GenPlan: a stochastic and adaptive planner that leverages discrete-flow models for generative sequence modeling, enabling sample-efficient exploration and exploitation. This framework relies on an iterative denoising procedure to generate a sequence of goals and actions. This approach captures multi-modal action distributions and facilitates goal and task discovery, thereby enhancing generalization to out-of-distribution tasks and environments, i.e., missions not part of the training data. We demonstrate the effectiveness of our method through multiple simulation environments. Notably, GenPlan outperforms the state-of-the-art methods by over 10% on adaptive planning tasks, where the agent adapts to multi-task missions while leveraging demonstrations on single-goal-reaching tasks.

Title: TryOffAnyone: Tiled Cloth Generation from a Dressed Person

Authors: Ioannis Xarchakos, Theodoros Koukopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08573
Pdf URL: https://arxiv.org/pdf/2412.08573
Copy Paste: [[2412.08573]] TryOffAnyone: Tiled Cloth Generation from a Dressed Person(https://arxiv.org/abs/2412.08573)
Keywords: generation
Abstract: The fashion industry is increasingly leveraging computer vision and deep learning technologies to enhance online shopping experiences and operational efficiencies. In this paper, we address the challenge of generating high-fidelity tiled garment images essential for personalized recommendations, outfit composition, and virtual try-on systems from photos of garments worn by models. Inspired by the success of Latent Diffusion Models (LDMs) in image-to-image translation, we propose a novel approach utilizing a fine-tuned StableDiffusion model. Our method features a streamlined single-stage network design, which integrates garment-specific masks to isolate and process target clothing items effectively. By simplifying the network architecture through selective training of transformer blocks and removing unnecessary cross-attention layers, we significantly reduce computational complexity while achieving state-of-the-art performance on benchmark datasets like VITON-HD. Experimental results demonstrate the effectiveness of our approach in producing high-quality tiled garment images for both full-body and half-body inputs. Code and model are available at: this https URL

Title: LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

Authors: Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08580
Pdf URL: https://arxiv.org/pdf/2412.08580
Copy Paste: [[2412.08580]] LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations(https://arxiv.org/abs/2412.08580)
Keywords: generation
Abstract: Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing datasets of image-text pairs, which lack precise inter-object relationship annotations with prompts only. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show advanced models trained on our LAION-SG boast significant performance improvements in complex scene generation over models on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.

Title: Fair Primal Dual Splitting Method for Image Inverse Problems

Authors: Yunfei Qu, Deren Han
Subjects: cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2412.08613
Pdf URL: https://arxiv.org/pdf/2412.08613
Copy Paste: [[2412.08613]] Fair Primal Dual Splitting Method for Image Inverse Problems(https://arxiv.org/abs/2412.08613)
Keywords: super-resolution
Abstract: Image inverse problems have numerous applications, including image processing, super-resolution, and computer vision, which are important areas in image science. These application models can be seen as a three-function composite optimization problem solvable by a variety of primal dual-type methods. We propose a fair primal dual algorithmic framework that incorporates the smooth term not only into the primal subproblem but also into the dual subproblem. We unify the global convergence and establish the convergence rates of our proposed fair primal dual method. Experiments on image denoising and super-resolution reconstruction demonstrate the superiority of the proposed method over the current state-of-the-art.

Title: GPD-1: Generative Pre-training for Driving

Authors: Zixun Xie, Sicheng Zuo, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, Jie Zhou, Jiwen Lu, Shanghang Zhang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.08643
Pdf URL: https://arxiv.org/pdf/2412.08643
Copy Paste: [[2412.08643]] GPD-1: Generative Pre-training for Driving(https://arxiv.org/abs/2412.08643)
Keywords: generation, generative
Abstract: Modeling the evolutions of driving scenarios is important for the evaluation and decision-making of autonomous driving systems. Most existing methods focus on one aspect of scene evolution such as map generation, motion prediction, and trajectory planning. In this paper, we propose a unified Generative Pre-training for Driving (GPD-1) model to accomplish all these tasks altogether without additional fine-tuning. We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem. We adopt the autoregressive transformer architecture and use a scene-level attention mask to enable intra-scene bi-directional interactions. For the ego and agent tokens, we propose a hierarchical positional tokenizer to effectively encode both 2D positions and headings. For the map tokens, we train a map vector-quantized autoencoder to efficiently compress ego-centric semantic maps into discrete tokens. We pre-train our GPD-1 on the large-scale nuPlan dataset and conduct extensive experiments to evaluate its effectiveness. With different prompts, our GPD-1 successfully generalizes to various tasks without finetuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning. Code: this https URL.

Title: ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

Authors: Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08645
Pdf URL: https://arxiv.org/pdf/2412.08645
Copy Paste: [[2412.08645]] ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation(https://arxiv.org/abs/2412.08645)
Keywords: generation
Abstract: This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using a single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Differently from many other multi-reference methods, ObjectMate does not require slow test-time tuning.