2024-11-27

Title: Leveraging Conversational Generative AI for Anomaly Detection in Digital Substations

Authors: Aydin Zaboli, Seong Lok Choi, Junho Hong
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2411.16692
Pdf URL: https://arxiv.org/pdf/2411.16692
Copy Paste: [[2411.16692]] Leveraging Conversational Generative AI for Anomaly Detection in Digital Substations(https://arxiv.org/abs/2411.16692)
Keywords: generative
Abstract: This study addresses critical challenges of cybersecurity in digital substations by proposing an innovative task-oriented dialogue (ToD) system for anomaly detection (AD) in multicast messages, specifically, generic object oriented substation event (GOOSE) and sampled value (SV) datasets. Leveraging generative artificial intelligence (GenAI) technology, the proposed framework demonstrates superior error reduction, scalability, and adaptability compared with traditional human-in-the-loop (HITL) processes. Notably, this methodology offers significant advantages over machine learning (ML) techniques in terms of efficiency and implementation speed when confronting novel and/or unknown cyber threats, while also maintaining model complexity and precision. The research employs advanced performance metrics to conduct a comparative assessment between the proposed AD and HITL-based AD frameworks, utilizing a hardware-in-the-loop (HIL) testbed for generating and extracting features of IEC61850 communication messages. This approach presents a promising solution for enhancing the reliability of power system operations in the face of evolving cybersecurity challenges.
摘要：本研究提出了一种创新的任务导向对话 (ToD) 系统，用于多播消息中的异常检测 (AD)，具体来说，是通用的面向对象变电站事件 (GOOSE) 和采样值 (SV) 数据集，从而解决了数字变电站网络安全的关键挑战。利用生成人工智能 (GenAI) 技术，与传统的人在环 (HITL) 流程相比，所提出的框架表现出卓越的错误减少、可扩展性和适应性。值得注意的是，在面对新的和/或未知的网络威胁时，这种方法在效率和实施速度方面比机器学习 (ML) 技术具有显着优势，同时还保持了模型的复杂性和精度。该研究采用先进的性能指标对所提出的 AD 和基于 HITL 的 AD 框架进行比较评估，利用硬件在环 (HIL) 测试平台来生成和提取 IEC61850 通信消息的特征。面对不断发展的网络安全挑战，这种方法为提高电力系统运行的可靠性提供了一种有希望的解决方案。

Title: Conditional Text-to-Image Generation with Reference Guidance

Authors: Taewook Kim, Ze Wang, Zhengyuan Yang, Jiang Wang, Lijuan Wang, Zicheng Liu, Qiang Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16713
Pdf URL: https://arxiv.org/pdf/2411.16713
Copy Paste: [[2411.16713]] Conditional Text-to-Image Generation with Reference Guidance(https://arxiv.org/abs/2411.16713)
Keywords: generation
Abstract: Text-to-image diffusion models have demonstrated tremendous success in synthesizing visually stunning images given textual instructions. Despite remarkable progress in creating high-fidelity visuals, text-to-image models can still struggle with precisely rendering subjects, such as text spelling. To address this challenge, this paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate. In addition, this reference condition empowers the model to be conditioned in ways that the vocabularies of the text tokenizer cannot adequately represent, and further extends the model's generalization to novel capabilities such as generating non-English text spellings. We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references. Each plugin is trained with auxiliary networks and loss functions customized for applications such as English scene-text generation, multi-lingual scene-text generation, and logo-image generation. Our expert plugins, each containing only 28.55M trainable parameters, outperform existing methods on all tasks.
摘要：文本到图像的扩散模型在根据文本指令合成视觉效果极佳的图像方面取得了巨大成功。尽管在创建高保真视觉效果方面取得了显著进展，但文本到图像模型在精确渲染主题（例如文本拼写）方面仍然举步维艰。为了应对这一挑战，本文探讨了使用图像的附加条件，为扩散模型生成特定主题提供视觉指导。此外，此参考条件使模型能够以文本标记器的词汇无法充分表示的方式进行调节，并进一步将模型的泛化扩展到生成非英语文本拼写等新功能。我们开发了几个小规模专家插件，有效地赋予稳定扩散模型采用不同参考的能力。每个插件都使用辅助网络和损失函数进行训练，这些函数针对英语场景文本生成、多语言场景文本生成和徽标图像生成等应用进行了定制。我们的专家插件在所有任务上都表现出比现有方法更好的结果，每个插件仅包含 28.55M 个可训练参数。

Title: TPIE: Topology-Preserved Image Editing With Text Instructions

Authors: Nivetha Jayakumar, Srivardhan Reddy Gadila, Tonmoy Hossain, Yangfeng Ji, Miaomiao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16714
Pdf URL: https://arxiv.org/pdf/2411.16714
Copy Paste: [[2411.16714]] TPIE: Topology-Preserved Image Editing With Text Instructions(https://arxiv.org/abs/2411.16714)
Keywords: generative
Abstract: Preserving topological structures is important in real-world applications, particularly in sensitive domains such as healthcare and medicine, where the correctness of human anatomy is critical. However, most existing image editing models focus on manipulating intensity and texture features, often overlooking object geometry within images. To address this issue, this paper introduces a novel method, Topology-Preserved Image Editing with text instructions (TPIE), that for the first time ensures that topology and geometry remain intact in edited images through text-guided generative diffusion models. More specifically, our method treats newly generated samples as deformable variations of a given input template, allowing for controllable and structure-preserving edits. Our proposed TPIE framework consists of two key modules: (i) an autoencoder-based registration network that learns latent representations of object transformations, parameterized by velocity fields, from pairwise training images; and (ii) a novel latent conditional geometric diffusion (LCDG) model that efficiently captures the data distribution of learned transformation features conditioned on custom-defined text instructions. We validate TPIE on a diverse set of 2D and 3D images and compare it with state-of-the-art image editing approaches. Experimental results show that our method outperforms other baselines in generating more realistic images with well-preserved topology. Our code will be made publicly available on GitHub.
摘要：在实际应用中，保留拓扑结构非常重要，特别是在医疗保健和医学等敏感领域，人体解剖结构的正确性至关重要。然而，大多数现有的图像编辑模型都专注于操纵强度和纹理特征，往往忽略了图像中的对象几何形状。为了解决这个问题，本文介绍了一种新方法，即使用文本指令进行拓扑保留的图像编辑 (TPIE)，该方法首次通过文本引导的生成扩散模型确保编辑图像中的拓扑和几何形状保持完整。更具体地说，我们的方法将新生成的样本视为给定输入模板的可变形变体，从而允许进行可控和结构保留的编辑。我们提出的 TPIE 框架由两个关键模块组成：(i) 基于自动编码器的配准网络，从成对训练图像中学习由速度场参数化的对象变换的潜在表示；(ii) 一种新型的潜在条件几何扩散 (LCDG) 模型，可有效捕获以自定义文本指令为条件的学习变换特征的数据分布。我们在一系列不同的 2D 和 3D 图像上验证了 TPIE，并将其与最先进的图像编辑方法进行了比较。实验结果表明，我们的方法在生成具有良好保留拓扑的更逼真图像方面优于其他基线。我们的代码将在 Github 上公开发布。

Title: Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification

Authors: S. P. Sharan, Minkyu Choi, Sahil Shah, Harsh Goel, Mohammad Omama, Sandeep Chinchali
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16718
Pdf URL: https://arxiv.org/pdf/2411.16718
Copy Paste: [[2411.16718]] Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification(https://arxiv.org/abs/2411.16718)
Keywords: generation
Abstract: Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. Then, it evaluates the text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts to evaluate state-of-the-art video generation models against our benchmark. We find that NeuS-V correlates over 5x more strongly with human evaluations than existing metrics do. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work in improving text-to-video generation capabilities.
摘要：Sora、Gen-3、MovieGen 和 CogVideoX 等文本转视频模型的最新进展正在突破合成视频生成的界限，并在机器人、自动驾驶和娱乐等领域得到广泛应用。随着这些模型的普及，出现了各种指标和基准来评估生成的视频的质量。然而，这些指标强调视觉质量和流畅度，忽略了时间保真度和文本到视频对齐，而这两者对于安全关键型应用至关重要。为了解决这一差距，我们引入了 NeuS-V，这是一种新颖的合成视频评估指标，它使用神经符号形式验证技术严格评估文本到视频的对齐。我们的方法首先将提示转换为正式定义的时间逻辑 (TL) 规范，并将生成的视频转换为自动机表示。然后，它通过根据 TL 规范正式检查视频自动机来评估文本到视频的对齐。此外，我们提供了一个时间扩展提示的数据集，以根据我们的基准评估最先进的视频生成模型。我们发现，与现有指标相比，NeuS-V 与人工评估的相关性高出 5 倍以上。我们的评估进一步表明，当前的视频生成模型在这些时间复杂的提示上表现不佳，这凸显了未来需要改进文本到视频的生成能力。
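The idea of checking a generated video against a temporal specification can be illustrated with a toy finite-trace check. This is a minimal sketch and not NeuS-V itself: it assumes each frame has already been reduced to a set of detected proposition labels (the label names below are hypothetical), and it evaluates three common temporal operators directly on that sequence instead of building an automaton or handling detection confidence.

```python
from typing import List, Set

# One set of true propositions per frame (assumed to come from some detector).
Trace = List[Set[str]]


def eventually(trace: Trace, prop: str) -> bool:
    """F prop: prop holds in at least one frame."""
    return any(prop in frame for frame in trace)


def always(trace: Trace, prop: str) -> bool:
    """G prop: prop holds in every frame."""
    return all(prop in frame for frame in trace)


def until(trace: Trace, left: str, right: str) -> bool:
    """left U right: right holds at some frame and left holds at every earlier frame."""
    for frame in trace:
        if right in frame:
            return True
        if left not in frame:
            return False
    return False  # right never became true


# Hypothetical detections for a prompt like
# "a car keeps moving until a stop sign appears".
video = [
    {"car_moving"},
    {"car_moving"},
    {"car_moving", "stop_sign"},
    {"stop_sign"},
]

satisfied = until(video, "car_moving", "stop_sign")
```

A real metric would also need to score partial satisfaction and uncertain detections; this sketch only returns a yes/no answer.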

Title: Importance-based Token Merging for Diffusion Models

Authors: Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16720
Pdf URL: https://arxiv.org/pdf/2411.16720
Copy Paste: [[2411.16720]] Importance-based Token Merging for Diffusion Models(https://arxiv.org/abs/2411.16720)
Keywords: generation
Abstract: Diffusion models excel at high-quality image and video generation. However, a major drawback is their high latency. A simple yet powerful way to speed them up is by merging similar tokens for faster computation, though this can result in some quality loss. In this paper, we demonstrate that preserving important tokens during merging significantly improves sample quality. Notably, the importance of each token can be reliably determined using the classifier-free guidance magnitude, as this measure is strongly correlated with the conditioning input and corresponds to output fidelity. Since classifier-free guidance incurs no additional computational cost and requires no extra modules, our method can be easily integrated into most diffusion-based frameworks. Experiments show that our approach significantly outperforms the baseline across various applications, including text-to-image synthesis, multi-view image generation, and video generation.
摘要：扩散模型擅长生成高质量的图像和视频。然而，一个主要的缺点是它们的高延迟。一个简单而有效的加速方法是合并相似的标记以加快计算速度，尽管这可能会导致一些质量损失。在本文中，我们证明在合并过程中保留重要的标记可以显著提高样本质量。值得注意的是，可以使用无分类器指导幅度可靠地确定每个标记的重要性，因为该度量与条件输入密切相关并且对应于输出保真度。由于无分类器指导不会产生额外的计算成本或需要额外的模块，因此我们的方法可以轻松集成到大多数基于扩散的框架中。实验表明，我们的方法在各种应用中的表现都远远优于基线，包括文本到图像合成、多视图图像生成和视频生成。
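The importance signal described in this abstract can be sketched as follows. The abstract does not define "magnitude" precisely or say how protected tokens enter the merge step, so this sketch makes two labeled assumptions: importance is the per-token L2 norm of the difference between the conditional and unconditional noise predictions, and the highest-scoring tokens are simply excluded from merging. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def guidance_magnitude(eps_cond: Sequence[Vector], eps_uncond: Sequence[Vector]) -> List[float]:
    """Per-token L2 norm of (conditional - unconditional) prediction (assumed definition)."""
    return [
        math.sqrt(sum((c - u) ** 2 for c, u in zip(token_c, token_u)))
        for token_c, token_u in zip(eps_cond, eps_uncond)
    ]


def select_important(scores: Sequence[float], keep: int) -> List[int]:
    """Indices of the `keep` highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Four tokens with two channels each (toy values).
eps_cond = [[1.0, 0.0], [0.5, 0.5], [3.0, 4.0], [0.1, 0.0]]
eps_uncond = [[1.0, 0.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]

scores = guidance_magnitude(eps_cond, eps_uncond)
protected = select_important(scores, keep=2)  # tokens to leave unmerged
```

Both predictions are already computed whenever classifier-free guidance is used, which is why the score adds no extra model evaluations.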

Title: EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Authors: Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, Qingfeng Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16726
Pdf URL: https://arxiv.org/pdf/2411.16726
Copy Paste: [[2411.16726]] EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion(https://arxiv.org/abs/2411.16726)
Keywords: generation
Abstract: Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.
摘要：扩散模型彻底改变了说话头部生成领域，但在表现力、可控性和长时间生成稳定性方面仍面临挑战。在本研究中，我们提出了一个 EmotiveTalk 框架来解决这些问题。首先，为了更好地控制唇部运动和面部表情的生成，设计了一种视觉引导的音频信息解耦（V-AID）方法来生成与唇部运动和表情一致的基于音频的解耦表示。具体而言，为了实现音频和面部表情表示空间之间的对齐，我们在 V-AID 中提出了一个基于扩散的语音间时间扩展（Di-CTE）模块，以在多源情绪条件约束下生成与表情相关的表示。然后，我们提出了一个精心设计的情绪说话头部扩散（ETHD）主干，以高效地生成极具表现力的说话头部视频，其中包含一个表情解耦注入（EDI）模块，可在整合目标表情信息的同时自动将表情与参考肖像解耦，实现更具表现力的生成性能。实验结果表明，EmotiveTalk 可以生成富有表现力的头部表情视频，保证了承诺的情绪可控性和长时间生成过程中的稳定性，与现有方法相比取得了最先进的性能。

Title: Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16738
Pdf URL: https://arxiv.org/pdf/2411.16738
Copy Paste: [[2411.16738]] Classifier-Free Guidance inside the Attraction Basin May Cause Memorization(https://arxiv.org/abs/2411.16738)
Keywords: generation
Abstract: Diffusion models are prone to reproducing images from the training data exactly. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to understand the memorization phenomenon, and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs from which classifier-free guidance is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
摘要：扩散模型倾向于精确地再现训练数据中的图像。这种对训练数据的精确再现令人担忧，因为它可能导致侵犯版权和/或泄露隐私敏感信息。在本文中，我们提出了一种理解记忆现象的新方法，并提出了一种简单而有效的方法来缓解它。我们认为记忆的发生是因为去噪过程中的吸引盆地将扩散轨迹引向记忆图像。然而，这可以通过引导扩散轨迹远离吸引盆地来缓解，即在出现应用无分类器引导的理想过渡点之前不应用无分类器引导。这导致生成图像质量高且与调节机制高度一致的非记忆图像。为了进一步改进这一点，我们提出了一种新的引导技术，相反引导，它可以在去噪过程中更快地逃离吸引盆地。我们证明了在发生记忆的各种场景中吸引盆地的存在，并且我们表明我们提出的方法成功地减轻了记忆。
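The delayed-guidance idea in this abstract can be sketched as a guidance-scale schedule. This is a minimal illustration under stated assumptions: the combination rule is the standard classifier-free guidance formula, "not applying guidance" is taken to mean a scale of 1.0 (the plain conditional prediction), and the transition point is a fixed step index supplied by the caller. The abstract does not say how the transition point is found, and the opposite guidance variant is not sketched here. Function names are hypothetical.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guidance_scale(step: int, transition_step: int, scale: float, pre_scale: float = 1.0) -> float:
    """Use pre_scale before the transition step and the full scale from it onward.

    Steps are counted from the start of denoising (0 is the noisiest step).
    pre_scale=1.0 is an assumption for what "no guidance" means.
    """
    return scale if step >= transition_step else pre_scale


# Toy schedule: five denoising steps, guidance switched on at step 3.
schedule = [guidance_scale(t, transition_step=3, scale=7.5) for t in range(5)]

eps_u = [0.0, 2.0]
eps_c = [1.0, 4.0]
early = cfg_combine(eps_u, eps_c, schedule[0])  # before the transition
late = cfg_combine(eps_u, eps_c, schedule[4])   # after the transition
```

With a scale of 1.0 the combined prediction equals the conditional prediction, so the early steps follow the unguided conditional trajectory.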

Title: Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration Under Adverse Weather

Authors: Jilong Guo, Haobo Yang, Mo Zhou, Xinyu Zhang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2411.16739
Pdf URL: https://arxiv.org/pdf/2411.16739
Copy Paste: [[2411.16739]] Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration Under Adverse Weather(https://arxiv.org/abs/2411.16739)
Keywords: restoration
Abstract: Removing adverse weather conditions such as rain, raindrops, and snow from images is critical for various real-world applications, including autonomous driving, surveillance, and remote sensing. However, existing multi-task approaches typically rely on augmenting the model with additional parameters to handle multiple scenarios. While this enables the model to address diverse tasks, the introduction of extra parameters significantly complicates its practical deployment. In this paper, we propose a novel Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration under adverse weather, designed to effectively handle image degradation under diverse weather conditions without additional parameters. Our method segments model parameters into common and specific components by evaluating the gradient variation intensity during training for each specific weather condition. This enables the model to precisely and adaptively learn relevant features for each weather scenario, improving both efficiency and effectiveness without compromising on performance. This method constructs specific masks based on gradient fluctuations to isolate parameters influenced by other tasks, ensuring that the model achieves strong performance across all scenarios without adding extra parameters. We demonstrate the state-of-the-art performance of our framework through extensive experiments on multiple benchmark datasets. Specifically, our method achieves PSNR scores of 29.22 on the Raindrop dataset, 30.76 on the Rain dataset, and 29.56 on the Snow100K dataset. Code is available at this https URL.
摘要：从图像中去除雨、雨滴和雪等恶劣天气条件对于各种实际应用至关重要，包括自动驾驶、监控和遥感。然而，现有的多任务方法通常依赖于使用额外参数增强模型来处理多种场景。虽然这使模型能够处理不同的任务，但额外参数的引入大大增加了其实际部署的复杂性。在本文中，我们提出了一种用于恶劣天气下多场景图像恢复的新型梯度引导参数掩码，旨在有效处理不同天气条件下的图像退化，而无需额外参数。我们的方法通过评估每种特定天气条件下训练期间的梯度变化强度，将模型参数划分为公共和特定成分。这使模型能够精确且自适应地学习每种天气场景的相关特征，从而在不影响性能的情况下提高效率和效果。该方法根据梯度波动构建特定掩码以隔离受其他任务影响的参数，确保模型在所有场景中实现强大的性能而无需添加额外参数。我们通过对多个基准数据集进行大量实验展示了我们框架的最先进的性能。具体来说，我们的方法在 Raindrop 数据集上实现了 29.22 的 PSNR 得分，在 Rain 数据集上实现了 30.76 的 PSNR 得分，在 Snow100K 数据集上实现了 29.56 的 PSNR 得分。代码可从此 https URL 获取。
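For readers unfamiliar with the PSNR figures quoted in this abstract, the generic definition is 10 * log10(MAX^2 / MSE), in decibels. The sketch below implements that textbook formula on flat pixel lists. It is an illustration only: the abstract does not say which channels, value range, or evaluation protocol the authors used, so the 8-bit peak value of 255 is an assumption.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example: two pixels are off by 10 intensity levels.
clean = [100.0, 100.0, 100.0, 100.0]
output = [110.0, 90.0, 100.0, 100.0]
score = psnr(clean, output)
```

Higher is better; because the scale is logarithmic, a gain of a few tenths of a dB can reflect a noticeable drop in mean squared error.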

Title: Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Authors: Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16740
Pdf URL: https://arxiv.org/pdf/2411.16740
Copy Paste: [[2411.16740]] Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents(https://arxiv.org/abs/2411.16740)
Keywords: generation
Abstract: Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question-answering are limited in scope: each question is paired with at most 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world use. To reduce these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with a 9% and 11% improvement in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Additionally, integrating V-RAG with LMMs enables them to efficiently operate across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at this https URL.
摘要：大型多模态模型 (LMM) 在视觉语言理解方面取得了令人瞩目的进展，但它们在需要对大量图像进行复杂推理的实际应用中面临限制。现有的多图像问答基准范围有限，每个问题仅与最多 30 张图像配对，这不能完全满足实际使用中遇到的大规模检索任务的需求。为了缩小这些差距，我们引入了两个文档 haystack 基准，称为 DocHaystack 和 InfoHaystack，旨在评估 LMM 在大规模视觉文档检索和理解方面的表现。此外，我们提出了 V-RAG，这是一种新颖的以视觉为中心的检索增强生成 (RAG) 框架，它利用一套多模态视觉编码器，每个编码器都针对特定优势进行了优化，并配备了专用的问题文档相关性模块。 V-RAG 树立了新标准，与之前的最佳基线模型相比，在具有挑战性的 DocHaystack-1000 和 InfoHaystack-1000 基准测试中，Recall@1 分别提高了 9% 和 11%。此外，将 V-RAG 与 LMM 集成使它们能够高效地处理数千张图像，从而显著提高我们的 DocHaystack 和 InfoHaystack 基准测试。我们的代码和数据集可在此 https URL 上找到

Title: FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction

Authors: Junwei You, Rui Gan, Weizhe Tang, Zilin Huang, Jiaxi Liu, Zhuoyu Jiang, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu, Sikai Chen, Bin Ran
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2411.16747
Pdf URL: https://arxiv.org/pdf/2411.16747
Copy Paste: [[2411.16747]] FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction(https://arxiv.org/abs/2411.16747)
Keywords: generative
Abstract: Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS). Although deep learning-based approaches - especially those utilizing transformer-based and generative models - have markedly improved prediction accuracy by capturing complex, non-linear patterns in vehicle dynamics and traffic interactions, they frequently overlook detailed car-following behaviors and the inter-vehicle interactions critical for real-world driving applications, particularly in fully autonomous or mixed traffic scenarios. To address the issue, this study introduces a scaled noise conditional diffusion model for car-following trajectory prediction, which integrates detailed inter-vehicular interactions and car-following dynamics into a generative framework, improving both the accuracy and plausibility of predicted trajectories. The model utilizes a novel pipeline to capture historical vehicle dynamics by scaling noise with encoded historical features within the diffusion process. Particularly, it employs a cross-attention-based transformer architecture to model intricate inter-vehicle dependencies, effectively guiding the denoising process and enhancing prediction accuracy. Experimental results on diverse real-world driving scenarios demonstrate the state-of-the-art performance and robustness of the proposed method.
摘要：车辆轨迹预测对于推进自动驾驶和高级驾驶辅助系统 (ADAS) 至关重要。尽管基于深度学习的方法（尤其是那些利用基于变换器和生成模型的方法）通过捕捉车辆动力学和交通交互中复杂的非线性模式显著提高了预测准确性，但它们经常忽略详细的跟车行为和对现实世界驾驶应用至关重要的车辆间交互，特别是在完全自动驾驶或混合交通场景中。为了解决这个问题，本研究引入了一种用于跟车轨迹预测的缩放噪声条件扩散模型，该模型将详细的车辆间交互和车辆跟车动力学集成到一个生成框架中，提高了预测轨迹的准确性和合理性。该模型利用一种新颖的管道，通过在扩散过程中使用编码的历史特征缩放噪声来捕获历史车辆动力学。特别是，它采用了基于交叉注意力的变换器架构来模拟复杂的车辆间依赖关系，有效地指导了去噪过程并提高了预测准确性。在各种真实驾驶场景中的实验结果证明了所提出方法的先进性能和稳健性。

Title: LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Authors: Haojie Zhang, Zhihao Liang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li, Jianhua Tao, Yaling Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16748
Pdf URL: https://arxiv.org/pdf/2411.16748
Copy Paste: [[2411.16748]] LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis(https://arxiv.org/abs/2411.16748)
Keywords: generation
Abstract: Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.
摘要：使用音频的肖像动画发展迅速，使得创建越来越逼真和富有表现力的动画脸成为可能。这种多模态引导视频生成任务的挑战在于融合各种模态，同时确保时间和肖像的一致性。我们进一步寻求制作生动的说话头像。为了应对这些挑战，我们提出了 LetsTalk（用于说话视频合成的 LatEnt Diffusion TranSformer），这是一种扩散变换器，它结合了模块化时间和空间注意机制来合并多模态并增强时空一致性。为了处理多模态条件，我们首先总结了三种融合方案，从浅融合到深融合紧凑性，并彻底探索它们的影响和适用性。然后，我们根据图像、音频和视频生成的模态差异提出合适的解决方案。对于肖像，我们利用深度融合方案（Symbiotic Fusion）来确保肖像一致性。对于音频，我们实施了浅融合方案（Direct Fusion）以实现音频动画对齐，同时保留多样性。我们大量的实验表明，我们的方法可以生成时间连贯且逼真的视频，并增强多样性和生动性。

Title: AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

Authors: You Li, Fan Ma, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16749
Pdf URL: https://arxiv.org/pdf/2411.16749
Copy Paste: [[2411.16749]] AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks(https://arxiv.org/abs/2411.16749)
Keywords: generation
Abstract: Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, the synthetic framework is usually designed with meticulous human effort for each task due to various requirements on image layout, content, and annotation formats, restricting the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality synthetic images that are controllable and based on the generated layouts. In addition, user-specific reference images and style images can be incorporated into the generation to meet task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.
摘要：扩散模型最近已被用于生成高质量图像，从而减少了对手动数据收集的需求，并提高了模型在对象检测、实例分割和图像感知等任务中的泛化能力。然而，由于对图像布局、内容和注释格式的各种要求，合成框架通常需要为每个任务精心设计，这限制了合成数据在更一般场景中的应用。在本文中，我们提出了 AnySynth，这是一个统一的框架，集成了适应性强、全面且高度可控的组件，能够根据不同的需求生成任意类型的合成数据。具体来说，首先引入任务特定的布局生成模块，利用大型语言模型的生成能力和真实世界图像的布局先验为不同任务生成合理的布局。然后开发单控图像生成模块，以创建可控且基于生成的布局的高质量合成图像。此外，可以根据任务要求将用户特定的参考图像和风格图像纳入生成中。最后，面向任务的注释模块为不同任务中生成的图像提供精确而详细的注释。我们已经在各种任务中验证了框架的性能，包括少样本目标检测、跨域目标检测、零样本组合图像检索以及多模态图像感知和定位。我们的框架合成的特定数据显著提高了这些任务中的模型性能，证明了我们框架的通用性和有效性。

Title: Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

Authors: You Li, Fan Ma, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16752
Pdf URL: https://arxiv.org/pdf/2411.16752
Copy Paste: [[2411.16752]] Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy(https://arxiv.org/abs/2411.16752)
Keywords: generation
Abstract: Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions. Current methods focus on projecting the query image into the text feature space, subsequently combining them with features of query texts for retrieval. However, retrieving images only with the text features cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description, enhancing query representation in the retrieval process. We first leverage the large language model's generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performance. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@K of 70.07 at K=10. Additionally, we improve Recall@10 on the FashionIQ dataset from 45.11 to 45.74, and improve on the baseline on CIRCO, with mAPK@10 increasing from 32.24 to 34.26.
摘要：零样本组合图像检索 (ZSCIR) 需要检索与查询图像和相关标题匹配的图像。当前方法侧重于将查询图像投影到文本特征空间，然后将它们与查询文本的特征相结合进行检索。但是，由于图像和文本之间存在天然差距，仅使用文本特征检索图像无法保证详细对齐。在本文中，我们引入了想象的 CIR 代理 (IP-CIR)，这是一种无需训练的方法，可创建与查询图像和文本描述对齐的代理图像，从而增强检索过程中的查询表示。我们首先利用大型语言模型的泛化能力来生成图像布局，然后应用查询文本和图像进行条件生成。通过合并代理图像、查询图像和文本语义扰动，可以增强稳健的查询特征。我们新提出的平衡指标整合了基于文本和代理检索的相似性，允许更准确地检索目标图像，同时将图像端信息纳入流程。在三个公开数据集上的实验表明，我们的方法显著提高了检索性能。我们在 CIRR 数据集上取得了最佳 (SOTA) 结果，在 K=10 时 Recall@K 为 70.07。此外，我们在 FashionIQ 数据集上实现了 Recall@10 的改进，从 45.11 上升到 45.74，并在 CIRCO 中提高了基准性能，mAPK@10 得分从 32.24 上升到 34.26。
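The Recall@K numbers reported in this abstract follow the usual retrieval definition: the percentage of queries whose target image appears among the top K retrieved results. A minimal sketch is below, assuming one ground-truth target per query and rankings given as ordered lists of image identifiers (the identifiers are made up). The benchmarks' official evaluation scripts may differ in details.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Percentage of queries whose target appears in the top-k retrieved items."""
    if len(rankings) != len(targets) or not targets:
        raise ValueError("need one ranking per target and at least one query")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


# Four toy queries, each with a ranked list of retrieved image ids.
rankings = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
targets = ["b", "f", "x", "j"]  # "x" is never retrieved

r1 = recall_at_k(rankings, targets, k=1)
r2 = recall_at_k(rankings, targets, k=2)
r3 = recall_at_k(rankings, targets, k=3)
```

Recall@K can only stay level or rise as K grows, which is why results are always quoted together with the K used.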

Title: Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)

Authors: Nasrin Imanpour, Shashwat Bajpai, Subhankar Ghosh, Sainath Reddy Sankepally, Abhilekh Borah, Hasnat Md Abdullah, Nishoak Kosaraju, Shreyas Dixit, Ashhar Aziz, Shwetangshu Biswas, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16754
Pdf URL: https://arxiv.org/pdf/2411.16754
Copy Paste: [[2411.16754]] Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)(https://arxiv.org/abs/2411.16754)
Keywords: generation, generative
Abstract: The proliferation of AI techniques for image generation, coupled with their increasing accessibility, has raised significant concerns about the potential misuse of these images to spread misinformation. Recent AI-generated image detection (AGID) methods include CNNDetection, NPR, DM Image Detection, Fake Image Detection, DIRE, LASTED, GAN Image Detection, AIDE, SSP, DRCT, RINE, OCC-CLIP, De-Fake, and Deep Fake Detection. However, we argue that the current state-of-the-art AGID techniques are inadequate for effectively detecting contemporary AI-generated images and advocate for a comprehensive reevaluation of these methods. We introduce the Visual Counter Turing Test (VCT^2), a benchmark comprising ~130K images generated by contemporary text-to-image models (Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and Midjourney 6). VCT^2 includes two sets of prompts sourced from tweets by the New York Times Twitter account and captions from the MS COCO dataset. We also evaluate the performance of the aforementioned AGID techniques on the VCT^2 benchmark, highlighting their ineffectiveness in detecting AI-generated images. As image-generative AI models continue to evolve, the need for a quantifiable framework to evaluate these models becomes increasingly critical. To meet this need, we propose the Visual AI Index (V_AI), which assesses generated images from various visual perspectives, including texture complexity and object coherence, setting a new standard for evaluating image-generative AI models. To foster research in this domain, we make our datasets publicly available at this https URL and this https URL.
摘要：用于图像生成的 AI 技术的激增及其日益普及引起了人们对这些图像可能被滥用来传播错误信息的极大担忧。最近的 AI 生成图像检测 (AGID) 方法包括 CNNDetection、NPR、DM 图像检测、假图像检测、DIRE、LASTED、GAN 图像检测、AIDE、SSP、DRCT、RINE、OCC-CLIP、De-Fake 和 Deep Fake 检测。然而，我们认为目前最先进的 AGID 技术不足以有效检测当代 AI 生成的图像，并主张全面重新评估这些方法。我们引入了视觉反图灵测试 (VCT^2)，这是一个基准，包含由当代文本到图像模型 (Stable Diffusion 2.1、Stable Diffusion XL、Stable Diffusion 3、DALL-E 3 和 Midjourney 6) 生成的约 130K 张图像。 VCT^2 包括两组提示，分别来自《纽约时报》 Twitter 帐户的推文和 MS COCO 数据集的标题。我们还评估了上述 AGID 技术在 VCT^2 基准上的性能，强调了它们在检测 AI 生成的图像方面的无效性。随着图像生成 AI 模型的不断发展，对可量化框架进行评估这些模型的需求变得越来越重要。为了满足这一需求，我们提出了视觉 AI 指数 (V_AI)，它从各种视觉角度评估生成的图像，包括纹理复杂性和对象连贯性，为评估图像生成 AI 模型设定了新标准。为了促进该领域的研究，我们公开了此 https URL 和此 https URL 数据集。

Title: Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background

Authors: Youngjae Cho, Gwangyeol Kim, Sirojbek Safarov, Seongdeok Bang, Jaewoo Park
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16767
Pdf URL: https://arxiv.org/pdf/2411.16767
Copy Paste: [[2411.16767]] Revisiting DDIM Inversion for Controlling Defect Generation by Disentangling the Background(https://arxiv.org/abs/2411.16767)
Keywords: generation, generative
Abstract: In anomaly detection, the scarcity of anomalous data compared to normal data poses a challenge in effectively utilizing deep neural network representations to identify anomalous features. From a data-centric perspective, generative models can solve this data imbalance issue by synthesizing anomaly datasets. Although previous research has tried to enhance the controllability and quality of defect generation, it does not consider the relation between background and defect. Since the defect depends on the object's background (i.e., the normal part of an object), training only on the defect area cannot utilize the background information, and generation can even be biased depending on the mask information. In addition, controlling logical anomalies should consider the dependency between background and defect areas (e.g., an orange-colored defect on an orange juice bottle). In this paper, we propose modeling the relationship between the background and the defect, where the background affects the denoising of defects but not the reverse. We introduce a regularizing term to disentangle the denoising of the background from that of defects. Building on this disentanglement loss, we rethink defect generation with DDIM Inversion, where we generate the defect on the target normal image. Additionally, we theoretically prove that our methodology can generate a defect on the target normal image with an invariant background. We demonstrate in several experiments that our synthetic data is realistic and effective.
摘要：在异常检测中，与正常数据相比，异常数据的稀缺性对有效利用深度神经网络表示来识别异常特征构成了挑战。从以数据为中心的角度来看，生成模型可以通过合成异常数据集来解决这种数据不平衡问题。尽管之前的研究试图提高生成缺陷的可控性和质量，但它们没有考虑背景和缺陷之间的关系。由于缺陷取决于物体的背景（即物体的正常部分），因此仅训练缺陷区域无法利用背景信息，甚至生成也可能因掩模信息而产生偏差。此外，控制逻辑异常应考虑背景和缺陷区域之间的依赖关系（例如，橙汁瓶上的橙色缺陷）。在本文中，我们的论文提出了对背景和缺陷之间的关系进行建模，其中背景会影响去噪缺陷；但反之则不然。我们引入了正则化项来将去噪背景与缺陷区分开来。从解缠结损失出发，我们重新思考了使用 DDIM 反转的缺陷生成，我们在目标法线图像上生成缺陷。此外，我们从理论上证明了我们的方法可以在背景不变的情况下在目标法线图像上生成缺陷。我们在多个实验中证明了我们的合成数据是真实有效的。

Title: VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

Authors: Wey Yeh Choong, Yangyang Guo, Mohan Kankanhalli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16771
Pdf URL: https://arxiv.org/pdf/2411.16771
Copy Paste: [[2411.16771]] VidHal: Benchmarking Temporal Hallucinations in Vision LLMs(https://arxiv.org/abs/2411.16771)
Keywords: generation
Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucination. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs regarding hallucination generation. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) extensive development of advanced VLLMs to alleviate this problem.
摘要：人们普遍认为视觉大型语言模型 (VLLM) 容易产生幻觉。针对此问题的现有研究主要局限于图像输入，对基于视频的幻觉的探索有限。此外，当前的评估方法无法捕捉生成的响应中的细微错误，而视频丰富的时空动态往往会加剧这些错误。为了解决这个问题，我们引入了 VidHal，这是一个专门用于评估 VLLM 中基于视频的幻觉的基准。VidHal 是通过在共同的时间方面引导视频实例构建的。我们基准的一个决定性特征在于精心创建字幕，这些字幕代表与每个视频相关的不同程度的幻觉。为了实现细粒度评估，我们提出了一种新颖的字幕排序任务，要求 VLLM 根据幻觉程度对字幕进行排序。我们对 VidHal 进行了广泛的实验，并全面评估了多种模型。我们的结果揭示了现有 VLLM 在幻觉生成方面的重大局限性。通过我们的基准，我们旨在激发进一步的研究：1）全面了解 VLLM 功能，特别是关于幻觉的功能；2）广泛开发先进的 VLLM 以缓解这一问题。
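
The caption ordering task described above asks a model to rank captions by how hallucinated they are. As a minimal sketch of how such a ranking could be scored, the Python below computes the fraction of caption pairs whose relative order matches a reference ordering. The function name, the caption labels, and the choice of a pairwise-agreement score are all assumptions made for illustration; the abstract does not state which metric VidHal actually uses.

```python
from itertools import combinations


def pairwise_order_agreement(predicted, reference):
    """Fraction of caption pairs ranked in the same relative order as the reference.

    Both arguments are lists of the same caption IDs, ordered from least to
    most hallucinated. This is a hypothetical score, not the paper's metric.
    """
    position = {caption: i for i, caption in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    agreeing = sum(1 for a, b in pairs if position[a] < position[b])
    return agreeing / len(pairs)


# Hypothetical caption IDs for one video, least to most hallucinated.
reference = ["faithful", "mildly_hallucinated", "heavily_hallucinated"]
predicted = ["faithful", "heavily_hallucinated", "mildly_hallucinated"]
score = pairwise_order_agreement(predicted, reference)  # 2 of 3 pairs agree
```

A score of 1.0 means the model reproduced the reference order exactly, and 0.0 means it reversed it.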

Title: SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

Authors: Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, Sandeep Chinchali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16776
Pdf URL: https://arxiv.org/pdf/2411.16776
Copy Paste: [[2411.16776]] SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models(https://arxiv.org/abs/2411.16776)
Keywords: generation
Abstract: In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as "Clear and Day" weather, leading to decreased performance in under-represented conditions like "Rainy and Night". To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet, a DM that guides data generation conditioned on semantic maps, along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model.
摘要：近年来，在收集大规模数据集以改进分割和自动驾驶模型方面取得了重大进展。这些大规模数据集通常以“晴天”等常见环境条件为主，导致在“雨天和夜晚”等代表性不足的条件下性能下降。为了解决这个问题，我们引入了 SynDiff-AD，这是一种新颖的数据增强管道，它利用扩散模型 (DM) 为此类子组生成逼真的图像。SynDiff-AD 使用 ControlNet（一种以语义图为条件指导数据生成的 DM）以及一种新颖的提示方案，该方案可生成特定于子组、语义密集的提示。通过使用 SynDiff-AD 增强数据集，我们将 Mask2Former 和 SegFormer 等分割模型的性能分别提高了 Waymo 数据集上的 1.2% 和 2.3%，以及 DeepDrive 数据集上的 1.4% 和 0.7%。此外，我们证明，我们的 SynDiff-AD 管道可在 CARLA 自动驾驶模拟器的各种环境条件下将端到端自动驾驶模型（如 AIM-2D 和 AIM-BEV）的驾驶性能提高多达 20%，从而提供更为稳健的模型。

Title: NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model

Authors: Jinpeng Liu, Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Ying Shan, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16779
Pdf URL: https://arxiv.org/pdf/2411.16779
Copy Paste: [[2411.16779]] NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model(https://arxiv.org/abs/2411.16779)
Keywords: generative
Abstract: We introduce NovelGS, a diffusion model for Gaussian Splatting (GS) given sparse-view images. Recent works leverage feed-forward networks to generate pixel-aligned Gaussians, which can be rendered quickly. Unfortunately, because of their formulation, these methods are unable to produce satisfactory results for areas not covered by the input images. In contrast, we leverage novel-view denoising through a transformer-based network to generate 3D Gaussians. Specifically, by incorporating both conditional views and noisy target views, the network predicts pixel-aligned Gaussians for each view. During training, the rendered target and some additional views of the Gaussians are supervised. During inference, the target views are iteratively rendered and denoised from pure noise. Our approach demonstrates state-of-the-art performance in addressing the multi-view image reconstruction challenge. Due to generative modeling of unseen regions, NovelGS effectively reconstructs 3D objects with consistent and sharp textures. Experimental results on publicly available datasets indicate that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively. We also demonstrate the potential of NovelGS in generative tasks, such as text-to-3D and image-to-3D, by integrating it with existing multiview diffusion models. We will make the code publicly accessible.
摘要：我们引入了 NovelGS，一种针对稀疏视图图像的高斯分布 (GS) 扩散模型。最近的研究利用前馈网络生成像素对齐的高斯，可以快速渲染。不幸的是，由于这些方法的公式化，该方法无法对输入图像未覆盖的区域产生令人满意的结果。相反，我们利用基于变压器的网络进行新颖的视图去噪来生成 3D 高斯。具体而言，通过结合条件视图和嘈杂的目标视图，网络可以预测每个视图的像素对齐的高斯。在训练期间，对渲染的目标和高斯的一些额外视图进行监督。在推理期间，目标视图被迭代渲染并从纯噪声中去噪。我们的方法在解决多视图图像重建挑战方面展示了最先进的性能。由于对看不见的区域进行生成建模，NovelGS 可以有效地重建具有一致且清晰纹理的 3D 对象。在公开数据集上的实验结果表明，NovelGS 在质量和数量上都大大超越了现有的图像到 3D 框架。我们还通过将 NovelGS 与现有的多视图扩散模型相结合，展示了 NovelGS 在文本到 3D 和图像到 3D 等生成任务中的潜力。我们将公开提供代码。

Title: UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Authors: Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16781
Pdf URL: https://arxiv.org/pdf/2411.16781
Copy Paste: [[2411.16781]] UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing(https://arxiv.org/abs/2411.16781)
Keywords: generation
Abstract: Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
摘要：人体姿势在数字时代起着至关重要的作用。虽然最近的研究在理解和生成人体姿势方面取得了令人瞩目的进展，但它们通常仅支持单一模态的控制信号并且孤立地运行，限制了它们在现实世界场景中的应用。本文介绍了 UniPose，这是一个采用大型语言模型 (LLM) 来理解、生成和编辑各种模态的人体姿势的框架，包括图像、文本和 3D SMPL 姿势。具体来说，我们应用姿势标记器将 3D 姿势转换为离散姿势标记，从而能够在统一词汇表中无缝集成到 LLM 中。为了进一步增强细粒度的姿势感知能力，我们为 UniPose 提供了多种视觉编码器，其中包括一个特定于姿势的视觉编码器。得益于统一的学习策略，UniPose 有效地在不同的姿势相关任务之间传递知识，适应看不见的任务，并展示扩展的功能。这项工作是首次尝试构建用于姿势理解、生成和编辑的通用框架。大量实验凸显了 UniPose 在各种与姿势相关的任务中具有竞争力甚至更优异的性能。

Title: From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task

Authors: Bohao Chen, Yanchao Zhang, Yanan Lv, Hua Han, Xi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16792
Pdf URL: https://arxiv.org/pdf/2411.16792
Copy Paste: [[2411.16792]] From Diffusion to Resolution: Leveraging 2D Diffusion Models for 3D Super-Resolution Task(https://arxiv.org/abs/2411.16792)
Keywords: super-resolution, generation
Abstract: Diffusion models have recently emerged as a powerful technique in image generation, especially for image super-resolution tasks. While 2D diffusion models significantly enhance the resolution of individual images, existing diffusion-based methods for 3D volume super-resolution often struggle with structure discontinuities in the axial direction and high sampling costs. In this work, we present a novel approach that leverages the 2D diffusion model and lateral continuity within the volume to enhance 3D volume electron microscopy (vEM) super-resolution. We first simulate lateral degradation with slices in the XY plane and train a 2D diffusion model to learn how to restore the degraded slices. The model is then applied slice-by-slice in the lateral direction of the low-resolution volume, recovering slices while preserving inherent lateral continuity. Following this, a high-frequency-aware 3D super-resolution network is trained on the recovered lateral slice sequences to learn spatial feature transformation across slices. Finally, the network is applied to infer high-resolution volumes in the axial direction, enabling 3D super-resolution. We validate our approach through comprehensive evaluations, including image similarity assessments, resolution analysis, and performance on downstream tasks. Our results on two publicly available focused ion beam scanning electron microscopy (FIB-SEM) datasets demonstrate the robustness and practical applicability of our framework for 3D volume super-resolution.
摘要：扩散模型最近已成为图像生成中的一种强大技术，尤其是对于图像超分辨率任务。虽然 2D 扩散模型显著提高了单个图像的分辨率，但现有的基于扩散的 3D 体积超分辨率方法通常会遇到轴向结构不连续性和高采样成本的问题。在这项工作中，我们提出了一种新方法，利用 2D 扩散模型和体积内的横向连续性来增强 3D 体积电子显微镜 (vEM) 超分辨率。我们首先用 XY 平面中的切片模拟横向退化，并训练 2D 扩散模型来学习如何恢复退化的切片。然后将该模型逐片应用于低分辨率体积的横向方向，在保留固有横向连续性的同时恢复切片。随后，在恢复横向切片序列上训练高频感知 3D 超分辨率网络，以学习跨切片的空间特征变换。最后，应用该网络推断轴向高分辨率体积，实现 3D 超分辨率。我们通过综合评估来验证我们的方法，包括图像相似性评估、分辨率分析和下游任务的性能。我们在两个公开可用的聚焦离子束扫描电子显微镜 (FIB-SEM) 数据集上的结果证明了我们的 3D 体积超分辨率框架的稳健性和实用性。

Title: Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image

Authors: Jiajing Lin, Zhenzhong Wang, Shu Jiang, Yongjie Hou, Min Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16800
Pdf URL: https://arxiv.org/pdf/2411.16800
Copy Paste: [[2411.16800]] Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image(https://arxiv.org/abs/2411.16800)
Keywords: generation
Abstract: The task of 4D content generation involves creating dynamic 3D models that evolve over time in response to specific input conditions, such as images. Existing methods rely heavily on pre-trained video diffusion models to guide 4D content dynamics, but these approaches often fail to capture essential physical principles, as video diffusion models lack a robust understanding of real-world physics. Moreover, these models face challenges in providing fine-grained control over dynamics and exhibit high computational costs. In this work, we propose Phys4DGen, a novel, high-efficiency framework that generates physics-compliant 4D content from a single image with enhanced control capabilities. Our approach uniquely integrates physical simulations into the 4D generation pipeline, ensuring adherence to fundamental physical laws. Inspired by the human ability to infer physical properties visually, we introduce a Physical Perception Module (PPM) that discerns the material properties and structural components of the 3D object from the input image, facilitating accurate downstream simulations. Phys4DGen significantly accelerates the 4D generation process by eliminating iterative optimization steps in the dynamics modeling phase. It allows users to intuitively control the movement speed and direction of generated 4D content by adjusting external forces, achieving finely tunable, physically plausible animations. Extensive evaluations show that Phys4DGen outperforms existing methods in both inference speed and physical realism, producing high-quality, controllable 4D content.
摘要：4D 内容生成任务涉及创建动态 3D 模型，这些模型会随着特定输入条件（例如图像）的变化而变化。现有方法严重依赖预先训练的视频扩散模型来指导 4D 内容动态，但这些方法往往无法捕捉到基本的物理原理，因为视频扩散模型缺乏对现实世界物理的可靠理解。此外，这些模型在提供对动态的细粒度控制方面面临挑战，并且计算成本很高。在这项工作中，我们提出了 Phys4DGen，这是一种新颖的高效框架，它可以通过增强的控制能力从单个图像生成符合物理的 4D 内容。我们的方法以独特的方式将物理模拟集成到 4D 生成管道中，确保遵守基本物理定律。受人类通过视觉推断物理属性的能力的启发，我们引入了一个物理感知模块 (PPM)，它可以从输入图像中辨别 3D 对象的材料属性和结构成分，从而促进准确的下游模拟。 Phys4DGen 通过消除动力学建模阶段的迭代优化步骤，显著加快了 4D 生成过程。它允许用户通过调整外力直观地控制生成的 4D 内容的运动速度和方向，实现精细可调、物理上合理的动画。大量评估表明，Phys4DGen 在推理速度和物理真实感方面均优于现有方法，可生成高质量、可控的 4D 内容。

Title: Controllable Human Image Generation with Personalized Multi-Garments

Authors: Yisol Choi, Sangkyung Kwak, Sihyun Yu, Hyungwon Choi, Jinwoo Shin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16801
Pdf URL: https://arxiv.org/pdf/2411.16801
Copy Paste: [[2411.16801]] Controllable Human Image Generation with Personalized Multi-Garments(https://arxiv.org/abs/2411.16801)
Keywords: generation
Abstract: We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather a photograph of every single garment worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in the human image and the extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose, face, etc.
摘要：我们提出了 BootComp，这是一种基于文本到图像扩散模型的新型框架，用于生成具有多个参考服装的可控人体图像。这里的主要瓶颈是训练的数据采集：为每个人体收集大量高质量的参考服装图像数据集非常具有挑战性，也就是说，理想情况下，需要手动收集每个人穿着的每一张服装照片。为了解决这个问题，我们提出了一个数据生成管道，通过引入一个模型从每个人体图像中提取任何参考服装图像，构建一个由人体和多件服装对组成的大型合成数据集。为了确保数据质量，我们还提出了一种过滤策略，根据测量人体图像中呈现的服装和提取的服装之间的感知相似性来去除不需要的生成数据。最后，通过利用构建的合成数据集，我们训练了一个具有两个并行去噪路径的扩散模型，该模型使用多个服装图像作为条件来生成人体图像，同时保留其细粒度细节。我们进一步展示了我们框架的广泛适用性，通过将其适应时尚领域的不同类型的基于参考的生成，包括虚拟试穿，以及具有其他条件（例如姿势、面部等）的可控人体图像生成。

Title: InTraGen: Trajectory-controlled Video Generation for Object Interactions

Authors: Zuhao Liu, Aleksandar Yanev, Ahmad Mahmood, Ivan Nikolov, Saman Motamed, Wei-Shi Zheng, Xi Wang, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16804
Pdf URL: https://arxiv.org/pdf/2411.16804
Copy Paste: [[2411.16804]] InTraGen: Trajectory-controlled Video Generation for Object Interactions(https://arxiv.org/abs/2411.16804)
Keywords: generation
Abstract: Advances in video generation have significantly improved the realism and quality of created scenes. This has fueled interest in developing intuitive tools that let users leverage video generation as world simulators. Text-to-video (T2V) generation is one such approach, enabling video creation from text descriptions only. Yet, due to the inherent ambiguity in texts and the limited temporal information offered by text prompts, researchers have explored additional control signals, like trajectory-guided systems, for more accurate T2V generation. Nonetheless, methods to evaluate whether T2V models can generate realistic interactions between multiple objects are lacking. We introduce InTraGen, a pipeline for improved trajectory-based generation of object interaction scenarios. We propose 4 new datasets and a novel trajectory quality metric to evaluate the performance of the proposed InTraGen. To achieve object interaction, we introduce a multi-modal interaction encoding pipeline with an object ID injection mechanism that enriches object-environment interactions. Our results demonstrate improvements in both visual fidelity and quantitative performance. Code and datasets are available at this https URL.
摘要：视频生成技术的进步显著提高了所创建场景的真实度和质量。这激发了人们对开发直观工具的兴趣，这些工具可让用户利用视频生成作为世界模拟器。文本到视频 (T2V) 生成就是这样一种方法，它仅通过文本描述即可创建视频。然而，由于文本固有的歧义性和文本提示提供的时间信息有限，研究人员已经探索了其他控制信号，如轨迹引导系统，以实现更准确的 T2V 生成。尽管如此，评估 T2V 模型是否可以在多个对象之间生成逼真交互的方法仍然缺乏。我们引入了 InTraGen，这是一种用于改进基于轨迹的对象交互场景生成的管道。我们提出了 4 个新数据集和一个新颖的轨迹质量指标来评估所提出的 InTraGen 的性能。为了实现对象交互，我们引入了一个多模态交互编码管道，该管道具有对象 ID 注入机制，可以丰富对象与环境的交互。我们的结果表明，视觉保真度和定量性能都有所改善。代码和数据集可在此 https URL 上找到

Title: Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation

Authors: Shengeng Tang, Jiayi He, Lechao Cheng, Jingjing Wu, Dan Guo, Richang Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16810
Pdf URL: https://arxiv.org/pdf/2411.16810
Copy Paste: [[2411.16810]] Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation(https://arxiv.org/abs/2411.16810)
Keywords: generation
Abstract: Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.
摘要：从离散片段生成连续的手语视频具有挑战性，因为需要平滑的过渡以保持自然的流畅性和含义。简单地连接孤立手势的传统方法通常会导致突然的过渡，破坏视频的连贯性。为了解决这个问题，我们提出了一个新颖的框架 Sign-D2C，它采用条件扩散模型来合成上下文平滑的过渡帧，从而实现连续手语序列的无缝构建。我们的方法通过随机屏蔽长时间手势视频中的片段来模拟过渡帧的缺失，将过渡帧生成的无监督问题转变为监督训练任务。该模型通过对高斯噪声进行去噪来学习预测这些被屏蔽的帧，并以周围的手势观察为条件，使其能够处理复杂的非结构化过渡。在推理过程中，我们应用线性插值填充策略，通过在边界帧之间进行插值来初始化缺失帧，为扩散模型的迭代细化提供稳定的基础。在 PHOENIX14T、USTC-CSL100 和 USTC-SLR500 数据集上进行的大量实验证明了我们的方法在制作连续、自然手语视频方面的有效性。
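
The linearly interpolating padding strategy mentioned above initializes the missing transition frames by interpolating between the two boundary frames. A minimal Python sketch of that idea follows, assuming each frame is a flat list of pose coordinates. The function name and the toy two-dimensional poses are illustrative assumptions, not taken from the Sign-D2C code, and the diffusion refinement that follows this initialization is not shown.

```python
def interpolate_padding(start_frame, end_frame, n_missing):
    """Linearly interpolate n_missing frames between two boundary frames.

    start_frame and end_frame are equal-length lists of pose coordinates.
    The boundary frames themselves are not included in the result.
    """
    frames = []
    for k in range(1, n_missing + 1):
        t = k / (n_missing + 1)  # interpolation weight, strictly between 0 and 1
        frames.append([(1 - t) * s + t * e for s, e in zip(start_frame, end_frame)])
    return frames


# Toy example: the last pose of one sign and the first pose of the next.
initial_frames = interpolate_padding([0.0, 0.0], [4.0, 8.0], n_missing=3)
# initial_frames == [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

In the setting the abstract describes, frames initialized this way would serve only as a starting point for iterative refinement by the diffusion model.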

Title: Pathways on the Image Manifold: Image Editing via Video Generation

Authors: Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16819
Pdf URL: https://arxiv.org/pdf/2411.16819
Copy Paste: [[2411.16819]] Pathways on the Image Manifold: Image Editing via Video Generation(https://arxiv.org/abs/2411.16819)
Keywords: generation
Abstract: Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.
摘要：在图像扩散模型的推动下，图像编辑领域的最新进展取得了显著进展。然而，仍然存在重大挑战，因为这些模型通常难以准确遵循复杂的编辑指令，并且经常通过改变原始图像的关键元素来损害保真度。同时，视频生成也取得了显著的进步，模型可以有效地充当一致且连续的世界模拟器。在本文中，我们建议通过利用图像到视频模型进行图像编辑来合并这两个领域。我们将图像编辑重新表述为一个时间过程，使用预训练的视频模型来创建从原始图像到所需编辑的平滑过渡。这种方法不断遍历图像流形，确保编辑一致，同时保留原始图像的关键方面。我们的方法在基于文本的图像编辑方面取得了最先进的成果，在编辑准确性和图像保存方面都有显著的改进。

Title: DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow

Authors: Ken Deng, Yuanchen Guo, Jingxiang Sun, Zixin Zou, Yangguang Li, Xin Cai, Yanpei Cao, Yebin Liu, Ding Liang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16820
Pdf URL: https://arxiv.org/pdf/2411.16820
Copy Paste: [[2411.16820]] DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow(https://arxiv.org/abs/2411.16820)
Keywords: generation, generative
Abstract: Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.
摘要：现代 3D 生成方法可以从稀疏或单一视图快速创建形状，但由于计算限制，它们的输出通常缺乏几何细节。我们提出了 DetailGen3D，这是一种专门为增强这些生成的 3D 形状而设计的生成方法。我们的主要见解是通过潜在空间中依赖于数据的流直接对粗到细的转换进行建模，从而避免大规模 3D 生成模型的计算开销。我们引入了一种标记匹配策略，可确保细化过程中准确的空间对应性，从而实现局部细节合成，同时保留全局结构。通过精心设计我们的训练数据以匹配合成粗形状的特征，我们的方法可以有效增强各种 3D 生成和重建方法生成的形状，从单视图到稀疏的多视图输入。大量实验表明，DetailGen3D 实现了高保真几何细节合成，同时保持了训练效率。

Title: Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

Authors: Hanhui Wang, Yihua Zhang, Ruizheng Bai, Yue Zhao, Sijia Liu, Zhengzhong Tu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16832
Pdf URL: https://arxiv.org/pdf/2411.16832
Copy Paste: [[2411.16832]] Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing(https://arxiv.org/abs/2411.16832)
Keywords: generative
Abstract: Recent advancements in diffusion models have made generative image editing more accessible, enabling creative edits but raising ethical concerns, particularly regarding malicious edits to human portraits that threaten privacy and identity security. Existing protection methods primarily rely on adversarial perturbations to nullify edits but often fail against diverse editing requests. We propose FaceLock, a novel approach to portrait protection that optimizes adversarial perturbations to destroy or significantly alter biometric information, rendering edited outputs biometrically unrecognizable. FaceLock integrates facial recognition and visual perception into perturbation optimization to provide robust protection against various editing attempts. We also highlight flaws in commonly used evaluation metrics and reveal how they can be manipulated, emphasizing the need for reliable assessments of protection. Experiments show FaceLock outperforms baselines in defending against malicious edits and is robust against purification techniques. Ablation studies confirm its stability and broad applicability across diffusion-based editing algorithms. Our work advances biometric defense and sets the foundation for privacy-preserving practices in image editing. The code is available at: this https URL.
摘要：扩散模型的最新进展使得生成图像编辑更加容易实现，使得创造性编辑成为可能，但也引发了道德问题，尤其是对威胁隐私和身份安全的人类肖像恶意编辑。现有的保护方法主要依靠对抗性扰动来使编辑无效，但往往无法应对各种编辑请求。我们提出了 FaceLock，这是一种新颖的肖像保护方法，它优化了对抗性扰动以破坏或显著改变生物特征信息，使编辑后的输出在生物特征上无法识别。FaceLock 将面部识别和视觉感知集成到扰动优化中，以提供对各种编辑尝试的强大保护。我们还强调了常用评估指标中的缺陷，并揭示了它们如何被操纵，强调了对保护进行可靠评估的必要性。实验表明，FaceLock 在防御恶意编辑方面优于基线，并且对净化技术具有很强的鲁棒性。消融研究证实了其在基于扩散的编辑算法中的稳定性和广泛适用性。我们的工作推进了生物特征防御，并为图像编辑中的隐私保护实践奠定了基础。代码可在以下位置获得：此 https URL。

Title: SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Authors: Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16856
Pdf URL: https://arxiv.org/pdf/2411.16856
Copy Paste: [[2411.16856]] SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE(https://arxiv.org/abs/2411.16856)
Keywords: generation
Abstract: Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
摘要：自回归模型在各个领域都取得了显著的成功，从大型语言模型 (LLM) 到大型多模态模型 (LMM) 和 2D 内容生成，越来越接近通用人工智能 (AGI)。尽管取得了这些进展，但将自回归方法应用于 3D 对象生成和理解仍在很大程度上尚未得到探索。本文介绍了 Scale AutoRegressive 3D (SAR3D)，这是一种新颖的框架，它利用多尺度 3D 矢量量化变分自编码器 (VQVAE) 对 3D 对象进行标记，以实现高效的自回归生成和详细理解。通过预测多尺度潜在表示中的下一个尺度而不是下一个单个标记，SAR3D 显著缩短了生成时间，在 A6000 GPU 上仅用 0.82 秒即可实现快速 3D 对象生成。此外，鉴于富含分层 3D 感知信息的标记，我们对它们进行了预训练的 LLM 微调，从而实现了对 3D 内容的多模态理解。我们的实验表明，SAR3D 在速度和质量上都超越了当前的 3D 生成方法，并允许 LLM 全面解释和描述 3D 模型。
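
The abstract attributes the shorter generation time to predicting the next scale rather than the next single token. The Python sketch below counts sequential decoding steps under each scheme to illustrate why that helps. The scale schedule and the assumption that each scale is a square token map are hypothetical values chosen for illustration; they are not SAR3D's actual configuration.

```python
def count_decoding_steps(scale_sides):
    """Compare sequential step counts for next-token and next-scale decoding.

    scale_sides lists the side length of a square token map at each scale
    (a hypothetical layout). Next-token decoding emits one token per step;
    next-scale decoding emits all tokens of one scale per step.
    """
    tokens_per_scale = [side * side for side in scale_sides]
    next_token_steps = sum(tokens_per_scale)
    next_scale_steps = len(scale_sides)
    return next_token_steps, next_scale_steps


# Hypothetical coarse-to-fine schedule with 10 scales.
token_steps, scale_steps = count_decoding_steps([1, 2, 3, 4, 5, 6, 8, 10, 13, 16])
# token_steps == 680, scale_steps == 10
```

Under these assumed sizes, the number of sequential model calls drops from 680 to 10, although each next-scale call predicts many tokens at once.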

Title: Explainable AI Approach using Near Misses Analysis

Authors: Eran Kaufman, Avivit Levy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16895
Pdf URL: https://arxiv.org/pdf/2411.16895
Copy Paste: [[2411.16895]] Explainable AI Approach using Near Misses Analysis(https://arxiv.org/abs/2411.16895)
Keywords: generation
Abstract: This paper introduces a novel XAI approach based on near-misses analysis (NMA). This approach reveals a hierarchy of logical 'concepts' inferred from the latent decision-making process of a Neural Network (NN) without delving into its explicit structure. We examined our proposed XAI approach on different network architectures that vary in size and shape (e.g., ResNet, VGG, EfficientNet, MobileNet) on several datasets (ImageNet and CIFAR100). The results demonstrate its usability in reflecting NNs' latent process of concept generation. We also introduce a new metric for explainability. Moreover, our experiments suggest that efficient architectures, which achieve a similar accuracy level with far fewer neurons, may still pay a price in explainability and robustness in terms of concept generation. We thus pave a promising new path for XAI research to follow.
摘要：本文介绍了一种基于近失分析 (NMA) 的新型 XAI 方法。该方法揭示了从神经网络 (NN) 的潜在决策过程推断出的逻辑“概念”层次结构，而无需深入研究其显式结构。我们在多个数据集 (ImageNet 和 CIFAR100) 上对大小和形状各异的不同网络架构 (例如 ResNet、VGG、EfficientNet、MobileNet) 测试了我们提出的 XAI 方法。结果证明了它能够反映 NN 概念生成的潜在过程。我们生成了一个新的可解释性指标。此外，我们的实验表明，以更少的神经元实现类似准确度水平的高效架构在概念生成方面仍可能以可解释性和鲁棒性为代价。因此，我们为 XAI 研究铺平了一条有希望的新道路。

Title: ZoomLDM: Latent Diffusion Model for multi-scale image generation

Authors: Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis, Prateek Prasanna, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16969
Pdf URL: https://arxiv.org/pdf/2411.16969
Copy Paste: [[2411.16969]] ZoomLDM: Latent Diffusion Model for multi-scale image generation(https://arxiv.org/abs/2411.16969)
Keywords: super-resolution, generation, generative
Abstract: Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to $4096 \times 4096$ pixels and $4\times$ super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments. We provide high-resolution examples of the generated images on our website this https URL.

Title: Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation

Authors: Shambhavi Mishra, Julio Silva-Rodríguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17002
Pdf URL: https://arxiv.org/pdf/2411.17002
Copy Paste: [[2411.17002]] Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation(https://arxiv.org/abs/2411.17002)
Keywords: generation
Abstract: Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we explore how to efficiently leverage class text information to mitigate these distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple template knowledge distillation approach, which replicates multi-view contrastive learning strategies in unsupervised representation learning but without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, yet being computationally and memory efficient.
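
The label assignment idea in the CLIP-OT abstract can be sketched with entropic optimal transport. This is a minimal illustration, not the authors' implementation: the Sinkhorn solver, the uniform marginals over samples and classes, and the names `sinkhorn_pseudo_labels`, `epsilon` and `n_iters` are all assumptions. The input is a matrix of image-to-class-text similarities, with the class text embeddings acting as fixed centroids.

```python
import math


def sinkhorn_pseudo_labels(similarity, epsilon=0.1, n_iters=200):
    """Soft pseudo-labels from an image-by-class similarity matrix.

    similarity[i][j] is the similarity between image i and the (fixed)
    text embedding of class j. Assumes uniform marginals: every image
    carries mass 1/n and every class receives mass 1/k.
    """
    n, k = len(similarity), len(similarity[0])
    # Gibbs kernel of the entropic OT problem (cost = -similarity).
    kernel = [[math.exp(s / epsilon) for s in row] for row in similarity]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Alternate scaling so rows sum to 1/n and columns sum to 1/k.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Normalise each row so it reads as a probability vector over classes.
    return [[p / sum(row) for p in row] for row in plan]


if __name__ == "__main__":
    sims = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
    soft = sinkhorn_pseudo_labels(sims)
    hard = [row.index(max(row)) for row in soft]
    print(hard)
```

The column constraint is what distinguishes this from a plain softmax over similarities: it discourages all test samples from collapsing onto a single class.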

Title: TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On

Authors: Zhenchen Wan, Yanwu Xu, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17017
Pdf URL: https://arxiv.org/pdf/2411.17017
Copy Paste: [[2411.17017]] TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On(https://arxiv.org/abs/2411.17017)
Keywords: generation, generative
Abstract: Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and preserving fine-grained details, such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism to generate prompts by optimizing a Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for the VTO task.

Title: g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

Authors: Zihan Wang, Gim Hee Lee
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2411.17030
Pdf URL: https://arxiv.org/pdf/2411.17030
Copy Paste: [[2411.17030]] g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks(https://arxiv.org/abs/2411.17030)
Keywords: generation
Abstract: We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on a large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) Novel view representation predictions from any position in the 3D scene; 2) Generation of BEV maps centered on the agent; 3) Querying targets using multi-granularity language within the above-mentioned representations. Our representation can be generalized to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, our g3D-LF produces representations at different scales and perspectives, aligned with multi-granularity language, via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks.

Title: Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

Authors: Jaemin Kim, Bryan S Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17041
Pdf URL: https://arxiv.org/pdf/2411.17041
Copy Paste: [[2411.17041]] Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models(https://arxiv.org/abs/2411.17041)
Keywords: generation, generative
Abstract: Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependency across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free$^2$Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free$^2$Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward model. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free$^2$Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.

Title: Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation

Authors: Minh-Tuan Tran, Trung Le, Xuan-May Le, Jianfei Cai, Mehrtash Harandi, Dinh Phung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17046
Pdf URL: https://arxiv.org/pdf/2411.17046
Copy Paste: [[2411.17046]] Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation(https://arxiv.org/abs/2411.17046)
Keywords: generation
Abstract: Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data. While DFKD methods have achieved success on smaller datasets like CIFAR10 and CIFAR100, they encounter challenges on larger, high-resolution datasets such as ImageNet. A primary issue with previous approaches is their generation of synthetic images at high resolutions (e.g., $224 \times 224$) without leveraging information from real images, often resulting in noisy images that lack essential class-specific features in large datasets. Additionally, the computational cost of generating the extensive data needed for effective knowledge transfer can be prohibitive. In this paper, we introduce MUlti-reSolution data-freE (MUSE) to address these limitations. MUSE generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features. To further enhance model diversity, we propose multi-resolution generation and embedding diversity techniques that strengthen latent space representations, leading to significant performance improvements. Experimental results demonstrate that MUSE achieves state-of-the-art performance across both small- and large-scale datasets, with notable performance gains of up to two digits in nearly all ImageNet and subset experiments. Code is available at this https URL.

Title: PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Authors: Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, Deng Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17048
Pdf URL: https://arxiv.org/pdf/2411.17048
Copy Paste: [[2411.17048]] PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation(https://arxiv.org/abs/2411.17048)
Keywords: generation
Abstract: Current text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but it is still under-explored in identity-specific human video generation with customized ID images. The key challenge lies in maintaining high ID fidelity consistently while preserving the original motion dynamic and semantic following after the identity injection. Current video identity customization methods mainly rely on reconstructing given identity images on text-to-image models, which have a divergent distribution with the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed PersonalVideo, that applies direct supervision on videos synthesized by the T2V model to bridge the gap. Specifically, we introduce a learnable Isolated Identity Adapter to customize the specific identity non-intrusively, which does not compromise the original T2V model's abilities (e.g., motion dynamic and semantic following). With the non-reconstructive identity loss, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, gaining good robustness even with only a single reference image available. Extensive experiments demonstrate our method's superiority in delivering high identity faithfulness while preserving the inherent video generation qualities of the original T2V model, outshining prior approaches. Notably, our PersonalVideo seamlessly integrates with pre-trained SD components, such as ControlNet and style LoRA, requiring no extra tuning overhead.

Title: A generalised novel loss function for computational fluid dynamics

Authors: Zachary Cooper-Baldock, Paulo E. Santos, Russell S.A. Brinkworth, Karl Sammut
Subjects: cs.LG, cs.CV, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2411.17059
Pdf URL: https://arxiv.org/pdf/2411.17059
Copy Paste: [[2411.17059]] A generalised novel loss function for computational fluid dynamics(https://arxiv.org/abs/2411.17059)
Keywords: generative
Abstract: Computational fluid dynamics (CFD) simulations are crucial in automotive, aerospace, maritime and medical applications, but are limited by the complexity, cost and computational requirements of directly calculating the flow, often taking days of compute time. Machine-learning architectures, such as controlled generative adversarial networks (cGANs), hold significant potential in enhancing or replacing CFD investigations, due to cGANs' ability to approximate the underlying data distribution of a dataset. Unlike traditional cGAN applications, where the entire image carries information, CFD data contains small regions of highly variant data, immersed in a large context of low variance that is of minimal importance. This renders inefficient most existing deep learning techniques, which give equal importance to every portion of the data during training. To mitigate this, a novel loss function called Gradient Mean Squared Error (GMSE) is proposed, which automatically and dynamically identifies the regions of importance on a field-by-field basis, assigning appropriate weights according to the local variance. To assess the effectiveness of the proposed solution, three identical networks were trained, optimised respectively with Mean Squared Error (MSE) loss, the proposed GMSE loss and a dynamic variant of GMSE (DGMSE). The novel loss function resulted in faster loss convergence, correlating to reduced training time, whilst also displaying an 83.6% reduction in structural similarity error between the generated field and ground truth simulations, a 76.6% higher maximum rate of loss and an increased ability to fool a discriminator network. It is hoped that this loss function will enable accelerated machine learning within computational fluid dynamics.

Title: Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning

Authors: Xinyi Gao, Yayong Li, Tong Chen, Guanhua Ye, Wentao Zhang, Hongzhi Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17063
Pdf URL: https://arxiv.org/pdf/2411.17063
Copy Paste: [[2411.17063]] Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning(https://arxiv.org/abs/2411.17063)
Keywords: generation
Abstract: With the increasing computation of training graph neural networks (GNNs) on large-scale graphs, graph condensation (GC) has emerged as a promising solution to synthesize a compact, substitute graph of the large-scale original graph for efficient GNN training. However, existing GC methods predominantly employ classification as the surrogate task for optimization, thus excessively relying on node labels and constraining their utility in label-sparsity scenarios. More critically, this surrogate task tends to overfit class-specific information within the condensed graph, consequently restricting the generalization capabilities of GC for other downstream tasks. To address these challenges, we introduce Contrastive Graph Condensation (CTGC), which adopts a self-supervised surrogate task to extract critical, causal information from the original graph and enhance the cross-task generalizability of the condensed graph. Specifically, CTGC employs a dual-branch framework to disentangle the generation of the node attributes and graph structures, where a dedicated structural branch is designed to explicitly encode geometric information through nodes' positional embeddings. By implementing an alternating optimization scheme with contrastive loss terms, CTGC promotes the mutual enhancement of both branches and facilitates high-quality graph generation through the model inversion technique. Extensive experiments demonstrate that CTGC excels in handling various downstream tasks with a limited number of labels, consistently outperforming state-of-the-art GC methods.

Title: Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models

Authors: Colin Conwell, Rupert Tawiah-Quashie, Tomer Ullman
Subjects: cs.CV, cs.CL, cs.SC
Abstract URL: https://arxiv.org/abs/2411.17066
Pdf URL: https://arxiv.org/pdf/2411.17066
Copy Paste: [[2411.17066]] Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models(https://arxiv.org/abs/2411.17066)
Keywords: generative
Abstract: Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these 'logical probes', and find that none reliably produce human agreement scores greater than 50%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a 'grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams, directly quantifying, for example, the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates for 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence ('approximate numeracy') of prompts involving integers. We conclude by discussing the limitations inherent to 'grounded' multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g. DALL-E 3), or under-specified syntactical constraints (e.g. 'grounded diffusion'), and propose minimal modifications (inspired by development, based in imagery) that could help to bridge the lingering compositional gap between scale and structure. All data and code is available at this https URL

Title: ΩSFormer: Dual-Modal Ω-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction

Authors: Chang Li, Yu Wang, Ce Zhang, Yongjun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17088
Pdf URL: https://arxiv.org/pdf/2411.17088
Copy Paste: [[2411.17088]] ΩSFormer: Dual-Modal Ω-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction(https://arxiv.org/abs/2411.17088)
Keywords: super-resolution
Abstract: Terraced fields are a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal Ω-like super-resolution Transformer network for intelligent terraced field vectorization extraction (TFVE), offering the following advantages: (1) reducing edge segmentation error from the conventional multi-scale downsampling encoder by fusing original high-resolution features with downsampled features at each encoder step and leveraging a multi-head attention mechanism; (2) improving the accuracy of TFVE by proposing an Ω-like network structure, which fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels by a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging a contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from semantic segmentation results. Moreover, a DMRVD for deep-learning-based TFVE was created for the first time, which covers nine study areas in four provinces of China, with a total coverage area of 22441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared. The mIOU of ΩSFormer improved by 0.165, 0.297 and 0.128, respectively, when compared with the best-accuracy single-modal remotely sensed imagery, single-modal DEM and dual-modal results.

Title: Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Authors: Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2411.17089
Pdf URL: https://arxiv.org/pdf/2411.17089
Copy Paste: [[2411.17089]] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation(https://arxiv.org/abs/2411.17089)
Keywords: generation
Abstract: Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations, enabling GPUs to perform only the incremental computation required for each new token. This approach significantly lowers the computational overhead for token generation. However, the memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. In this paper, we introduce an efficient CPU-GPU I/O-aware LLM inference method that avoids transferring the entire KV cache from CPU to GPU by recomputing partial KV cache from activations while concurrently transferring the remaining KV cache via PCIe bus. This approach overlaps GPU recomputation with data transfer to minimize idle GPU time and maximize inference performance. Our method is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that our method achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
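
The overlap of GPU recomputation with PCIe transfer described in the abstract can be illustrated with a two-number cost model. This is a simplified sketch under stated assumptions, not the paper's scheduler: it assumes both costs scale linearly with the share of the KV cache they handle, ignores the cost of moving the activations needed for recomputation, and uses hypothetical function names and timings.

```python
def best_recompute_fraction(transfer_all_ms, recompute_all_ms):
    """Fraction of the KV cache to recompute on the GPU so that
    recomputation and the concurrent transfer finish together.

    transfer_all_ms:  time to move the whole KV cache over PCIe.
    recompute_all_ms: time to recompute the whole KV cache on the GPU.
    """
    return transfer_all_ms / (transfer_all_ms + recompute_all_ms)


def overlapped_latency_ms(fraction, transfer_all_ms, recompute_all_ms):
    """Latency when a fraction is recomputed while the rest is transferred;
    the two run concurrently, so the slower one determines the total."""
    recompute_ms = fraction * recompute_all_ms
    transfer_ms = (1.0 - fraction) * transfer_all_ms
    return max(recompute_ms, transfer_ms)


if __name__ == "__main__":
    # Hypothetical timings, chosen only for illustration.
    transfer_all, recompute_all = 3.0, 1.0
    f = best_recompute_fraction(transfer_all, recompute_all)
    print(f, overlapped_latency_ms(f, transfer_all, recompute_all))
```

In this toy model, transferring everything costs `transfer_all_ms`, while the balanced split costs less than either pure strategy, which is the intuition behind recomputing only part of the cache.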

Title: PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution

Authors: Libo Zhu, Jianze Li, Haotong Qin, Yulun Zhang, Yong Guo, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17106
Pdf URL: https://arxiv.org/pdf/2411.17106
Copy Paste: [[2411.17106]] PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution(https://arxiv.org/abs/2411.17106)
Keywords: super-resolution
Abstract: Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even when the denoising process has been reduced to a single step, they still incur high computational and storage costs, making deployment on hardware devices difficult. To address these issues, we propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR. First, we simplify the OSD model to two core components, UNet and Variational Autoencoder (VAE), by removing the CLIPEncoder. Secondly, we propose Learnable Boundary Quantizer (LBQ) and Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR at 8-bit and 6-bit precision obtains visual results comparable to the full-precision model. Moreover, our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be at this https URL.
摘要：基于扩散的图像超分辨率 (SR) 模型以多个去噪步骤为代价表现出优异的性能。然而，即使去噪步骤已减少到一个，它们也需要很高的计算成本和存储要求，因此难以在硬件设备上部署。为了解决这些问题，我们提出了一种具有自适应尺度的新型训练后量化方法，用于一步扩散 (OSD) 图像 SR，PassionSR。首先，我们通过删除 CLIPEncoder 将 OSD 模型简化为两个核心组件，UNet 和变分自动编码器 (VAE)。其次，我们提出了可学习边界量化器 (LBQ) 和可学习等效变换 (LET) 来优化量化过程并操纵激活分布以实现更好的量化。最后，我们设计了一种分布式量化校准 (DQC) 策略，该策略可以稳定量化参数的训练以实现快速收敛。综合实验表明，8 位和 6 位的 PassionSR 可获得与全精度模型相当的视觉效果。此外，我们的 PassionSR 比近期领先的图像 SR 低位量化方法具有显著优势。我们的代码将位于此 https URL 中。
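
The abstract mentions a Learnable Boundary Quantizer. As a rough illustration of what quantization boundaries control, the sketch below applies plain uniform fake quantization with fixed lower and upper clipping bounds. It is not the paper's LBQ: the bounds here are passed in by hand rather than learned, and `fake_quantize` is a hypothetical helper name.

```python
def fake_quantize(values, lower, upper, n_bits):
    """Uniformly quantize floats to 2**n_bits levels inside [lower, upper].

    Values outside the bounds are clipped first. The result is returned
    in floating point ("fake" quantization), which is how quantization
    error is usually simulated during calibration.
    """
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    levels = 2 ** n_bits - 1              # number of steps between bounds
    scale = (upper - lower) / levels      # width of one quantization step
    out = []
    for v in values:
        clipped = min(max(v, lower), upper)
        index = round((clipped - lower) / scale)  # nearest integer level
        out.append(index * scale + lower)
    return out


if __name__ == "__main__":
    print(fake_quantize([-2.0, 0.4, 2.0], lower=-1.0, upper=1.0, n_bits=2))
```

Tighter bounds give finer steps for typical values but clip more outliers, which is presumably the trade-off that learned boundaries are meant to tune.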

Title: OSDFace: One-Step Diffusion Model for Face Restoration

Authors: Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17163
Pdf URL: https://arxiv.org/pdf/2411.17163
Copy Paste: [[2411.17163]] OSDFace: One-Step Diffusion Model for Face Restoration(https://arxiv.org/abs/2411.17163)
Keywords: restoration, generative
Abstract: Diffusion models have demonstrated impressive performance in face restoration. Yet, their multi-step inference process remains computationally intensive, limiting their applicability in real-world scenarios. Moreover, existing methods often struggle to generate face images that are harmonious, realistic, and consistent with the subject's identity. In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. Specifically, we propose a visual representation embedder (VRE) to better capture prior information and understand the input face. In VRE, low-quality faces are processed by a visual tokenizer and subsequently embedded with a vector-quantized dictionary to generate visual prompts. Additionally, we incorporate a facial identity loss derived from face recognition to further ensure identity consistency. We further employ a generative adversarial network (GAN) as a guidance model to encourage distribution alignment between the restored face and the ground truth. Experimental results demonstrate that OSDFace surpasses current state-of-the-art (SOTA) methods in both visual quality and quantitative metrics, generating high-fidelity, natural face images with high identity consistency. The code and model will be released at this https URL.
摘要：扩散模型在人脸恢复方面表现出色。然而，它们的多步推理过程仍然需要大量计算，这限制了它们在现实场景中的适用性。此外，现有方法通常难以生成和谐、逼真且与主体身份一致的人脸图像。在这项工作中，我们提出了一种用于人脸恢复的新型一步扩散模型 OSDFace。具体来说，我们提出了一种视觉表示嵌入器 (VRE) 来更好地捕获先验信息并理解输入的人脸。在 VRE 中，低质量的人脸由视觉标记器处理，然后嵌入矢量量化词典以生成视觉提示。此外，我们结合了人脸识别产生的面部身份损失，以进一步确保身份一致性。我们进一步采用生成对抗网络 (GAN) 作为指导模型，以鼓励恢复的人脸与基本事实之间的分布对齐。实验结果表明，OSDFace 在视觉质量和定量指标方面均超越了当前最先进的 (SOTA) 方法，生成了具有高身份一致性的高保真、自然的人脸图像。代码和模型将会在这个https URL上发布。

Title: X-MeshGraphNet: Scalable Multi-Scale Graph Neural Networks for Physics Simulation

Authors: Mohammad Amin Nabian
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2411.17164
Pdf URL: https://arxiv.org/pdf/2411.17164
Copy Paste: [[2411.17164]] X-MeshGraphNet: Scalable Multi-Scale Graph Neural Networks for Physics Simulation(https://arxiv.org/abs/2411.17164)
Keywords: generation
Abstract: Graph Neural Networks (GNNs) have gained significant traction for simulating complex physical systems, with models like MeshGraphNet demonstrating strong performance on unstructured simulation meshes. However, these models face several limitations, including scalability issues, requirement for meshing at inference, and challenges in handling long-range interactions. In this work, we introduce X-MeshGraphNet, a scalable, multi-scale extension of MeshGraphNet designed to address these challenges. X-MeshGraphNet overcomes the scalability bottleneck by partitioning large graphs and incorporating halo regions that enable seamless message passing across partitions. This, combined with gradient aggregation, ensures that training across partitions is equivalent to processing the entire graph at once. To remove the dependency on simulation meshes, X-MeshGraphNet constructs custom graphs directly from CAD files by generating uniform point clouds on the surface or volume of the object and connecting k-nearest neighbors. Additionally, our model builds multi-scale graphs by iteratively combining coarse and fine-resolution point clouds, where each level refines the previous, allowing for efficient long-range interactions. Our experiments demonstrate that X-MeshGraphNet maintains the predictive accuracy of full-graph GNNs while significantly improving scalability and flexibility. This approach eliminates the need for time-consuming mesh generation at inference, offering a practical solution for real-time simulation across a wide range of applications. The code for reproducing the results presented in this paper is available through NVIDIA Modulus: this http URL.
摘要：图神经网络 (GNN) 在模拟复杂物理系统方面获得了巨大关注，其中 MeshGraphNet 等模型在非结构化模拟网格上表现出色。然而，这些模型面临一些限制，包括可扩展性问题、推理时网格划分的要求以及处理长距离交互的挑战。在这项工作中，我们引入了 X-MeshGraphNet，这是 MeshGraphNet 的可扩展、多尺度扩展，旨在解决这些挑战。X-MeshGraphNet 通过对大型图进行分区并合并光晕区域来克服可扩展性瓶颈，从而实现跨分区的无缝消息传递。这与梯度聚合相结合，可确保跨分区训练相当于一次处理整个图。为了消除对模拟网格的依赖，X-MeshGraphNet 通过在物体的表面或体积上生成均匀点云并连接 k 个最近邻，直接从 CAD 文件构建自定义图。此外，我们的模型通过迭代组合粗分辨率和精细分辨率点云来构建多尺度图，其中每个级别都会细化前一个级别，从而实现高效的远程交互。我们的实验表明，X-MeshGraphNet 保持了全图 GNN 的预测准确性，同时显著提高了可扩展性和灵活性。这种方法消除了推理时耗时的网格生成需求，为广泛应用中的实时模拟提供了实用的解决方案。用于重现本文中呈现的结果的代码可通过 NVIDIA Modulus 获得：此 http URL。
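
The graph construction step in this abstract (connecting k-nearest neighbors in a point cloud) can be sketched as follows. This is a brute-force illustration using only the Python standard library. The function name `knn_edges` and the toy points are assumptions, and the sketch does not cover point-cloud sampling from CAD files, partitioning, halo regions, or the multi-scale hierarchy.

```python
import math


def knn_edges(points, k):
    """Return directed edges (i, j) linking each point i to its k nearest
    neighbors j, by Euclidean distance. Ties are broken by lower index.

    Brute force, O(n^2) distance evaluations; fine for small examples only.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    edges = []
    for i in range(n):
        # Sort every other point by (distance to i, index).
        candidates = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        edges.extend((i, j) for _, j in candidates[:k])
    return edges


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (10, 10, 10)]
    print(knn_edges(cloud, k=2))
```

For large point clouds a spatial index such as a k-d tree would replace the quadratic search.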

Title: ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Authors: Chengyou Jia, Changliang Xia, Zhuohang Dang, Weijia Wu, Hangwei Qian, Minnan Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17176
Pdf URL: https://arxiv.org/pdf/2411.17176
Copy Paste: [[2411.17176]] ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting(https://arxiv.org/abs/2411.17176)
Keywords: generation, generative
Abstract: Despite the significant advancements in text-to-image (T2I) generative models, users often face a trial-and-error challenge in practical scenarios. This challenge arises from the complexity and uncertainty of tedious steps such as crafting suitable prompts, selecting appropriate models, and configuring specific arguments, making users resort to labor-intensive attempts for desired images. This paper proposes Automatic T2I generation, which aims to automate these tedious steps, allowing users to simply describe their needs in a freestyle chatting way. To systematically study this problem, we first introduce ChatGenBench, a novel benchmark designed for Automatic T2I. It features high-quality paired data with diverse freestyle inputs, enabling comprehensive evaluation of automatic T2I models across all steps. Additionally, recognizing Automatic T2I as a complex multi-step reasoning task, we propose ChatGen-Evo, a multi-stage evolution strategy that progressively equips models with essential automation skills. Through extensive evaluation across step-wise accuracy and image quality, ChatGen-Evo significantly enhances performance over various baselines. Our evaluation also uncovers valuable insights for advancing automatic T2I. All our data, code, and models will be available at this https URL.
摘要：尽管文本转图像 (T2I) 生成模型取得了重大进展，但用户在实际场景中经常面临反复试验的挑战。这一挑战源于繁琐步骤的复杂性和不确定性，例如制作合适的提示、选择合适的模型和配置特定参数，使用户不得不为获得所需的图像而进行劳动密集型的尝试。本文提出了自动 T2I 生成，旨在将这些繁琐的步骤自动化，让用户以自由聊天的方式简单地描述他们的需求。为了系统地研究这个问题，我们首先介绍了 ChatGenBench，这是一个为自动 T2I 设计的新基准。它具有高质量的配对数据和多样化的自由式输入，能够全面评估所有步骤中的自动 T2I 模型。此外，认识到自动 T2I 是一项复杂的多步骤推理任务，我们提出了 ChatGen-Evo，这是一种多阶段进化策略，可逐步为模型配备必要的自动化技能。通过对逐步准确性和图像质量的广泛评估，ChatGen-Evo 显著提高了各种基线的性能。我们的评估还揭示了推进自动化 T2I 的宝贵见解。我们所有的数据、代码和模型都将在 \url{此 https URL} 中提供

Title: LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

Authors: Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17178
Pdf URL: https://arxiv.org/pdf/2411.17178
Copy Paste: [[2411.17178]] LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization(https://arxiv.org/abs/2411.17178)
Keywords: generation
Abstract: Visual Autoregressive (VAR) modelling has emerged as a promising approach to image generation, offering competitive potential and performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we conducted an analysis and identified significant redundancy in three dimensions of the VAR model: (1) the attention map, (2) the attention outputs when using classifier-free guidance, and (3) the data precision. Correspondingly, we propose an efficient attention mechanism and a low-bit quantization method to enhance the efficiency of VAR models while maintaining performance. With negligible performance loss (less than 0.056 FID increase), we achieve an 85.2% reduction in attention computation, a 50% reduction in overall memory, and a 1.5x latency reduction. To ensure deployment feasibility, we developed efficient training-free compression techniques and analyzed the deployment feasibility and efficiency gain of each technique.
摘要：视觉自回归 (VAR) 已成为图像生成中一种很有前途的方法，具有可与基于扩散的模型相媲美的竞争潜力和性能。然而，当前基于 AR 的视觉生成模型需要大量的计算资源，限制了它们在资源受限设备上的适用性。为了解决这个问题，我们进行了分析并发现 VAR 模型在三个维度上存在显著的冗余：(1) 注意力图，(2) 使用无分类器指导时的注意力输出，以及 (3) 数据精度。相应地，我们提出了有效的注意力机制和低位量化方法来提高 VAR 模型的效率同时保持性能。在性能损失可忽略不计（FID 增加不到 0.056）的情况下，我们可以实现注意力计算减少 85.2%、整体内存减少 50% 和延迟减少 1.5 倍。为了确保部署可行性，我们开发了高效的免训练压缩技术，并分析了每种技术的部署可行性和效率增益。

Title: Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Authors: Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17188
Pdf URL: https://arxiv.org/pdf/2411.17188
Copy Paste: [[2411.17188]] Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment(https://arxiv.org/abs/2411.17188)
Keywords: generation
Abstract: Many real-world user queries (e.g. "How to make egg fried rice?") could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses on four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly on generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.
摘要：许多现实世界的用户查询（例如“如何制作蛋炒饭？”）可以从能够生成带有文本步骤和附带图像的响应的系统（类似于食谱）中受益。旨在生成交错文本和图像的模型在确保这些模式内部和跨模式一致性方面面临挑战。为了应对这些挑战，我们提出了 ISG，这是一个用于交错文本和图像生成的综合评估框架。ISG 利用场景图结构来捕获文本和图像块之间的关系，在四个粒度级别上评估响应：整体、结构、块级和图像特定。这种多层次评估允许对一致性、连贯性和准确性进行细致入微的评估，并提供可解释的问答反馈。与 ISG 一起，我们引入了一个基准 ISG-Bench，涵盖 8 个类别和 21 个子类别的 1,150 个样本。这个基准数据集包括复杂的语言视觉依赖关系和黄金答案，可以有效地评估模型在以视觉为中心的任务（例如风格转换，这是当前模型的一个挑战领域）上的表现。使用 ISG-Bench，我们证明了最近的统一视觉语言模型在生成交错内容方面表现不佳。虽然结合单独的语言和图像模型的组合方法在整体层面上比统一模型提高了 111%，但它们在块和图像层面上的性能仍然不是最佳的。为了促进未来的工作，我们开发了 ISG-Agent，这是一个采用“计划-执行-优化”管道来调用工具的基线代理，实现了 122% 的性能提升。

Title: PhysMotion: Physics-Grounded Dynamics From a Single Image

Authors: Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, Chenfanfu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17189
Pdf URL: https://arxiv.org/pdf/2411.17189
Copy Paste: [[2411.17189]] PhysMotion: Physics-Grounded Dynamics From a Single Image(https://arxiv.org/abs/2411.17189)
Keywords: generation, generative
Abstract: We introduce PhysMotion, a novel framework that leverages principled physics-based simulations to guide intermediate 3D representations generated from a single image and input conditions (e.g., applied force and torque), producing high-quality, physically plausible videos. By utilizing continuum mechanics-based simulations as prior knowledge, our approach addresses the limitations of traditional data-driven generative models and results in more consistent, physically plausible motions. Our framework begins by reconstructing a feed-forward 3D Gaussian from a single image through geometry optimization. This representation is then time-stepped using a differentiable Material Point Method (MPM) with continuum mechanics-based elastoplasticity models, which provides a strong foundation for realistic dynamics, albeit at a coarse level of detail. To enhance the geometry and appearance and to ensure spatiotemporal consistency, we refine the initial simulation using a text-to-image (T2I) diffusion model with cross-frame attention, resulting in a physically plausible video that retains intricate details comparable to the input image. We conduct comprehensive qualitative and quantitative evaluations to validate the efficacy of our method. Our project page is available at this https URL.
摘要：我们引入了 PhysMotion，这是一种新颖的框架，它利用基于物理学的原理模拟来指导从单个图像和输入条件（例如施加的力和扭矩）生成的中间 3D 表示，从而生成高质量、物理上合理的视频。通过利用基于连续力学的模拟作为先验知识，我们的方法解决了传统数据驱动生成模型的局限性，并产生了更一致的物理上合理的运动。我们的框架首先通过几何优化从单个图像重建前馈 3D 高斯。然后使用可微分材料点法 (MPM) 和基于连续力学的弹塑性模型对该表示进行时间步进，这为真实的动力学提供了坚实的基础，尽管细节程度较粗略。为了增强几何形状和外观并确保时空一致性，我们使用具有跨帧注意的文本到图像 (T2I) 扩散模型改进了初始模拟，从而生成了物理上可信的视频，该视频保留了与输入图像相当的复杂细节。我们进行了全面的定性和定量评估，以验证我们方法的有效性。我们的项目页面位于：\url{此 https URL}。

Title: MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution

Authors: Chengxing Xie, Xiaoming Zhang, Kai Zhang, Linze Li, Yuqian Fu, Biao Gong, Tianrui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17214
Pdf URL: https://arxiv.org/pdf/2411.17214
Copy Paste: [[2411.17214]] MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution(https://arxiv.org/abs/2411.17214)
Keywords: super-resolution
Abstract: Recent advances in image super-resolution (SR) have significantly benefited from the incorporation of Transformer architectures. However, conventional techniques aimed at enlarging the self-attention window to capture broader contexts come with inherent drawbacks, especially the significantly increased computational demands. Moreover, the feature perception within a fixed-size window of existing models restricts the effective receptive fields and the intermediate feature diversity. This study demonstrates that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce the Multi-Range Attention Transformer (MAT), tailored for SR tasks. MAT leverages the computational advantages inherent in the dilation operation, in conjunction with the self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Further coupled with local feature extraction, MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations. We also introduce the MSConvStar module, which augments the model's ability for multi-range representation learning. Comprehensive experiments show that our MAT exhibits superior performance to existing state-of-the-art SR models with remarkable efficiency (~3.3x faster than SRFormer-light).
摘要：图像超分辨率 (SR) 的最新进展极大地受益于 Transformer 架构的引入。然而，旨在扩大自注意力窗口以捕捉更广泛上下文的传统技术具有固有的缺点，尤其是计算需求的显著增加。此外，现有模型固定大小窗口内的特征感知限制了有效感受野和中间特征多样性。这项研究表明，灵活地整合跨不同空间范围的注意力可以显著提高性能。根据这一见解，我们引入了针对 SR 任务量身定制的多范围注意力变换器 (MAT)。MAT 利用扩张操作固有的计算优势，结合自注意力机制，促进多范围注意力 (MA) 和稀疏多范围注意力 (SMA)，从而能够有效捕捉区域和稀疏全局特征。此外，结合局部特征提取，MAT 能够熟练地捕捉跨不同空间范围的依赖关系，提高其特征表示的多样性和有效性。我们还引入了 MSConvStar 模块，增强了模型的多范围表示学习能力。综合实验表明，我们的 MAT 表现出优于现有最先进 SR 模型的性能，效率显著提高（比 SRFormer-light 快约 3.3 倍）。

Title: Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning

Authors: Hui-Yue Yang, Hui Chen, Ao Wang, Kai Chen, Zijia Lin, Yongliang Tang, Pengcheng Gao, Yuming Quan, Jungong Han, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17217
Pdf URL: https://arxiv.org/pdf/2411.17217
Copy Paste: [[2411.17217]] Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning(https://arxiv.org/abs/2411.17217)
Keywords: generation
Abstract: Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perception Tuning (SPT) method, aiming to enhance SAM's perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness. Models and codes will be available online.
摘要：任何分割模型 (SAM) 因其出色的泛化能力而在异常分割任务中取得了巨大进展。然而，现有的通过提示直接应用 SAM 的方法往往忽略了域转移问题，SAM 在自然图像上表现良好，但在工业场景中却表现不佳。参数高效微调 (PEFT) 提供了一种有前途的解决方案，但它可能无法充分解决适应异常图像过程中的感知挑战，从而导致性能不佳。在本文中，我们提出了一种新颖的自感知调整 (SPT) 方法，旨在增强 SAM 对异常分割的感知能力。SPT 方法采用了一种自起草调整策略，该策略生成异常蒙版的初始粗略草稿，然后进行细化过程。此外，还引入了视觉关系感知适配器来改善对蒙版生成的判别关系信息的感知。在多个基准数据集上的大量实验结果表明，我们的 SPT 方法可以显著优于基线方法，从而验证了其有效性。模型和代码将在线提供。

Title: AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

Authors: Jiarui Wang, Huiyu Duan, Guangtao Zhai, Juntong Wang, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17221
Pdf URL: https://arxiv.org/pdf/2411.17221
Copy Paste: [[2411.17221]] AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM(https://arxiv.org/abs/2411.17221)
Keywords: generation, quality assessment
Abstract: The rapid advancement of large multimodal models (LMMs) has led to the rapid expansion of artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in accurately assessing the perceptual quality of AIGVs due to the presence of unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which collects 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting precise video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods in terms of multiple perceptual quality dimensions.
摘要：大型多模态模型 (LMM) 的快速发展推动了人工智能生成视频 (AIGV) 的快速扩张，这凸显了对专门为 AIGV 设计的有效视频质量评估 (VQA) 模型的迫切需求。当前的 VQA 模型通常无法准确评估 AIGV 的感知质量，因为存在独特的失真，例如不切实际的物体、不自然的动作或不一致的视觉元素。为了应对这一挑战，我们首先提出了 AIGVQA-DB，这是一个大型数据集，包含 36,576 个 AIGV，由 15 个高级文本转视频模型使用 1,048 个不同的提示生成。利用这些 AIGV，设计了一个包括评分和排名过程的系统注释流程，迄今为止已收集了 370,000 条专家评分。在 AIGVQA-DB 的基础上，我们进一步引入了 AIGV-Assessor，这是一种新颖的 VQA 模型，利用时空特征和 LMM 框架来捕捉 AIGV 的复杂质量属性，从而准确预测精确的视频质量分数和视频对偏好。通过在 AIGVQA-DB 和现有 AIGV 数据库上进行的全面实验，AIGV-Assessor 展示了最先进的性能，在多个感知质量维度方面显著超越了现有的评分或评估方法。

Title: DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting

Authors: Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17223
Pdf URL: https://arxiv.org/pdf/2411.17223
Copy Paste: [[2411.17223]] DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting(https://arxiv.org/abs/2411.17223)
Keywords: generative
Abstract: Subject-driven image inpainting has emerged as a popular task in image editing alongside recent advancements in diffusion models. Previous methods primarily focus on identity preservation but struggle to maintain the editability of inserted objects. In response, this paper introduces DreamMix, a diffusion-based generative model adept at inserting target objects into given scenes at user-specified locations while concurrently enabling arbitrary text-driven modifications to their attributes. In particular, we leverage advanced foundational inpainting models and introduce a disentangled local-global inpainting framework to balance precise local object insertion with effective global visual coherence. Additionally, we propose an Attribute Decoupling Mechanism (ADM) and a Textual Attribute Substitution (TAS) module to improve the diversity and discriminative capability of the text-based attribute guidance, respectively. Extensive experiments demonstrate that DreamMix effectively balances identity preservation and attribute editability across various application scenarios, including object insertion, attribute editing, and small object inpainting. Our code is publicly available at this https URL.
摘要：随着扩散模型的最新进展，主题驱动的图像修复已成为图像编辑中的一项流行任务。以前的方法主要侧重于身份保存，但难以保持插入对象的可编辑性。为此，本文介绍了 DreamMix，这是一种基于扩散的生成模型，擅长将目标对象插入到用户指定位置的给定场景中，同时允许对其属性进行任意文本驱动的修改。具体来说，我们利用先进的基础修复模型并引入解缠的局部全局修复框架来平衡精确的局部对象插入和有效的全局视觉连贯性。此外，我们提出了一种属性解耦机制 (ADM) 和文本属性替换 (TAS) 模块，分别用于提高基于文本的属性指导的多样性和判别能力。大量实验表明，DreamMix 有效地平衡了各种应用场景中的身份保存和属性可编辑性，包括对象插入、属性编辑和小对象修复。我们的代码在此 https URL 上公开提供。

Title: MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers

Authors: Ruoxi Zhu, Zhengzhong Tu, Jiaming Liu, Alan C. Bovik, Yibo Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17226
Pdf URL: https://arxiv.org/pdf/2411.17226
Copy Paste: [[2411.17226]] MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers(https://arxiv.org/abs/2411.17226)
Keywords: restoration
Abstract: Restoring images captured under adverse weather conditions is a fundamental task for many computer vision applications. However, most existing weather restoration approaches are only capable of handling a specific type of degradation, which is often insufficient in real-world scenarios, such as rainy-snowy or rainy-hazy weather. Towards being able to address these situations, we propose a multi-weather Transformer, or MWFormer for short, which is a holistic vision Transformer that aims to solve multiple weather-induced degradations using a single, unified architecture. MWFormer uses hyper-networks and feature-wise linear modulation blocks to restore images degraded by various weather types using the same set of learned parameters. We first employ contrastive learning to train an auxiliary network that extracts content-independent, distortion-aware feature embeddings that efficiently represent predicted weather types, of which more than one may occur. Guided by these weather-informed predictions, the image restoration Transformer adaptively modulates its parameters to conduct both local and global feature processing, in response to multiple possible weather. Moreover, MWFormer allows for a novel way of tuning, during application, to either a single type of weather restoration or to hybrid weather restoration without any retraining, offering greater controllability than existing methods. Our experimental results on multi-weather restoration benchmarks show that MWFormer achieves significant performance improvements compared to existing state-of-the-art methods, without requiring much computational cost. Moreover, we demonstrate that our methodology of using hyper-networks can be integrated into various network architectures to further boost their performance. The code is available at: this https URL
摘要：恢复在恶劣天气条件下拍摄的图像是许多计算机视觉应用的基本任务。然而，大多数现有的天气恢复方法只能处理特定类型的退化，这在现实世界中往往是不够的，例如雨雪天气或雨雾天气。为了能够解决这些情况，我们提出了一种多天气 Transformer，简称 MWFormer，这是一种整体视觉 Transformer，旨在使用单一统一的架构解决多种天气引起的退化问题。MWFormer 使用超网络和特征线性调制块，使用同一组学习参数恢复因各种天气类型而退化的图像。我们首先采用对比学习来训练辅助网络，该网络提取与内容无关、失真感知的特征嵌入，这些嵌入可以有效地表示预测的天气类型，其中可能出现多种天气类型。在这些天气预报的指导下，图像恢复 Transformer 自适应地调节其参数以进行局部和全局特征处理，以应对多种可能的天气。此外，MWFormer 允许在应用过程中以新颖的方式调整单一类型的天气恢复或混合天气恢复，而无需任何再训练，从而提供比现有方法更高的可控性。我们在多天气恢复基准上的实验结果表明，与现有的最先进方法相比，MWFormer 实现了显着的性能改进，而无需太多的计算成本。此外，我们证明了使用超网络的方法可以集成到各种网络架构中，以进一步提高其性能。代码可在以下位置获得：此 https URL
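
The abstract names feature-wise linear modulation blocks. In its general form, feature-wise linear modulation scales and shifts each feature channel with parameters produced from a conditioning signal. The sketch below shows only that per-channel affine step. The function name `film`, the nested-list feature layout, and the hand-picked `gamma` and `beta` values are assumptions. In the paper, such parameters would presumably come from the weather-aware embeddings, which this sketch does not model.

```python
def film(features, gamma, beta):
    """Apply y = gamma[c] * x + beta[c] to every value x in channel c.

    features: list of channels, each a list of floats.
    gamma, beta: one scale and one shift per channel.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two values each
    print(film(feats, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```

Because only `gamma` and `beta` change with the conditioning signal, the same backbone weights can serve different degradation types.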

Title: From Graph Diffusion to Graph Classification

Authors: Jia Jun Cheng Xian, Sadegh Mahdavi, Renjie Liao, Oliver Schulte
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17236
Pdf URL: https://arxiv.org/pdf/2411.17236
Copy Paste: [[2411.17236]] From Graph Diffusion to Graph Classification(https://arxiv.org/abs/2411.17236)
Keywords: generation, generative
Abstract: Generative models such as diffusion models have achieved remarkable success in state-of-the-art image and text tasks. Recently, score-based diffusion models have extended their success beyond image generation, showing competitive performance with discriminative methods in image classification tasks [zimmermann2021score]. However, their application to classification in the graph domain, which presents unique challenges such as complex topologies, remains underexplored. We show how graph diffusion models can be applied for graph classification. We find that to achieve competitive classification accuracy, score-based graph diffusion models should be trained with a novel training objective that is tailored to graph classification. In experiments with a sampling-based inference method, our discriminative training objective achieves state-of-the-art graph classification accuracy.
摘要：生成模型（例如扩散模型）在最先进的图像和文本任务中取得了显著的成功。最近，基于分数的扩散模型已将其成功扩展到图像生成之外，在图像分类任务中表现出与判别方法相媲美的性能~\cite{zimmermann2021score}。然而，它们在图形领域分类中的应用仍未得到充分探索，这带来了复杂拓扑等独特挑战。我们展示了如何将图扩散模型应用于图分类。我们发现，为了实现具有竞争力的分类准确率，基于分数的图扩散模型应该使用针对图分类量身定制的新型训练目标进行训练。在使用基于采样的推理方法的实验中，我们的判别训练目标实现了最先进的图分类准确率。

Title: Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment

Authors: Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17237
Pdf URL: https://arxiv.org/pdf/2411.17237
Copy Paste: [[2411.17237]] Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment(https://arxiv.org/abs/2411.17237)
Keywords: quality assessment
Abstract: The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates the model grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate the more fine-grained IQA application. Code: this https URL.
摘要：多模态大型语言模型 (MLLM) 的发展使得通过自然语言描述来评估图像质量成为可能。这一进步使得更详细的评估成为可能。然而，这些基于 MLLM 的 IQA 方法主要依赖于一般的上下文描述，有时会限制细粒度的质量评估。为了解决这一限制，我们引入了一种新的图像质量评估 (IQA) 任务范式，grounding-IQA。该范式将多模态参考和 grounding 与 IQA 相结合，以实现更细粒度的质量感知。具体而言，grounding-IQA 包括两个子任务：grounding-IQA-描述 (GIQA-DES) 和视觉问答 (GIQA-VQA)。GIQA-DES 涉及具有精确位置的详细描述（例如边界框），而 GIQA-VQA 则侧重于局部区域的质量 QA。为了实现 grounding-IQA，我们通过我们提出的自动注释流程构建了相应的数据集 GIQA-160K。此外，我们开发了一个精心设计的基准 GIQA-Bench。该基准从三个角度全面评估模型接地 IQA 性能：描述质量、VQA 准确性和接地精度。实验表明，我们提出的任务范式、数据集和基准有助于更细粒度的 IQA 应用。代码：此 https URL。

Title: Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Authors: Junyuan Deng, Wei Yin, Xiaoyang Guo, Qian Zhang, Xiaotao Hu, Weiqiang Ren, Xiaoxiao Long, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17240
Pdf URL: https://arxiv.org/pdf/2411.17240
Copy Paste: [[2411.17240]] Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration(https://arxiv.org/abs/2411.17240)
Keywords: generation
Abstract: In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach significantly outperforms baselines and provides broad benefits to 3D vision tasks. Code is available at this https URL.
摘要：在本文中，我们介绍了 DM-Calib，这是一种基于扩散的方法，用于从单个输入图像估计针孔相机固有参数。单目相机校准对于许多 3D 视觉任务至关重要。然而，大多数现有方法依赖于手工假设或受到有限训练数据的限制，导致在各种现实世界图像中的泛化能力较差。在海量数据上训练的稳定扩散模型的最新进展表明，它能够生成具有各种特征的高质量图像。新出现的证据表明，这些模型隐式地捕捉了相机焦距和图像内容之间的关系。基于这一见解，我们探索如何利用扩散模型的强大先验进行单目针孔相机校准。具体来说，我们引入了一种新的基于图像的表示，称为相机图像，它无损编码数值相机固有参数并与扩散框架无缝集成。使用这种表示，我们将估计相机固有参数的问题重新表述为以输入图像为条件生成密集相机图像。通过微调稳定的扩散模型以从单个 RGB 输入生成相机图像，我们可以通过 RANSAC 操作提取相机内在特性。我们进一步证明，我们的单目校准方法可提高各种 3D 任务的性能，包括零样本度量深度估计、3D 计量、姿势估计和稀疏视图重建。在多个公共数据集上进行的大量实验表明，我们的方法明显优于基线，并为 3D 视觉任务提供了广泛的益处。代码可在此 https URL 上找到。

Title: HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator

Authors: Fan Yang, Ru Zhen, Jianing Wang, Yanhao Zhang, Haoxiang Chen, Haonan Lu, Sicheng Zhao, Guiguang Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.17261
Pdf URL: https://arxiv.org/pdf/2411.17261
Copy Paste: [[2411.17261]] HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator(https://arxiv.org/abs/2411.17261)
Keywords: generation
Abstract: AIGC images are prevalent across various fields, yet they frequently suffer from quality issues like artifacts and unnatural textures. Specialized models aim to predict defect region heatmaps but face two primary challenges: (1) lack of explainability, failing to provide reasons and analyses for subtle defects, and (2) inability to leverage common sense and logical reasoning, leading to poor generalization. Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details; and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable image Implausibility Evaluator. We introduce the CoT-Driven Explainable Trinity Evaluator, which integrates heatmaps, scores, and explanation outputs, using CoT to decompose complex tasks into subtasks of increasing difficulty and enhance interpretability. Our Adaptive Hierarchical Implausibility Mapper synergizes low-level image features with high-level mapper tokens from LLMs, enabling precise local-to-global hierarchical heatmap predictions through an uncertainty-based adaptive token approach. Moreover, we propose a new dataset: Expl-AIGI-Eval, designed to facilitate interpretable implausibility evaluation of AIGC images. Our method demonstrates state-of-the-art performance through extensive experiments.
摘要：AIGC 图像在各个领域都很普遍，但它们经常受到伪影和不自然纹理等质量问题的困扰。专门的模型旨在预测缺陷区域热图，但面临两个主要挑战：（1）缺乏可解释性，无法提供细微缺陷的原因和分析，以及（2）无法利用常识和逻辑推理，导致泛化能力差。多模态大型语言模型 (MLLM) 有望提供更好的理解和推理能力，但也面临着自身的挑战：（1）由于捕捉微小细节的局限性，难以进行细粒度缺陷定位；（2）在提供精确热图生成所需的像素级输出方面受到限制。为了应对这些挑战，我们提出了 HEIE：一种基于 MLLM 的新型分层可解释图像不合理性评估器。我们引入了 CoT 驱动的可解释三位一体评估器，它集成了热图、分数和解释输出，使用 CoT 将复杂任务分解为难度不断增加的子任务并增强可解释性。我们的自适应分层不合理性映射器将低级图像特征与 LLM 中的高级映射器标记协同作用，通过基于不确定性的自适应标记方法实现精确的局部到全局分层热图预测。此外，我们提出了一个新的数据集：Expl-AIGI-Eval，旨在促进对 AIGC 图像进行可解释的不合理性评估。我们的方法通过大量实验展示了最先进的性能。

Title: Reward Incremental Learning in Text-to-Image Generation

Authors: Maorong Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17310
Pdf URL: https://arxiv.org/pdf/2411.17310
Copy Paste: [[2411.17310]] Reward Incremental Learning in Text-to-Image Generation(https://arxiv.org/abs/2411.17310)
Keywords: generation
Abstract: The recent success of denoising diffusion models has significantly advanced text-to-image generation. While these large-scale pretrained models show excellent performance in general image synthesis, downstream objectives often require fine-tuning to meet specific criteria such as aesthetics or human preference. Reward gradient-based strategies are promising in this context, yet existing methods are limited to single-reward tasks, restricting their applicability in real-world scenarios that demand adapting to multiple objectives introduced incrementally over time. In this paper, we first define this more realistic and unexplored problem, termed Reward Incremental Learning (RIL), where models are desired to adapt to multiple downstream objectives incrementally. Additionally, while the models adapt to the ever-emerging new objectives, we observe a unique form of catastrophic forgetting in diffusion model fine-tuning, affecting both metric-wise and visual structure-wise image quality. To address this catastrophic forgetting challenge, we propose Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead, enabling stable performance across sequential reward tasks. The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios. The source code of our work will be publicly available upon acceptance.
摘要：去噪扩散模型的最新成功显著推动了文本到图像的生成。虽然这些大规模预训练模型在一般图像合成中表现出色，但下游目标通常需要进行微调以满足美学或人类偏好等特定标准。基于奖励梯度的策略在这种情况下很有前景，但现有方法仅限于单一奖励任务，限制了它们在需要适应随时间逐步引入的多个目标的现实场景中的适用性。在本文中，我们首先定义了这个更现实和未探索的问题，称为奖励增量学习 (RIL)，其中模型需要逐步适应多个下游目标。此外，虽然模型适应不断出现的新目标，但我们观察到扩散模型微调中存在一种独特的灾难性遗忘形式，影响度量和视觉结构方面的图像质量。为了应对这一灾难性的遗忘挑战，我们提出了奖励增量蒸馏 (RID)，这是一种以最小的计算开销缓解遗忘的方法，可在连续奖励任务中实现稳定的性能。实验结果证明了 RID 在 RIL 场景中实现一致、高质量生成的有效性。我们工作的源代码将在接受后公开提供。

Title: DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering

Authors: Zhihui Zhang, Xiaoshuai Hao, Hanning Yuan, Lianhua Chi, Qi Guo, Qi Li, Ziqiang Yuan, Jinhui Pang, Yexin Li, Sijie Ruan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17354
Pdf URL: https://arxiv.org/pdf/2411.17354
Copy Paste: [[2411.17354]] DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering(https://arxiv.org/abs/2411.17354)
Keywords: generation
Abstract: Multi-view contrastive clustering (MVCC) has gained significant attention for generating consistent clustering structures from multiple views through contrastive learning. However, most existing MVCC methods create cross-views by combining any two views, leading to a high volume of unreliable pairs. Furthermore, these approaches often overlook discrepancies in multi-view representations, resulting in representation degeneration. To address these challenges, we introduce a novel model called Dual-Weighted Contrastive Learning (DWCL) for Multi-View Clustering. Specifically, to reduce the impact of unreliable cross-views, we introduce an innovative Best-Other (B-O) contrastive mechanism that enhances the representation of individual views at a low computational cost. Furthermore, we develop a dual weighting strategy that combines a view quality weight, reflecting the quality of each view, with a view discrepancy weight. This approach effectively mitigates representation degeneration by downplaying cross-views that are both low in quality and high in discrepancy. We theoretically validate the efficiency of the B-O contrastive mechanism and the effectiveness of the dual weighting strategy. Extensive experiments demonstrate that DWCL outperforms previous methods across eight multi-view datasets, showcasing superior performance and robustness in MVCC. Specifically, our method achieves absolute accuracy improvements of 5.4% and 5.6% compared to state-of-the-art methods on the Caltech6V7 and MSRCv1 datasets, respectively.
摘要：多视图对比聚类 (MVCC) 因通过对比学习从多个视图生成一致的聚类结构而备受关注。然而，大多数现有的 MVCC 方法通过组合任意两个视图来创建交叉视图，从而导致大量不可靠的对。此外，这些方法通常会忽略多视图表示中的差异，从而导致表示退化。为了应对这些挑战，我们引入了一种称为双加权对比学习 (DWCL) 的多视图聚类新模型。具体来说，为了减少不可靠的交叉视图的影响，我们引入了一种创新的最佳-其他 (B-O) 对比机制，以较低的计算成本增强了各个视图的表示。此外，我们开发了一种双重加权策略，将反映每个视图质量的视图质量权重与视图差异权重相结合。这种方法通过淡化质量低且差异大的交叉视图有效地缓解了表示退化。我们从理论上验证了 B-O 对比机制的效率和双重加权策略的有效性。大量实验表明，DWCL 在八个多视图数据集上的表现优于以前的方法，在 MVCC 中展现出卓越的性能和稳健性。具体来说，与 Caltech6V7 和 MSRCv1 数据集上最先进的方法相比，我们的方法分别实现了 5.4% 和 5.6% 的绝对准确率提升。

Title: AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation

Authors: Ziyi Xu, Ziyao Huang, Juan Cao, Yong Zhang, Xiaodong Cun, Qing Shuai, Yuchen Wang, Linchao Bao, Jintao Li, Fan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17383
Pdf URL: https://arxiv.org/pdf/2411.17383
Copy Paste: [[2411.17383]] AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation(https://arxiv.org/abs/2411.17383)
Keywords: generation
Abstract: The automatic generation of anchor-style product promotion videos presents promising opportunities in online commerce, advertising, and consumer engagement. However, this remains a challenging task despite significant advancements in pose-guided human video generation. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Additionally, we introduce the HOI-region reweighting loss, a training objective that enhances the learning of object details. Extensive experiments demonstrate that our proposed system outperforms existing methods in preserving object appearance and shape awareness, while simultaneously maintaining consistency in human appearance and motion. Project page: this https URL
摘要：自动生成主播风格的产品推广视频为在线商务、广告和消费者参与带来了良好的机会。然而，尽管姿势引导的人体视频生成取得了重大进展，但这仍然是一项艰巨的任务。在应对这一挑战时，我们将人与物体交互 (HOI) 集成到姿势引导的人体视频生成中视为核心问题。为此，我们引入了 AnchorCrafter，这是一种基于扩散的新型系统，旨在生成以目标人物和定制物体为特色的 2D 视频，实现高视觉保真度和可控交互。具体来说，我们提出了两项关键创新：HOI 外观感知，它从任意多视角增强了物体外观识别并解开物体和人的外观；以及 HOI 运动注入，它通过克服物体轨迹调节和遮挡管理方面的挑战来实现复杂的人与物体交互。此外，我们引入了 HOI 区域重加权损失，这是一个增强物体细节学习的训练目标。大量实验表明，我们提出的系统在保留物体外观和形状感知方面优于现有方法，同时保持了人体外观和运动的一致性。项目页面：此 https URL

Title: CoA: Chain-of-Action for Generative Semantic Labels

Authors: Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17406
Pdf URL: https://arxiv.org/pdf/2411.17406
Copy Paste: [[2411.17406]] CoA: Chain-of-Action for Generative Semantic Labels(https://arxiv.org/abs/2411.17406)
Keywords: generative
Abstract: Recent advances in vision-language models (VLMs) have demonstrated remarkable capability in image classification. These VLMs leverage a predefined set of categories to construct text prompts for zero-shot reasoning. However, in more open-ended domains like autonomous driving, using a predefined set of labels becomes impractical, as the semantic label space is unknown and constantly evolving. Additionally, fixed embedding text prompts often tend to predict a single label (while in reality, multiple labels commonly exist per image). In this paper, we introduce CoA, an innovative Chain-of-Action method that generates labels aligned with all contextually relevant features of an image. CoA is designed based on the observation that enriched and valuable contextual information improves generative performance during inference. Traditional vision-language models tend to output singular and redundant responses. Therefore, we employ a tailored CoA to alleviate this problem. We first break down the generative labeling task into detailed actions and construct a CoA leading to the final generative objective. Each action extracts and merges key information from the previous action and passes the enriched information as context to the next action, ultimately improving the VLM in generating comprehensive and accurate semantic labels. We assess the effectiveness of CoA through comprehensive evaluations on widely-used benchmark datasets, and the results demonstrate significant improvements across key performance metrics.
摘要：视觉语言模型 (VLM) 的最新进展已展示出其在图像分类方面的卓越能力。这些 VLM 利用一组预定义的类别来构建零样本推理的文本提示。然而，在自动驾驶等更开放的领域，使用一组预定义的标签变得不切实际，因为语义标签空间是未知的且不断发展。此外，固定嵌入的文本提示往往倾向于预测单个标签（而实际上，每个图像通常存在多个标签）。在本文中，我们介绍了 CoA，这是一种创新的行动链 (CoA) 方法，可生成与图像的所有上下文相关特征一致的标签。CoA 的设计基于这样的观察：丰富而有价值的上下文信息可提高推理过程中的生成性能。传统的视觉语言模型倾向于输出单一且冗余的响应。因此，我们采用量身定制的 CoA 来缓解这个问题。我们首先将生成标记任务分解为详细的操作，并构建一个 CoA 以达到最终的生成目标。每个动作都会从上一个动作中提取和合并关键信息，并将丰富的信息作为上下文传递给下一个动作，最终提高 VLM 生成全面而准确的语义标签的能力。我们通过对广泛使用的基准数据集进行全面评估来评估 CoA 的有效性，结果表明关键性能指标均有显著改善。

Title: DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters

Authors: Mingze Sun, Junhao Chen, Junting Dong, Yurun Chen, Xinyu Jiang, Shiwei Mao, Puhua Jiang, Jingbo Wang, Bo Dai, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17423
Pdf URL: https://arxiv.org/pdf/2411.17423
Copy Paste: [[2411.17423]] DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters(https://arxiv.org/abs/2411.17423)
Keywords: generation, generative
Abstract: Recent advances in generative models have enabled high-quality 3D character reconstruction from multi-modal inputs. However, animating these generated characters remains a challenging task, especially for complex elements like garments and hair, due to the lack of large-scale datasets and effective rigging methods. To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. Unlike existing methods, DRiVE utilizes a 3D Gaussian representation, facilitating efficient animation and high-quality rendering. We further introduce GSDiff, a 3D Gaussian-based diffusion module that predicts joint positions as spatial distributions, overcoming the limitations of regression-based approaches. Extensive experiments demonstrate that DRiVE achieves precise rigging results, enabling realistic dynamics for clothing and hair, and surpassing previous methods in both quality and versatility. The code and dataset will be made public for academic use upon acceptance.
摘要：生成模型的最新进展使得从多模态中实现高质量的 3D 角色重建成为可能。然而，由于缺乏大规模数据集和有效的装配方法，为这些生成的角色制作动画仍然是一项艰巨的任务，尤其是对于服装和头发等复杂元素。为了解决这一差距，我们整理了 AnimeRig，这是一个具有详细骨架和蒙皮注释的大型数据集。在此基础上，我们提出了 DRiVE，这是一个用于生成和装配具有复杂结构的 3D 人体角色的新框架。与现有方法不同，DRiVE 采用 3D 高斯表示，有助于实现高效的动画和高质量的渲染。我们进一步介绍了 GSDiff，这是一个基于 3D 高斯的扩散模块，它将关节位置预测为空间分布，克服了基于回归的方法的局限性。大量实验表明，DRiVE 实现了精确的装配结果，使服装和头发具有逼真的动态效果，并且在质量和多功能性方面都超越了以前的方法。代码和数据集将在被接受后公开用于学术用途。

Title: Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2411.17440
Pdf URL: https://arxiv.org/pdf/2411.17440
Copy Paste: [[2411.17440]] Identity-Preserving Text-to-Video Generation by Frequency Decomposition(https://arxiv.org/abs/2411.17440)
Keywords: generation, generative
Abstract: Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.
摘要：身份保留文本转视频 (IPT2V) 生成旨在创建具有一致人类身份的高保真视频。这是视频生成中的一项重要任务，但对于生成模型来说仍然是一个悬而未决的问题。本文将 IPT2V 的技术前沿推向了文献中尚未解决的两个方向：(1) 无需繁琐的逐案微调的免调优流程，以及 (2) 基于频率感知的启发式身份保留 DiT 控制方案。我们提出了 ConsisID，这是一种无需调优的基于 DiT 的可控 IPT2V 模型，可在生成的视频中保持人类身份的一致性。受扩散变压器频率分析的先前发现的启发，它在频域中采用身份控制信号，其中面部特征可以分解为低频全局特征和高频内在特征。首先，从低频角度来看，我们引入了一个全局面部提取器，它将参考图像和面部关键点编码到潜在空间中，生成富含低频信息的特征。然后将这些特征集成到网络的浅层中，以减轻与 DiT 相关的训练挑战。其次，从高频角度来看，我们设计了一个局部面部提取器来捕获高频细节并将其注入到变压器块中，从而增强了模型保留细粒度特征的能力。我们提出了一种分层训练策略，利用频率信息进行身份保存，将原始预训练视频生成模型转换为 IPT2V 模型。大量实验表明，我们的频率感知启发式方案为基于 DiT 的模型提供了最佳控制解决方案。得益于此方案，我们的 ConsisID 可以生成高质量、身份保存的视频，朝着更有效的 IPT2V 迈进了一步。

Title: VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Authors: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17451
Pdf URL: https://arxiv.org/pdf/2411.17451
Copy Paste: [[2411.17451]] VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models(https://arxiv.org/abs/2411.17451)
Keywords: generative
Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.
摘要：视觉语言生成奖励模型 (VL-GenRM) 在协调和评估多模态 AI 系统方面发挥着至关重要的作用，但它们自身的评估仍未得到充分探索。当前的评估方法主要依赖于传统 VL 任务中 AI 注释的偏好标签，这可能会引入偏见，并且通常无法有效挑战最先进的模型。为了解决这些限制，我们引入了 VL-RewardBench，这是一个涵盖一般多模态查询、视觉幻觉检测和复杂推理任务的综合基准。通过结合样本选择和人工验证的 AI 辅助注释管道，我们整理了 1,250 个高质量示例，专门用于探测模型的局限性。对 16 个领先的大型视觉语言模型进行全面评估，证明了 VL-RewardBench 作为具有挑战性的测试平台的有效性，其中即使是 GPT-4o 也只能达到 65.4% 的准确率，而最先进的开源模型（如 Qwen2-VL-72B）也难以超越随机猜测。重要的是，VL-RewardBench 上的表现与使用 VL-GenRM 的 Best-of-N 采样的 MMMU-Pro 准确率高度相关（Pearson's r > 0.9）。分析实验揭示了改进 VL-GenRM 的三个关键见解：(i) 模型主要在基本视觉感知任务而不是推理任务上失败；(ii) 推理时间扩展效益因模型容量而异；(iii) 训练 VL-GenRM 学习判断可大幅提高判断能力（7B VL-GenRM 的准确率提高了 14.7%）。我们相信 VL-RewardBench 以及实验见解将成为推进 VL-GenRM 的宝贵资源。
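
For readers unfamiliar with the correlation statistic quoted in this abstract, here is a minimal sketch of Pearson's r. The function name `pearson_r` and the idea of feeding it one benchmark score and one downstream accuracy per model are illustrative assumptions; the abstract does not describe the exact evaluation procedure, and no numbers from the paper are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations, up to a shared 1/n factor.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)
```

In this setting, `xs` would hold each reward model's benchmark accuracy and `ys` the downstream accuracy obtained when that model ranks Best-of-N candidates; a value near 1 means the benchmark ordering predicts the downstream ordering.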

Title: FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval

Authors: Jingyou Xie, Jiayi Kuang, Zhenzhou Lin, Jiarui Ouyang, Zishuo Zhao, Ying Shen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.17454
Pdf URL: https://arxiv.org/pdf/2411.17454
Copy Paste: [[2411.17454]] FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval(https://arxiv.org/abs/2411.17454)
Keywords: generation
Abstract: Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality with the target domain including classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown great few-shot or zero-shot learning performance. However, they still suffer challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop a gate residual network to fuse CLIP features with projected features, reducing feature degradation in X-shot scenarios. Experimental results on four benchmark datasets show a 7%-15% improvement over state-of-the-art methods, with ablation studies demonstrating enhancement of CLIP features.
摘要：给定一个模态的查询，小样本跨模态检索 (CMR) 会检索另一个模态中语义相似的实例，其中目标域包括与源域不相交的类。与经典的小样本 CMR 方法相比，CLIP 等视觉语言预训练方法表现出了出色的小样本或零样本学习性能。然而，它们仍然面临挑战，因为 (1) 在目标域中遇到的特征退化和 (2) 极端的数据不平衡。为了解决这些问题，我们提出了 FLEX-CLIP，一种新颖的特征级生成网络增强 CLIP。FLEX-CLIP 包括两个训练阶段。在多模态特征生成中，我们提出了一个复合多模态 VAE-GAN 网络来捕获真实的特征分布模式并基于 CLIP 特征生成伪样本，从而解决数据不平衡问题。对于公共空间投影，我们开发了一个门残差网络来将 CLIP 特征与投影特征融合，从而减少 X-shot 场景中的特征退化。在四个基准数据集上的实验结果显示，与最先进的方法相比，该方法提高了 7%-15%，并且消融研究表明 CLIP 特征得到增强。

Title: Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers

Authors: Fatemeh Nourilenjan Nokabadi, Jean-Francois Lalonde, Christian Gagné
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17468
Pdf URL: https://arxiv.org/pdf/2411.17468
Copy Paste: [[2411.17468]] Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers(https://arxiv.org/abs/2411.17468)
Keywords: generation
Abstract: Adversarial perturbations aim to deceive neural networks into predicting inaccurate results. For visual object trackers, adversarial attacks have been developed to generate perturbations by manipulating the outputs. However, transformer trackers predict a specific bounding box instead of an object candidate list, which limits the applicability of many existing attack scenarios. To address this issue, we present a novel white-box approach to attack visual object trackers with transformer backbones using only one bounding box. From the tracker predicted bounding box, we generate a list of adversarial bounding boxes and compute the adversarial loss for those bounding boxes. Experimental results demonstrate that our simple yet effective attack outperforms existing attacks against several robust transformer trackers, including TransT-M, ROMTrack, and MixFormer, on popular benchmark tracking datasets such as GOT-10k, UAV123, and VOT2022STS.
摘要：对抗性扰动旨在欺骗神经网络预测不准确的结果。对于视觉对象跟踪器，已经开发出对抗性攻击来通过操纵输出来产生扰动。然而，Transformer 跟踪器预测的是特定边界框而不是对象候选列表，这限制了许多现有攻击场景的适用性。为了解决这个问题，我们提出了一种新颖的白盒方法，仅使用一个边界框来攻击具有 Transformer 主干的视觉对象跟踪器。从跟踪器预测的边界框中，我们生成一个对抗性边界框列表并计算这些边界框的对抗性损失。实验结果表明，在流行的基准跟踪数据集（例如 GOT-10k、UAV123 和 VOT2022STS）上，我们简单而有效的攻击优于针对几种强大的 Transformer 跟踪器（包括 TransT-M、ROMTrack 和 MixFormer）的现有攻击。

Title: Towards Precise Scaling Laws for Video Diffusion Transformers

Authors: Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17470
Pdf URL: https://arxiv.org/pdf/2411.17470
Copy Paste: [[2411.17470]] Towards Precise Scaling Laws for Video Diffusion Transformers(https://arxiv.org/abs/2411.17470)
Keywords: generation
Abstract: Achieving optimal performance of video diffusion transformers within a given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
摘要：由于视频扩散变压器的训练成本高昂，因此在给定数据和计算预算内实现其最佳性能至关重要。这需要在进行大规模训练之前精确确定最佳模型大小和训练超参数。虽然语言模型采用缩放定律来预测性能，但它们在视觉生成模型中的存在和准确推导仍未得到充分探索。在本文中，我们系统地分析了视频扩散变压器的缩放定律并确认了它们的存在。此外，我们发现，与语言模型不同，视频扩散模型对学习率和批量大小更敏感，这两个超参数通常没有精确建模。为了解决这个问题，我们提出了一种新的缩放定律，可以预测任何模型大小和计算预算的最佳超参数。在这些最佳设置下，我们在 1e10 TFlops 的计算预算内实现了可比的性能，并将推理成本与传统缩放方法相比降低了 40.1%。此外，我们在验证损失、任何模型大小和计算预算之间建立了更普遍和精确的关系。这使得非最优模型大小的性能预测成为可能，这在实际推理成本约束下也可能具有吸引力，从而实现更好的权衡。
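
As background on what fitting a scaling law involves, the sketch below fits the generic power-law form L = a * C^(-b) by least squares in log-log space. This is the textbook single-variable form, not the law proposed in the paper, which the abstract says also covers learning rate, batch size, and model size. The names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) via linear regression on logs."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of log(loss) against log(compute) equals -b.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def predict_loss(a, b, compute):
    """Extrapolate the fitted law to a new compute budget."""
    return a * compute ** (-b)
```

A typical use is to fit `a` and `b` on a handful of small training runs and call `predict_loss` at the target budget before committing to large-scale training.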

Title: Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Authors: Eric Hanchen Jiang, Yasi Zhang, Zhi Zhang, Yixin Wan, Andrew Lizarraga, Shufan Li, Ying Nian Wu
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.17472
Pdf URL: https://arxiv.org/pdf/2411.17472
Copy Paste: [[2411.17472]] Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory(https://arxiv.org/abs/2411.17472)
Keywords: generative
Abstract: Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images from textual prompts. Despite these advances, existing models struggle with complex prompts involving multiple objects and attributes, often misaligning modifiers with their corresponding nouns or neglecting certain elements. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding and a lack of robust generalization guarantees. Leveraging the PAC-Bayes framework, we propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties, including divergence between objects, alignment between modifiers and their corresponding nouns, minimal attention to irrelevant tokens, and regularization for better generalization. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment. We demonstrate the effectiveness of our method on standard benchmarks, achieving state-of-the-art results across multiple metrics. By integrating custom priors into the denoising process, our method enhances image quality and addresses long-standing challenges in T2I diffusion models, paving the way for more reliable and interpretable generative models.
摘要：文本到图像 (T2I) 扩散模型通过从文本提示中生成高保真、多样化且视觉逼真的图像，彻底改变了生成模型。尽管取得了这些进展，但现有模型仍难以处理涉及多个对象和属性的复杂提示，通常会使修饰语与其对应的名词不对齐或忽略某些元素。最近基于注意力的方法改进了对象包含和语言绑定，但仍然面临着属性错误绑定和缺乏强大的泛化保证等挑战。利用 PAC-Bayes 框架，我们提出了一种贝叶斯方法，该方法设计了注意力分布的自定义先验以强制执行所需的属性，包括对象之间的差异、修饰语与其对应名词之间的对齐、对不相关标记的最小关注以及正则化以实现更好的泛化。我们的方法将注意力机制视为可解释的组件，从而实现细粒度控制和改进的属性对象对齐。我们在标准基准上证明了我们方法的有效性，并在多个指标上取得了最先进的结果。通过将自定义先验集成到去噪过程中，我们的方法提高了图像质量并解决了 T2I 扩散模型中长期存在的挑战，为更可靠、更可解释的生成模型铺平了道路。

Title: Puzzle Similarity: A Perceptually-guided No-Reference Metric for Artifact Detection in 3D Scene Reconstructions

Authors: Nicolai Hermann, Jorge Condor, Piotr Didyk
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17489
Pdf URL: https://arxiv.org/pdf/2411.17489
Copy Paste: [[2411.17489]] Puzzle Similarity: A Perceptually-guided No-Reference Metric for Artifact Detection in 3D Scene Reconstructions(https://arxiv.org/abs/2411.17489)
Keywords: restoration, quality assessment
Abstract: Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of no-reference image metrics in predicting detailed artifact maps. The absence of such quality metrics hinders accurate predictions of the quality of generated views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. In this work, we propose a new no-reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the input views to establish a scene-specific distribution that is later used to identify poorly reconstructed regions in the novel views. We test and evaluate our method in the context of 3D reconstruction; to this end, we collected a novel dataset of human quality assessment in unseen reconstructed views. Through this dataset, we demonstrate that our method can not only successfully localize artifacts in novel views, correlating with human assessment, but do so without direct references. Surprisingly, our metric outperforms both no-reference metrics and popular full-reference image metrics. We can leverage our new metric to enhance applications like automatic image restoration, guided acquisition, or 3D reconstruction from sparse inputs.
摘要：现代重建技术可以从稀疏的二维视图有效地建模复杂的三维场景。然而，由于缺乏地面实况图像以及无参考图像指标在预测详细伪影图方面的局限性，自动评估新视图的质量并识别伪影是一项挑战。缺乏这样的质量指标会阻碍对生成视图质量的准确预测，并限制采用后处理技术（如修复）来提高重建质量。在这项工作中，我们提出了一种新的无参考指标，即拼图相似度，旨在定位新视图中的伪影。我们的方法利用来自输入视图的图像块统计数据来建立特定于场景的分布，该分布随后用于识别新视图中重建不佳的区域。我们在三维重建的背景下测试和评估我们的方法；为此，我们收集了一个在看不见的重建视图中进行人工质量评估的新型数据集。通过这个数据集，我们证明了我们的方法不仅可以成功地定位新视图中的伪影，与人工评估相关联，而且无需直接参考即可做到这一点。令人惊讶的是，我们的指标优于无参考指标和流行的全参考图像指标。我们可以利用我们的新指标来增强自动图像恢复、引导采集或从稀疏输入进行 3D 重建等应用。

Title: Perceptually Optimized Super Resolution

Authors: Volodymyr Karpenko, Taimoor Tariq, Jorge Condor, Piotr Didyk
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.17513
Pdf URL: https://arxiv.org/pdf/2411.17513
Copy Paste: [[2411.17513]] Perceptually Optimized Super Resolution(https://arxiv.org/abs/2411.17513)
Keywords: super-resolution
Abstract: Modern deep-learning-based super-resolution techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system to image details changes depending on the underlying content characteristics, such as spatial frequency, luminance, color, contrast, or motion. This observation hints that computational resources spent on up-sampling visual content may be wasted whenever a viewer cannot resolve the results. Motivated by this observation, we propose a perceptually inspired and architecture-agnostic approach for controlling the visual quality and efficiency of super-resolution techniques. The core is a perceptual model that dynamically guides super-resolution methods according to human sensitivity to image details. Our technique leverages the limitations of the human visual system to improve the efficiency of super-resolution techniques by focusing computational resources on perceptually important regions, judged on the basis of factors such as adapting luminance, contrast, spatial frequency, motion, and viewing conditions. We demonstrate the application of our proposed model in combination with network branching and network complexity reduction to improve the computational efficiency of super-resolution methods without visible quality loss. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPS by factors of 2x and greater, without sacrificing perceived quality.
摘要：现代基于深度学习的超分辨率技术独立于底层内容和观看条件来处理图像和视频。然而，人类视觉系统对图像细节的敏感度会根据底层内容特征（例如空间频率、亮度、颜色、对比度或运动）而变化。这一观察暗示，当观看者无法解析结果时，用于上采样视觉内容的计算资源可能会被浪费。受这一观察的启发，我们提出了一种受感知启发且与架构无关的方法来控制超分辨率技术的视觉质量和效率。核心是一个感知模型，它根据人类对图像细节的敏感度动态指导超分辨率方法。我们的技术利用人类视觉系统的局限性，通过将计算资源集中在感知上重要的区域来提高超分辨率技术的效率；根据调整亮度、对比度、空间频率、运动和观看条件等因素进行判断。我们展示了我们提出的模型与网络分支和网络复杂度降低相结合的应用，以提高超分辨率方法的计算效率，而不会出现明显的质量损失。定量和定性评估（包括用户研究）证明了我们的方法的有效性，它可以将 FLOPS 减少 2$\mathbf{x}$ 倍或更多，而不会牺牲感知质量。

Title: FTMoMamba: Motion Generation with Frequency and Text State Space Models

Authors: Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17532
Pdf URL: https://arxiv.org/pdf/2411.17532
Copy Paste: [[2411.17532]] FTMoMamba: Motion Generation with Frequency and Text State Space Models(https://arxiv.org/abs/2411.17532)
Keywords: generation
Abstract: Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representations, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static poses (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, notably attaining the lowest FID of 0.181 (substantially lower than the 0.421 of MLD) on the HumanML3D dataset.
摘要：扩散模型在人体运动生成方面取得了令人印象深刻的表现。然而，当前的方法通常忽略了频域信息在捕捉潜在空间内的细粒度运动方面的重要性（例如，低频与静态姿势相关，高频与细粒度运动一致）。此外，文本和运动之间存在语义差异，导致生成的动作与文本描述不一致。在这项工作中，我们提出了一种基于扩散的新型 FTMoMamba 框架，该框架配备了频率状态空间模型 (FreqSSM) 和文本状态空间模型 (TextSSM)。具体来说，为了学习细粒度表示，FreqSSM 将序列分解为低频和高频分量，分别指导静态姿势（例如，坐下、躺下）和细粒度运动（例如，过渡、跌倒）的生成。为了确保文本和运动之间的一致性，TextSSM 在句子级别对文本特征进行编码，将文本语义与序列特征对齐。大量实验表明，FTMoMamba 在文本到动作生成任务上取得了优异的表现，尤其是在 HumanML3D 数据集上获得了最低 FID 0.181（低于 MLD 的 0.421）。
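
Note (illustrative sketch, not from the paper): the low/high-frequency split mentioned in the abstract can be pictured with a plain FFT low-pass mask along the time axis. This is not the authors' FreqSSM, which operates inside a state space model in latent space; the function name `split_frequencies`, the cutoff parameter `keep`, and the toy input are assumptions made only for illustration.

```python
import numpy as np

def split_frequencies(motion: np.ndarray, keep: int = 4):
    """Split a (frames, features) sequence into low- and high-frequency parts.

    `keep` is the number of lowest temporal frequency bins (including DC)
    assigned to the low-frequency component; the rest is high-frequency.
    Both the name and the cutoff rule are hypothetical.
    """
    spectrum = np.fft.rfft(motion, axis=0)            # FFT along the time axis
    low_spec = np.zeros_like(spectrum)
    low_spec[:keep] = spectrum[:keep]                 # keep only the slow bins
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    high = motion - low                               # residual = fast detail
    return low, high

# Toy sequence: a slow drift plus a small fast jitter on a single feature.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
slow = np.sin(2 * np.pi * 1 * t)
fast = 0.1 * np.sin(2 * np.pi * 20 * t)
motion = (slow + fast)[:, None]
low, high = split_frequencies(motion, keep=4)
```

In this toy case the slow sine lands entirely in `low` and the jitter in `high`, and the two parts always sum back to the input by construction.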

Title: IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation - An Enhanced Prototype-Guided Diffusion Framework

Authors: Anurag Shandilya, Swapnil Bhat, Akshat Gautam, Subhash Yadav, Siddharth Bhatt, Deval Mehta, Kshitij Jadhav
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17535
Pdf URL: https://arxiv.org/pdf/2411.17535
Copy Paste: [[2411.17535]] IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation - An Enhanced Prototype-Guided Diffusion Framework(https://arxiv.org/abs/2411.17535)
Keywords: generation, generative
Abstract: Generative models have proven to be very effective in generating synthetic medical images and find applications in downstream tasks such as enhancing rare disease datasets, long-tailed dataset augmentation, and scaling machine learning algorithms. For medical applications, the medical images synthesized by such models are still reasonable in quality when evaluated with traditional metrics such as FID score, precision, and recall. However, these metrics fail to capture the medical/biological plausibility of the generated images. Human expert feedback has been used to assess biological plausibility, and it shows that these generated images have very low plausibility. Recently, the research community has further integrated this human feedback through Reinforcement Learning from Human Feedback (RLHF), which generates more medically plausible images. However, incorporating human feedback is a costly and slow process. In this work, we propose a novel approach to improve the medical plausibility of generated images without the need for human feedback. We introduce IMPROVE: Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework, a prototype-guided diffusion process for medical image generation, and show that it substantially enhances the biological plausibility of the generated medical images without the need for any human feedback. We perform experiments on the Bone Marrow and HAM10000 datasets and show that medical accuracy can be substantially increased without human feedback.
摘要：生成模型已被证明在生成合成医学图像方面非常有效，并可用于下游任务，例如增强罕见疾病数据集、长尾数据集增强和扩展机器学习算法。对于医疗应用，当基于 FID 分数、精度和召回率等传统指标进行评估时，此类模型合成生成的医学图像的质量仍然合理。然而，这些指标无法捕捉生成图像的医学/生物学合理性。人类专家的反馈已被用来获得生物学合理性，这表明这些生成的图像的合理性非常低。最近，研究界通过强化学习从人类反馈 (RLHF) 进一步整合了这种人类反馈，从而生成更多医学上合理的图像。然而，融入人类反馈是一个昂贵而缓慢的过程。在这项工作中，我们提出了一种新方法来提高生成图像的医学合理性，而无需人工反馈。我们引入了 IMPROVE：在不依赖人工验证的情况下提高医学可信度 - 增强原型引导扩散框架，这是一种用于医学图像生成的原型引导扩散过程，并表明它大大提高了生成的医学图像的生物可信度，而无需任何人工反馈。我们在骨髓和 HAM10000 数据集上进行了实验，并表明无需人工反馈即可大幅提高医学准确性。

Title: Pre-training for Action Recognition with Automatically Generated Fractal Datasets

Authors: Davyd Svyezhentsev, George Retsinas, Petros Maragos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17584
Pdf URL: https://arxiv.org/pdf/2411.17584
Copy Paste: [[2411.17584]] Pre-training for Action Recognition with Automatically Generated Fractal Datasets(https://arxiv.org/abs/2411.17584)
Keywords: generative
Abstract: In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging, etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright, and privacy. We extend this trend to the video domain, applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on the established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition, and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at this https URL.
摘要：近年来，人们对合成数据的兴趣日益浓厚，特别是在预训练图像模态以支持一系列计算机视觉任务（包括对象分类、医学成像等）的背景下。先前的研究表明，由各种生成过程自动生成的合成样本可以替代真实样本并产生强大的视觉表现。这种方法解决了与真实数据相关的问题，例如收集和标记成本、版权和隐私。我们将这一趋势扩展到视频领域，将其应用于动作识别任务。利用分形几何，我们提出了自动生成短合成视频片段的大规模数据集的方法，可用于预训练神经模型。生成的视频片段具有显著的多样性，这源于分形生成复杂多尺度结构的先天能力。为了缩小领域差距，我们进一步确定了真实视频的关键属性，并在预训练期间仔细模拟它们。通过彻底的消融，我们确定了加强下游结果的属性，并为使用合成视频进行预训练提供了一般指导。通过在已建立的动作识别数据集 HMDB51 和 UCF101 以及其他四个与群体动作识别、细粒度动作识别和动态场景相关的视频基准上微调预训练模型来评估所提出的方法。与标准 Kinetics 预训练相比，我们报告的结果在部分下游数据集上接近甚至更胜一筹。代码和合成视频样本可在此 https URL 上找到。
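
Note (illustrative sketch, not from the paper): the abstract says only that the clips are built with fractal geometry. One common way to render fractals is an iterated function system (IFS) sampled with the chaos game, and a clip can be obtained by interpolating the IFS parameters over time. Whether the authors do exactly this is an assumption; the names `render_ifs` and `fractal_clip`, the Sierpinski-style maps, and all parameter values are hypothetical.

```python
import numpy as np

def render_ifs(maps, size=64, n_points=5000, seed=0):
    """Render one binary frame of an IFS fractal with the chaos game.

    `maps` has shape (k, 2, 3); each entry is an affine map written [A | b].
    """
    rng = np.random.default_rng(seed)
    choices = rng.integers(0, len(maps), size=n_points)
    pts = np.empty((n_points, 2))
    p = np.zeros(2)
    for i, c in enumerate(choices):
        p = maps[c, :, :2] @ p + maps[c, :, 2]    # apply a randomly chosen map
        pts[i] = p
    pts = pts[100:]                               # drop burn-in points
    # Normalise the point cloud into pixel coordinates.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    ij = ((pts - lo) / np.maximum(hi - lo, 1e-9) * (size - 1)).astype(int)
    frame = np.zeros((size, size), dtype=np.uint8)
    frame[ij[:, 1], ij[:, 0]] = 1
    return frame

def fractal_clip(maps_start, maps_end, n_frames=8, size=64):
    """Linearly interpolate IFS parameters to get a short animated clip."""
    frames = []
    for a in np.linspace(0.0, 1.0, n_frames):
        maps = (1 - a) * maps_start + a * maps_end
        frames.append(render_ifs(maps, size=size))
    return np.stack(frames)                       # (n_frames, size, size)

# Sierpinski-style maps (contraction 0.5) and a slightly sheared copy.
base = np.array([[[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]],
                 [[0.5, 0.0, 0.5], [0.0, 0.5, 0.0]],
                 [[0.5, 0.0, 0.25], [0.0, 0.5, 0.5]]])
target = base.copy()
target[:, 0, 1] += 0.1                            # add a small shear
clip = fractal_clip(base, target)
```

Sampling different random maps per clip would give different fractal categories; the interpolation above is just one simple way to introduce motion.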

Title: VideoDirector: Precise Video Editing via Text-to-Video Models

Authors: Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, Yulan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17592
Pdf URL: https://arxiv.org/pdf/2411.17592
Copy Paste: [[2411.17592]] VideoDirector: Precise Video Editing via Text-to-Video Models(https://arxiv.org/abs/2411.17592)
Keywords: generation, generative
Abstract: Although the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generative ability, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tight Spatial-temporal Coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated Spatial-temporal Layout. The vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.
摘要：尽管使用文本到图像 (T2I) 模型的典型反转后编辑范式已经显示出良好的结果，但将其直接扩展到文本到视频 (T2V) 模型仍然会出现严重的伪影，例如颜色闪烁和内容失真。因此，当前的视频编辑方法主要依赖于 T2I 模型，而 T2I 模型本质上缺乏时间相干性生成能力，通常导致较差的编辑结果。在本文中，我们将典型编辑范式的失败归因于：1）紧密的时空耦合。基于 vanilla 枢轴的反转策略难以解开视频扩散模型中的时空信息；2）复杂的时空布局。vanilla 交叉注意控制在保留未编辑内容方面存在不足。为了解决这些限制，我们提出了一种时空解耦指导 (STDG) 和多帧空文本优化策略，为更精确的枢轴反转提供枢轴时间线索。此外，我们引入了自注意力控制策略，以保持更高的保真度，从而实现精确的部分内容编辑。实验结果表明，我们的方法（称为 VideoDirector）有效地利用了 T2V 模型强大的时间生成能力，在准确性、运动流畅度、真实感和对未编辑内容的保真度方面，制作出具有最先进性能的编辑视频。

Title: Accelerating Vision Diffusion Transformers with Skip Branches

Authors: Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Cheng Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17616
Pdf URL: https://arxiv.org/pdf/2411.17616
Copy Paste: [[2411.17616]] Accelerating Vision Diffusion Transformers with Skip Branches(https://arxiv.org/abs/2411.17616)
Keywords: generation
Abstract: The Diffusion Transformer (DiT), an emerging image and video generation model architecture, has demonstrated great potential because of its high generation quality and scalability. Despite the impressive performance, its practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. While feature caching across timesteps has proven effective in accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through empirical analysis of DiT feature dynamics, we identify that significant feature variation between DiT blocks presents a key challenge for feature reusability. To address this, we convert standard DiT into Skip-DiT with skip branches to enhance feature smoothness. Further, we introduce Skip-Cache, which utilizes the skip branches to cache DiT features across timesteps at inference time. We validate the effectiveness of our proposal on different DiT backbones for video and image generation, showing that skip branches help preserve generation quality and achieve higher speedup. Experimental results indicate that Skip-DiT achieves a 1.5x speedup almost for free and a 2.2x speedup with only a minor reduction in quantitative metrics. Code is available at this https URL.
摘要：扩散变换器 (DiT) 是一种新兴的图像和视频生成模型架构，由于其高生成质量和可扩展性而显示出巨大的潜力。尽管性能令人印象深刻，但其实际部署受到顺序去噪过程中计算复杂性和冗余的限制。虽然跨时间步长的特征缓存已被证明可有效加速扩散模型，但其在 DiT 中的应用受到与基于 U-Net 的方法的基本架构差异的限制。通过对 DiT 特征动态的实证分析，我们发现 DiT 块之间的显著特征变化对特征可重用性提出了关键挑战。为了解决这个问题，我们将标准 DiT 转换为带有跳过分支的 Skip-DiT，以增强特征平滑度。此外，我们引入了 Skip-Cache，它利用跳过分支在推理时跨时间步长缓存 DiT 特征。我们在用于视频和图像生成的不同 DiT 主干上验证了我们的提案的有效性，展示了跳过分支有助于保持生成质量并实现更高的加速。实验结果表明，Skip-DiT 几乎不费吹灰之力就实现了 1.5 倍的加速，在定量指标略有下降的情况下实现了 2.2 倍的加速。代码可从此 https URL 获取。
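
Note (illustrative sketch, not from the paper): the general idea of caching deep features across timesteps while a skip branch carries fresh shallow features can be pictured with a toy NumPy model. This is not the authors' Skip-DiT or Skip-Cache implementation; the class `ToySkipModel`, the function `sample`, the refresh `interval`, and the update rule are all assumptions made for illustration.

```python
import numpy as np

class ToySkipModel:
    """Toy stand-in for a denoiser: first block -> deep stack -> last block.

    The last block also receives the first block's output through a skip
    branch, so the deep stack's output can be reused from a cache.
    """

    def __init__(self, dim=8, depth=6, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.first = rng.normal(size=(dim, dim)) * scale
        self.deep = [rng.normal(size=(dim, dim)) * scale for _ in range(depth)]
        self.last = rng.normal(size=(2 * dim, dim)) * scale
        self.cache = None
        self.deep_calls = 0    # counts how often the expensive stack runs

    def forward(self, x, use_cache=False):
        h0 = np.tanh(x @ self.first)              # cheap, always recomputed
        if use_cache and self.cache is not None:
            h = self.cache                        # reuse deep features
        else:
            h = h0
            for w in self.deep:                   # expensive middle stack
                h = np.tanh(h @ w)
            self.deep_calls += 1
            self.cache = h
        # Skip branch: fresh shallow features joined with (cached) deep ones.
        return np.concatenate([h0, h], axis=-1) @ self.last

def sample(model, x, steps=10, interval=2, step_size=0.1):
    """Toy denoising loop that refreshes the cache every `interval` steps."""
    for t in range(steps):
        eps = model.forward(x, use_cache=(t % interval != 0))
        x = x - step_size * eps
    return x

model = ToySkipModel()
x0 = np.random.default_rng(1).normal(size=(1, 8))
out = sample(model, x0, steps=10, interval=2)
```

With `interval=2` the deep stack runs on only half of the ten steps; the speed/quality trade-off reported in the abstract depends on the real architecture and is not reproduced by this toy.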

Title: Synthetic Data Generation with LLM for Improved Depression Prediction

Authors: Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, Shuhao Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.17672
Pdf URL: https://arxiv.org/pdf/2411.17672
Copy Paste: [[2411.17672]] Synthetic Data Generation with LLM for Improved Depression Prediction(https://arxiv.org/abs/2411.17672)
Keywords: generation
Abstract: Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, with this rapidly growing interest comes a growing concern for data privacy and scarcity due to the sensitivity of such a topic. In this paper, we propose a pipeline for Large Language Models (LLMs) to generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data from recorded transcripts of clinical interviews, we utilize an open-source LLM to generate synthetic data through chain-of-thought prompting. This pipeline involves two key steps: the first step is the generation of the synopsis and sentiment analysis based on the original transcript and depression score, while the second is the generation of the synthetic synopsis/sentiment analysis based on the summaries generated in the first step and a new depression score. Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset, thereby significantly enhancing the model's capability in predicting the intensity of the patient's depression. By leveraging LLMs to generate synthetic data that can be used to augment limited and imbalanced real-world datasets, we demonstrate a novel approach to addressing data scarcity and privacy concerns commonly faced in automatic depression detection, all while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.
摘要：抑郁症的自动检测是心理学和机器学习交叉领域中一个发展迅速的研究领域。然而，随着人们对抑郁症的兴趣呈指数级增长，由于这一话题的敏感性，人们对数据隐私和稀缺性的担忧也日益增加。在本文中，我们提出了一种大型语言模型 (LLM) 生成合成数据的流程，以提高抑郁症预测模型的性能。从临床访谈记录的非结构化、自然文本数据开始，我们利用开源 LLM 通过思路链提示生成合成数据。该流程涉及两个关键步骤：第一步是基于原始记录和抑郁评分生成概要和情绪分析，第二步是基于第一步生成的摘要和新的抑郁评分生成合成概要/情绪分析。合成数据不仅在保真度和隐私保护指标方面令人满意，而且还平衡了训练数据集中严重程度的分布，从而显著增强了模型预测患者抑郁强度的能力。通过利用 LLM 生成可以扩充到有限且不平衡的现实世界数据集的合成数据，我们展示了一种解决自动抑郁症检测中常见的数据稀缺和隐私问题的新方法，同时保持了原始数据集的统计完整性。这种方法为未来的心理健康研究和应用提供了一个强大的框架。

Title: SketchAgent: Language-Driven Sequential Sketch Generation

Authors: Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17673
Pdf URL: https://arxiv.org/pdf/2411.17673
Copy Paste: [[2411.17673]] SketchAgent: Language-Driven Sequential Sketch Generation(https://arxiv.org/abs/2411.17673)
Keywords: generation
Abstract: Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.
摘要：素描是一种多功能工具，可用于将想法外化，实现跨学科的快速探索和视觉交流。虽然人工智能系统推动了内容创作和人机交互的重大进步，但捕捉人类素描的动态和抽象性质仍然具有挑战性。在这项工作中，我们引入了 SketchAgent，这是一种语言驱动的顺序素描生成方法，使用户能够通过动态对话交互创建、修改和优化素描。我们的方法不需要训练或微调。相反，我们利用现成的多模态大型语言模型 (LLM) 的顺序性和丰富的先验知识。我们提出了一种直观的素描语言，通过上下文示例引入模型，使其能够使用基于字符串的操作“绘制”。这些被处理成矢量图形，然后渲染以在像素画布上创建素描，可以再次访问该素描以执行进一步的任务。通过一笔一划地绘制，我们的代理捕捉到了素描固有的不断发展的动态品质。我们证明了 SketchAgent 可以根据不同的提示生成素描，参与对话驱动的绘图，并与人类用户进行有意义的合作。
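
Note (illustrative sketch, not from the paper): the abstract describes string-based drawing actions that are processed into vector graphics. The snippet below shows one way such a step could look, turning stroke strings into an SVG document. The token format (`x10y40` naming a cell on a numbered grid), the function name `strokes_to_svg`, and the grid and canvas sizes are hypothetical and are not claimed to match the paper's actual sketching language.

```python
def strokes_to_svg(strokes, grid=50, canvas=400):
    """Convert strokes such as 'x10y40 x25y10' into an SVG string.

    Each token names a cell on a grid x grid board (a hypothetical format);
    cell centres are scaled to a square canvas of `canvas` pixels.
    """
    cell = canvas / grid
    paths = []
    for stroke in strokes:
        points = []
        for token in stroke.split():
            x_str, y_str = token[1:].split("y")   # "x10y40" -> "10", "40"
            x, y = int(x_str), int(y_str)
            if not (1 <= x <= grid and 1 <= y <= grid):
                raise ValueError(f"cell out of range: {token}")
            # Use cell centres; flip y so that y=1 is the bottom row.
            points.append(((x - 0.5) * cell, canvas - (y - 0.5) * cell))
        d = "M " + " L ".join(f"{px:g} {py:g}" for px, py in points)
        paths.append(f'<path d="{d}" fill="none" stroke="black"/>')
    body = "\n  ".join(paths)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{canvas}" height="{canvas}">\n  {body}\n</svg>')

# A triangle drawn as one closed stroke, followed by a separate base line.
svg = strokes_to_svg(["x10y10 x25y40 x40y10 x10y10", "x5y5 x45y5"])
```

Because each stroke is a separate path, a stroke-by-stroke drawing order is preserved in the output, and the resulting SVG can be rasterised by any standard renderer.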

Title: GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration

Authors: Sudarshan Rajagopalan, Nithin Gopalakrishnan Nair, Jay N. Paranjape, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17687
Pdf URL: https://arxiv.org/pdf/2411.17687
Copy Paste: [[2411.17687]] GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration(https://arxiv.org/abs/2411.17687)
Keywords: restoration, generative
Abstract: Deep learning-based models for All-In-One Image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to those trained solely on existing datasets. Furthermore, we provide comprehensive analyses on the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.
摘要：近年来，基于深度学习的一体化图像恢复 (AIOR) 模型取得了重大进展。然而，它们的实际适用性受到对训练分布之外的样本的泛化能力较差的限制。这种限制主要源于现有数据集中退化变化和场景的多样性不足，导致无法充分表示真实世界场景。此外，捕获雾霾、低光和雨滴等退化的大规模真实世界配对数据通常很麻烦，有时甚至不可行。在本文中，我们利用潜在扩散模型的生成能力从干净的图像合成高质量的退化图像。具体来说，我们引入了 GenDeg，这是一种退化和强度感知的条件扩散模型，能够在干净的图像上产生不同的退化模式。使用 GenDeg，我们合成了六种退化类型的 550k 多个样本：雾霾、雨、雪、运动模糊、低光和雨滴。这些生成的样本与现有数据集集成，形成包含 750k 多个样本的 GenDS 数据集。我们的实验表明，与仅在现有数据集上训练的模型相比，在 GenDS 数据集上训练的图像恢复模型在分布外性能方面表现出显著的改进。此外，我们还对基于扩散模型的合成退化对 AIOR 的影响进行了全面的分析。代码将公开发布。

Title: ScribbleLight: Single Image Indoor Relighting with Scribbles

Authors: Jun Myeong Choi, Annie Wang, Pieter Peers, Anand Bhattad, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.17696
Pdf URL: https://arxiv.org/pdf/2411.17696
Copy Paste: [[2411.17696]] ScribbleLight: Single Image Indoor Relighting with Scribbles(https://arxiv.org/abs/2411.17696)
Keywords: generative
Abstract: Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (e.g., turning lights on/off, adding highlights, cast shadows, or indirect lighting from unseen lights) from sparse scribble annotations.
摘要：基于图像的室内房间重新照明可创建沉浸式虚拟空间理解，这对于室内设计、虚拟舞台和房地产非常有用。由于多盏灯和杂乱物体之间的复杂照明相互作用以及几何和材料复杂性的多样性，从单个图像重新照明室内房间尤其具有挑战性。最近，生成模型已成功应用于以目标图像或潜在代码为条件的基于图像的重新照明，尽管没有详细的局部照明控制。在本文中，我们介绍了 ScribbleLight，这是一种生成模型，它通过描述照明变化的涂鸦来支持对照明效果的局部细粒度控制。我们的关键技术创新是反照率条件的稳定图像扩散模型，该模型在重新照明后保留了原始图像的固有颜色和纹理，以及基于编码器-解码器的 ControlNet 架构，该架构可通过法线贴图和涂鸦注释实现几何保留的照明效果。我们展示了 ScribbleLight 从稀疏涂鸦注释中创建不同灯光效果的能力（例如，打开/关闭灯光、添加高光、投射阴影或来自看不见的灯光的间接照明）。

Title: Video-Guided Foley Sound Generation with Multimodal Controls

Authors: Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.17698
Pdf URL: https://arxiv.org/pdf/2411.17698
Copy Paste: [[2411.17698]] Video-Guided Foley Sound Generation with Multimodal Controls(https://arxiv.org/abs/2411.17698)
Keywords: generation
Abstract: Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: this https URL
摘要：为视频生成音效通常需要创建与现实生活音源有很大差异的艺术音效，并在声音设计中进行灵活控制。为了解决这个问题，我们引入了 MultiFoley，这是一种专为视频引导声音生成而设计的模型，支持通过文本、音频和视频进行多模态调节。给定一个无声视频和一个文本提示，MultiFoley 允许用户创建干净的声音（例如，滑板轮旋转时没有风噪）或更异想天开的声音（例如，让狮子的吼叫听起来像猫的喵喵叫）。MultiFoley 还允许用户从音效 (SFX) 库或部分视频中选择参考音频进行调节。我们模型的一个关键创新之处在于它在具有低质量音频和专业 SFX 录音的互联网视频数据集上进行联合训练，从而实现高质量、全带宽 (48kHz) 的音频生成。通过自动评估和人工研究，我们证明 MultiFoley 成功地在各种条件输入中生成同步的高质量声音，并且优于现有方法。请参阅我们的项目页面以获取视频结果：此 https URL