2025-03-03

Title: EgoNormia: Benchmarking Physical Social Norm Understanding

Authors: MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.20490
Pdf URL: https://arxiv.org/pdf/2502.20490
Copy Paste: [[2502.20490]] EgoNormia: Benchmarking Physical Social Norm Understanding(https://arxiv.org/abs/2502.20490)
Keywords: generation
Abstract: Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-off between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\|\epsilon\|$, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human bench of 92%). Our analysis of performance in each dimension highlights the significant risks of safety, privacy, and the lack of collaboration and communication capability when applied to real-world agents. We additionally show that through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.

Title: Unified Kernel-Segregated Transpose Convolution Operation

Authors: Vijay Srinivas Tida, Md Imran Hossen, Liqun Shan, Sai Venkatesh Chilukoti, Sonya Hsu, Xiali Hei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20493
Pdf URL: https://arxiv.org/pdf/2502.20493
Copy Paste: [[2502.20493]] Unified Kernel-Segregated Transpose Convolution Operation(https://arxiv.org/abs/2502.20493)
Keywords: generative
Abstract: The optimization of the transpose convolution layer for deep learning applications is achieved with the kernel segregation mechanism. However, kernel segregation has disadvantages, such as computing extra elements to obtain the output feature map with odd dimensions while launching a thread. To mitigate this problem, we introduce a unified kernel segregation approach that limits the usage of memory and computational resources by employing one unified kernel to execute four sub-kernels. The findings reveal that the suggested approach achieves an average computational speedup of 2.03x (3.89x) when tested on specific datasets with an RTX 2070 GPU (Intel Xeon CPU). The ablation study shows an average computational speedup of 3.5x when evaluating the transpose convolution layers from well-known Generative Adversarial Networks (GANs). The implementation of the proposed method for the transpose convolution layers in the EB-GAN model demonstrates significant memory savings of up to 35 MB.
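As a rough illustration of the idea behind kernel segregation (a pure-Python sketch for stride 2 with no padding, not the paper's GPU implementation), the snippet below compares a reference transposed convolution with a single "unified" routine in which each output pixel reads only the sub-kernel that matches its row and column parity. The function names and the stride-2, no-padding setting are assumptions made for this example.

```python
def transpose_conv2d(x, k, stride=2):
    """Reference scatter-style transposed convolution (no padding)."""
    h, w, n = len(x), len(x[0]), len(k)
    oh, ow = (h - 1) * stride + n, (w - 1) * stride + n
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(h):
        for j in range(w):
            for u in range(n):
                for v in range(n):
                    out[i * stride + u][j * stride + v] += x[i][j] * k[u][v]
    return out


def unified_segregated(x, k):
    """Gather-style stride-2 version: one routine covers all four sub-kernels.

    Output pixel (p, q) only receives kernel entries (u, v) with
    u % 2 == p % 2 and v % 2 == q % 2, so the kernel splits into four
    parity-indexed sub-kernels and no zero-inserted input is needed.
    """
    h, w, n = len(x), len(x[0]), len(k)
    oh, ow = (h - 1) * 2 + n, (w - 1) * 2 + n
    out = [[0.0] * ow for _ in range(oh)]
    for p in range(oh):
        for q in range(ow):
            acc = 0.0
            for u in range(p % 2, n, 2):      # rows of the matching sub-kernel
                i = (p - u) // 2              # input row feeding this output row
                if not 0 <= i < h:
                    continue
                for v in range(q % 2, n, 2):  # columns of the matching sub-kernel
                    j = (q - v) // 2
                    if 0 <= j < w:
                        acc += x[i][j] * k[u][v]
            out[p][q] = acc
    return out


x = [[1, 2], [3, 4]]
k = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(unified_segregated(x, k) == transpose_conv2d(x, k))
```

The 2x2 input and 3x3 kernel give a 5x5 output, so the example also covers the odd-dimension case the abstract mentions: the parity-based indexing computes exactly the required output elements and no extra ones.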

Title: CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding

Authors: Yixiong Chen, Shawn Xu, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Shravya Shetty, Daniel Golden, Alan Yuille, Lin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20509
Pdf URL: https://arxiv.org/pdf/2502.20509
Copy Paste: [[2502.20509]] CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding(https://arxiv.org/abs/2502.20509)
Keywords: generation
Abstract: Vision-language models have proven to be of great benefit for medical image analysis since they learn rich semantics from both images and reports. Prior efforts have focused on better alignment of image and text representations to enhance image understanding. However, though explicit reference to a prior image is common in Chest X-Ray (CXR) reports, aligning progression descriptions with the semantic differences in image pairs remains under-explored. In this work, we propose two components to address this issue. (1) A CXR report processing pipeline to extract temporal structure. It processes reports with a large language model (LLM) to separate the description and comparison contexts, and extracts fine-grained annotations from reports. (2) A contrastive captioner model for CXR, namely CoCa-CXR, to learn how to describe both images and their temporal progressions. CoCa-CXR incorporates a novel regional cross-attention module to identify local differences between paired CXR images. Extensive experiments show the superiority of CoCa-CXR on both progression analysis and report generation compared to previous methods. Notably, on MS-CXR-T progression classification, CoCa-CXR obtains 65.0% average testing accuracy on five pulmonary conditions, outperforming the previous state-of-the-art (SOTA) model BioViL-T by 4.8%. It also achieves a RadGraph F1 of 24.2% on MIMIC-CXR, which is comparable to the Med-Gemini foundation model.

Title: Towards Statistical Factuality Guarantee for Large Vision-Language Models

Authors: Zhuohang Li, Chao Yan, Nicholas J. Jackson, Wendi Cui, Bo Li, Jiaxin Zhang, Bradley A. Malin
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.20560
Pdf URL: https://arxiv.org/pdf/2502.20560
Copy Paste: [[2502.20560]] Towards Statistical Factuality Guarantee for Large Vision-Language Models(https://arxiv.org/abs/2502.20560)
Keywords: generation
Abstract: Advancements in Large Vision-Language Models (LVLMs) have demonstrated promising performance in a variety of vision-language tasks involving image-conditioned free-form text generation. However, growing concerns about hallucinations in LVLMs, where the generated text is inconsistent with the visual context, are becoming a major impediment to deploying these models in applications that demand guaranteed reliability. In this paper, we introduce a framework to address this challenge, ConfLVLM, which is grounded on conformal prediction to achieve finite-sample distribution-free statistical guarantees on the factuality of LVLM output. This framework treats an LVLM as a hypothesis generator, where each generated text detail (or claim) is considered an individual hypothesis. It then applies a statistical hypothesis testing procedure to verify each claim using efficient heuristic uncertainty measures to filter out unreliable claims before returning any responses to users. We conduct extensive experiments covering three representative application domains, including general scene understanding, medical radiology report generation, and document understanding. Remarkably, ConfLVLM reduces the error rate of claims generated by LLaVa-1.5 for scene descriptions from 87.8% to 10.0% by filtering out erroneous claims with a 95.3% true positive rate. Our results further demonstrate that ConfLVLM is highly flexible, and can be applied to any black-box LVLMs paired with any uncertainty measure for any image-conditioned free-form text generation task while providing a rigorous guarantee on controlling the risk of hallucination.
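To make the claim-filtering idea concrete, here is a minimal sketch of a generic split-conformal filter. It is not ConfLVLM's exact procedure, which the abstract does not specify. It assumes a calibration set of responses whose claims carry a confidence score and a true/false label; `calibrate_threshold`, `filter_claims`, and the data layout are hypothetical names chosen for this illustration.

```python
import math


def calibrate_threshold(calibration, alpha):
    """Pick a confidence threshold from labeled calibration responses.

    calibration: list of responses, each a list of (confidence, is_true) claims.
    For each response, the smallest threshold that removes every false claim
    is the highest confidence given to a false claim (-inf if there is none).
    The conformal quantile of these values is returned.
    """
    scores = sorted(
        max((conf for conf, ok in resp if not ok), default=float("-inf"))
        for resp in calibration
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration samples: keep nothing
    return scores[rank - 1]


def filter_claims(claims, threshold):
    """Keep only claims whose confidence strictly exceeds the threshold."""
    return [text for text, conf in claims if conf > threshold]


calibration = [
    [(0.9, False), (0.95, True)],  # a confident false claim
    [(0.3, False), (0.8, True)],
    [(0.7, True)],                 # no false claims at all
]
threshold = calibrate_threshold(calibration, alpha=0.5)
print(threshold)
print(filter_claims([("a dog on a sofa", 0.95), ("a red ball", 0.3)], threshold))
```

Under the usual exchangeability assumption, a threshold chosen this way removes all false claims from a new response with probability at least 1 - alpha. The uncertainty measure that produces the confidence scores is left open, in line with the abstract's statement that any such measure can be plugged in.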

Title: LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks

Authors: Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20562
Pdf URL: https://arxiv.org/pdf/2502.20562
Copy Paste: [[2502.20562]] LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks(https://arxiv.org/abs/2502.20562)
Keywords: generative
Abstract: State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at this https URL.
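The cross-correlation objective mentioned in the abstract can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the authors' code: it assumes the target diagonal matrix is the identity, that embeddings are standardized per dimension across the batch, and that off-diagonal terms are down-weighted by a hypothetical `off_diag_weight`. The classification loss that the paper trains simultaneously is omitted.

```python
import math


def standardize(batch):
    """Return per-dimension columns with zero mean and unit variance."""
    n, d = len(batch), len(batch[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        cols.append([(v - mean) / std for v in col])  # assumes std > 0
    return cols


def cross_correlation_loss(clean, perturbed, off_diag_weight=0.5):
    """Penalize the distance between the cross-correlation matrix and identity."""
    a, b = standardize(clean), standardize(perturbed)
    n, d = len(clean), len(a)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[i][k] * b[j][k] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2          # pull diagonal toward 1
            else:
                loss += off_diag_weight * c_ij ** 2  # push off-diagonal toward 0
    return loss


clean = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
print(cross_correlation_loss(clean, clean))
```

When the perturbed embeddings match the clean ones and the dimensions are uncorrelated, the loss is zero. Any mismatch between the two views, or redundancy across dimensions, increases it.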

Title: InstaFace: Identity-Preserving Facial Editing with Single Image Inference

Authors: MD Wahiduzzaman Khan, Mingshan Jia, Shaolin Zhang, En Yu, Kaska Musial-Gabrys
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20577
Pdf URL: https://arxiv.org/pdf/2502.20577
Copy Paste: [[2502.20577]] InstaFace: Identity-Preserving Facial Editing with Single Image Inference(https://arxiv.org/abs/2502.20577)
Keywords: generative
Abstract: Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially in scenarios with limited data availability. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing effects. To overcome these challenges, we introduce a novel diffusion-based framework, InstaFace, to generate realistic images while preserving identity using only a single image. Central to InstaFace, we introduce an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to ensure maximum identity retention as well as preservation of background, hair, and other contextual features like accessories, we introduce a novel module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.

Title: RTGen: Real-Time Generative Detection Transformer

Authors: Chi Ruan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20622
Pdf URL: https://arxiv.org/pdf/2502.20622
Copy Paste: [[2502.20622]] RTGen: Real-Time Generative Detection Transformer(https://arxiv.org/abs/2502.20622)
Keywords: generation, generative
Abstract: While open-vocabulary object detectors require predefined categories during inference, generative object detectors overcome this limitation by endowing the model with text generation capabilities. However, existing generative object detection methods directly append an autoregressive language model to an object detector to generate texts for each detected object. This straightforward design leads to structural redundancy and increased processing time. In this paper, we propose a Real-Time GENerative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder), which innovatively integrates a non-autoregressive language model into the detection decoder, enabling concurrent processing of object and text information. With these efficient designs, RTGen achieves a remarkable inference speed of 60.41 FPS. Moreover, RTGen obtains 18.6 mAP on the LVIS dataset, outperforming the previous SOTA method by 3.5 mAP.

Title: Are LLMs Ready for Practical Adoption for Assertion Generation?

Authors: Vaishnavi Pulavarthi, Deeksha Nandal, Soham Dan, Debjit Pal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.20633
Pdf URL: https://arxiv.org/pdf/2502.20633
Copy Paste: [[2502.20633]] Are LLMs Ready for Practical Adoption for Assertion Generation?(https://arxiv.org/abs/2502.20633)
Keywords: generation, generative
Abstract: Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. With the onset of generative AI such as Transformers and Large-Language Models (LLMs), there has been a renewed interest in developing novel, effective, and scalable techniques of generating functional and security assertions from design source code. While there have been recent works that use commercial-off-the-shelf (COTS) LLMs for assertion generation, there is no comprehensive study in quantifying the effectiveness of LLMs in generating syntactically and semantically correct assertions. In this paper, we first discuss AssertionBench from our prior work, a comprehensive set of designs and assertions to quantify the goodness of a broad spectrum of COTS LLMs for the task of assertion generation from hardware design source code. Our key insight was that COTS LLMs are not yet ready for prime-time adoption for assertion generation as they generate a considerable fraction of syntactically and semantically incorrect assertions. Motivated by the insight, we propose AssertionLLM, a first-of-its-kind LLM model, specifically fine-tuned for assertion generation. Our initial experimental results show that AssertionLLM considerably improves the semantic and syntactic correctness of the generated assertions over COTS LLMs.

Title: Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models

Authors: Yu Pan, Bingrong Dai, Jiahao Chen, Lin Wang, Yi Du, Jiao Liu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2502.20650
Pdf URL: https://arxiv.org/pdf/2502.20650
Copy Paste: [[2502.20650]] Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models(https://arxiv.org/abs/2502.20650)
Keywords: generation
Abstract: In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model's output by inputting data containing covert triggers, such as a specific patch or phrase. Existing defense strategies are well equipped to thwart such attacks through backdoor detection and trigger inversion because previous attack methods are constrained by limited input spaces and triggers defined by low-dimensional features. To bridge these gaps, we propose Gungnir, a novel method that enables attackers to activate the backdoor in DMs through hidden style triggers within input images. Our approach proposes using stylistic features as triggers for the first time and implements backdoor attacks successfully in image2image tasks by utilizing Reconstructing-Adversarial Noise (RAN) and Short-Term-Timesteps-Retention (STTR) of DMs. Meanwhile, experiments demonstrate that our method can easily bypass existing defense methods. Against the main existing backdoor defense frameworks for DMs, our approach achieves a 0% backdoor detection rate (BDR). Our code is available at this https URL.

Title: Advancing AI-Powered Medical Image Synthesis: Insights from MedVQA-GI Challenge Using CLIP, Fine-Tuned Stable Diffusion, and Dream-Booth + LoRA

Authors: Ojonugwa Oluwafemi Ejiga Peter, Md Mahmudur Rahman, Fahmi Khalifa
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.20667
Pdf URL: https://arxiv.org/pdf/2502.20667
Copy Paste: [[2502.20667]] Advancing AI-Powered Medical Image Synthesis: Insights from MedVQA-GI Challenge Using CLIP, Fine-Tuned Stable Diffusion, and Dream-Booth + LoRA(https://arxiv.org/abs/2502.20667)
Keywords: generation, generative
Abstract: The MedVQA-GI challenge addresses the integration of AI-driven text-to-image generative models in medical diagnostics, aiming to enhance diagnostic capabilities through synthetic image generation. Existing methods primarily focus on static image analysis and lack the dynamic generation of medical imagery from textual descriptions. This study intends to partially close this gap by introducing a novel approach based on fine-tuned generative models to generate dynamic, scalable, and precise images from textual descriptions. Particularly, our system integrates fine-tuned Stable Diffusion and DreamBooth models, as well as Low-Rank Adaptation (LoRA), to generate high-fidelity medical images. The problem comprises two sub-tasks: image synthesis (IS) and optimal prompt production (OPG). The former creates medical images via verbal prompts, whereas the latter provides prompts that produce high-quality images in specified categories. The study emphasizes the limitations of traditional medical image generation methods, such as hand sketching, constrained datasets, static procedures, and generic models. Our evaluation measures showed that Stable Diffusion surpasses CLIP and DreamBooth + LoRA in terms of producing high-quality, diversified images. Specifically, Stable Diffusion had the lowest Fréchet Inception Distance (FID) scores (0.099 for single center, 0.064 for multi-center, and 0.067 for combined), indicating higher image quality. Furthermore, it had the highest average Inception Score (2.327 across all datasets), indicating exceptional diversity and quality. This advances the field of AI-powered medical diagnosis. Future research will concentrate on model refining, dataset augmentation, and ethical considerations for efficiently implementing these advances into clinical practice.
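For readers unfamiliar with the two metrics quoted above, their standard definitions are given below. The notation is an assumption of this note rather than the paper's: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, $p(y \mid x)$ is the Inception classifier's label distribution for a generated image $x$, and $p(y)$ is its average over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}
  \left[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \right] \right)
```

Lower FID means the generated feature distribution is closer to the real one, while a higher Inception Score reflects predictions that are confident for each image and varied across images.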

Title: Diffusion Restoration Adapter for Real-World Image Restoration

Authors: Hanbang Liang, Zhen Wang, Weihui Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20679
Pdf URL: https://arxiv.org/pdf/2502.20679
Copy Paste: [[2502.20679]] Diffusion Restoration Adapter for Real-World Image Restoration(https://arxiv.org/abs/2502.20679)
Keywords: restoration, generation, generative
Abstract: Diffusion models have demonstrated their powerful image generation capabilities, effectively fitting highly complex image distributions. These models can serve as strong priors for image restoration. Existing methods often utilize techniques like ControlNet to sample high-quality images from these priors, conditioned on low-quality images. However, ControlNet typically involves copying a large part of the original network, resulting in a significantly large number of parameters as the prior scales up. In this paper, we propose a relatively lightweight Adapter that leverages the powerful generative capabilities of pretrained priors to achieve photo-realistic image restoration. The Adapter can be adapted to both denoising UNet and DiT, and performs excellently.

Title: WorldModelBench: Judging Video Generation Models As World Models

Authors: Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, Yao Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20694
Pdf URL: https://arxiv.org/pdf/2502.20694
Copy Paste: [[2502.20694]] WorldModelBench: Judging Video Generation Models As World Models(https://arxiv.org/abs/2502.20694)
Keywords: generation
Abstract: Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality, ignoring important factors to world models such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitive to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law - issues overlooked by prior benchmarks. (2) Aligned with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, which, with 2B parameters, achieves 8.6% higher average accuracy in predicting world modeling violations than GPT-4o. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves the world modeling capability. The website is available at this https URL.

Title: Towards General Visual-Linguistic Face Forgery Detection(V2)

Authors: Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20698
Pdf URL: https://arxiv.org/pdf/2502.20698
Copy Paste: [[2502.20698]] Towards General Visual-Linguistic Face Forgery Detection(V2)(https://arxiv.org/abs/2502.20698)
Keywords: generation
Abstract: Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy to guide MLLMs in reducing hallucination. We validate our approach by fine-tuning both CLIP, with a three-branch training framework combining unimodal and multimodal objectives, and MLLMs, with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also leads to improvements in model performance across various forgery detection benchmarks. Our code is available at this https URL.

Title: Generating Clinically Realistic EHR Data via a Hierarchy- and Semantics-Guided Transformer

Authors: Guanglin Zhou, Sebastiano Barbieri
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20719
Pdf URL: https://arxiv.org/pdf/2502.20719
Copy Paste: [[2502.20719]] Generating Clinically Realistic EHR Data via a Hierarchy- and Semantics-Guided Transformer(https://arxiv.org/abs/2502.20719)
Keywords: generation, generative
Abstract: Generating realistic synthetic electronic health records (EHRs) holds tremendous promise for accelerating healthcare research, facilitating AI model development and enhancing patient privacy. However, existing generative methods typically treat EHRs as flat sequences of discrete medical codes. This approach overlooks two critical aspects: the inherent hierarchical organization of clinical coding systems and the rich semantic context provided by code descriptions. Consequently, synthetic patient sequences often lack high clinical fidelity and have limited utility in downstream clinical tasks. In this paper, we propose the Hierarchy- and Semantics-Guided Transformer (HiSGT), a novel framework that leverages both hierarchical and semantic information for the generative process. HiSGT constructs a hierarchical graph to encode parent-child and sibling relationships among clinical codes and employs a graph neural network to derive hierarchy-aware embeddings. These are then fused with semantic embeddings extracted from a pre-trained clinical language model (e.g., ClinicalBERT), enabling the Transformer-based generator to more accurately model the nuanced clinical patterns inherent in real EHRs. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that HiSGT significantly improves the statistical alignment of synthetic data with real patient records and supports robust downstream applications such as chronic disease classification. By addressing the limitations of conventional raw code-based generative models, HiSGT represents a significant step toward clinically high-fidelity synthetic data generation and a general paradigm suitable for interpretable medical code representation, offering valuable applications in data augmentation and privacy-preserving healthcare analytics.
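The hierarchical graph described in the abstract can be pictured with a small sketch. The code below is only an illustration, assuming the hierarchy is available as a child-to-parent mapping; `build_code_graph` and the placeholder code names are hypothetical and do not come from the paper or from any real coding system.

```python
from itertools import combinations


def build_code_graph(parent_of):
    """Derive parent-child and sibling edges from a child -> parent mapping."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    # Directed edges from each parent to each of its children.
    parent_child = {(parent, child) for child, parent in parent_of.items()}
    # Undirected sibling edges between codes sharing a parent (stored sorted).
    sibling = set()
    for kids in children.values():
        sibling.update(combinations(sorted(kids), 2))
    return parent_child, sibling


# Placeholder hierarchy: two codes under group "G1", one under group "G2".
parent_of = {"G1.a": "G1", "G1.b": "G1", "G2.a": "G2"}
parent_child, sibling = build_code_graph(parent_of)
print(sorted(parent_child))
print(sorted(sibling))
```

Edge sets like these would be the input to a graph neural network that produces hierarchy-aware embeddings. The fusion with language-model embeddings is outside the scope of this sketch.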

Title: CADDreamer: CAD object Generation from Single-view Images

Authors: Yuan Li, Cheng Lin, Yuan Liu, Xiaoxiao Long, Chenxu Zhang, Ningna Wang, Xin Li, Wenping Wang, Xiaohu Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20732
Pdf URL: https://arxiv.org/pdf/2502.20732
Copy Paste: [[2502.20732]] CADDreamer: CAD object Generation from Single-view Images(https://arxiv.org/abs/2502.20732)
Keywords: generation, generative
Abstract: Diffusion-based 3D generation has made remarkable progress in recent years. However, existing 3D generative models often produce overly dense and unstructured meshes, which stand in stark contrast to the compact, structured, and sharply-edged Computer-Aided Design (CAD) models crafted by human designers. To address this gap, we introduce CADDreamer, a novel approach for generating boundary representations (B-rep) of CAD objects from a single image. CADDreamer employs a primitive-aware multi-view diffusion model that captures both local geometric details and high-level structural semantics during the generation process. By encoding primitive semantics into the color domain, the method leverages the strong priors of pre-trained diffusion models to align with well-defined primitives. This enables the inference of multi-view normal maps and semantic maps from a single image, facilitating the reconstruction of a mesh with primitive labels. Furthermore, we introduce geometric optimization techniques and topology-preserving extraction methods to mitigate noise and distortion in the generated primitives. These enhancements result in a complete and seamless B-rep of the CAD model. Experimental results demonstrate that our method effectively recovers high-quality CAD objects from single-view images. Compared to existing 3D generation techniques, the B-rep models produced by CADDreamer are compact in representation, clear in structure, sharp in edges, and watertight in topology.
摘要：近年来，基于扩散的3D一代取得了显着的进展。但是，现有的3D生成模型通常会产生过度致密和非结构化的网格，与人类设计师制作的紧凑，结构化和尖锐的计算机辅助设计（CAD）模型形成鲜明对比。为了解决这一差距，我们介绍了Caddreamer，这是一种从单个图像中生成CAD对象的边界表示（B-REP）的新方法。 Caddreamer采用原始感知的多视图扩散模型，该模型在生成过程中捕获了局部几何细节和高级结构语义。通过将原始语义编码到颜色域中，该方法利用了预训练的扩散模型的强烈先验，以与定义明确的原始物保持一致。这可以从单个图像中推断多视图正常地图和语义图，从而促进了带有原始标签的网格的重建。此外，我们介绍了几何优化技术和拓扑的提取方法，以减轻生成的原始物质中的噪声和失真。这些增强功能导致CAD模型的完整和无缝B-REP。实验结果表明，我们的方法有效地从单视图像中恢复了高质量的CAD对象。与现有的3D生成技术相比，Caddreamer产生的B-REP模型在表示方面是紧凑的，结构清晰，边缘清晰，拓扑中的水密性。

Title: Two-Stream Spatial-Temporal Transformer Framework for Person Identification via Natural Conversational Keypoints

Authors: Masoumeh Chapariniya, Hossein Ranjbar, Teodora Vukovic, Sarah Ebling, Volker Dellwo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20803
Pdf URL: https://arxiv.org/pdf/2502.20803
Copy Paste: [[2502.20803]] Two-Stream Spatial-Temporal Transformer Framework for Person Identification via Natural Conversational Keypoints(https://arxiv.org/abs/2502.20803)
Keywords: generative
Abstract: In the age of AI-driven generative technologies, traditional biometric recognition systems face unprecedented challenges, particularly from sophisticated deepfake and face reenactment techniques. In this study, we propose a Two-Stream Spatial-Temporal Transformer Framework for person identification using upper body keypoints visible during online conversations, which we term conversational keypoints. Our framework processes both spatial relationships between keypoints and their temporal evolution through two specialized branches: a Spatial Transformer (STR) that learns distinctive structural patterns in keypoint configurations, and a Temporal Transformer (TTR) that captures sequential motion patterns. Using the state-of-the-art Sapiens pose estimator, we extract 133 keypoints (based on COCO-WholeBody format) representing facial features, head pose, and hand positions. The framework was evaluated on a dataset of 114 individuals engaged in natural conversations, achieving recognition accuracies of 80.12% for the spatial stream and 63.61% for the temporal stream. We then explored two fusion strategies: a shared loss function approach achieving 82.22% accuracy, and a feature-level fusion method that concatenates feature maps from both streams, significantly improving performance to 94.86%. By jointly modeling both static anatomical relationships and dynamic movement patterns, our approach learns comprehensive identity signatures that are more robust to spoofing than traditional appearance-based methods.
摘要：在AI驱动的生成技术时代，传统的生物特征识别系统面临着前所未有的挑战，尤其是从精致的深层效果和面部重演技术中。在这项研究中，我们建议使用在线对话期间可见的上身关键点可见的两流时空变压器框架，用于人体识别，我们将其称为“对话”关键。我们的框架通过两个专业分支来处理关键点及其时间演化之间的空间关系：一个空间变压器（STR），在关键点配置中学习独特的结构模式，以及捕获顺序运动模式的时间变压器（TTR）。使用最先进的Sapiens姿势估计量，我们提取133个键盘（基于可可全体格式），代表面部特征，头部姿势和手部位置。该框架是在从事自然对话的114个人的数据集上评估的，空间流的识别精度为80.12％，时间流的识别精度为63.61％。然后，我们探索了两种融合策略：一种共享损失函数方法，达到82.22％的精度，以及一种功能级融合方法，该方法将两种流的特征地图连接起来，从而将性能显着提高到94.86％。通过共同对静态解剖关系和动态运动模式进行共同建模，我们的方法学习了与传统基于外观的方法更适合欺骗的全面身份签名。
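The feature-level fusion described in the abstract above (concatenating feature maps from both streams before classification) can be sketched in a few lines. This is an illustration under stated assumptions: the embedding sizes, the linear classification head, and all numeric values are hypothetical, and the function names are not from the paper.

```python
def fuse_features(spatial_feat, temporal_feat):
    """Feature-level fusion: concatenate the two stream embeddings."""
    return list(spatial_feat) + list(temporal_feat)


def linear_scores(features, weights, biases):
    """One score per identity: dot(weights[k], features) + biases[k]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def predict_identity(spatial_feat, temporal_feat, weights, biases):
    """Fuse both streams, score each identity, return the best index."""
    fused = fuse_features(spatial_feat, temporal_feat)
    scores = linear_scores(fused, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 2-d spatial and 3-d temporal embeddings for one clip.
spatial = [0.2, 0.9]
temporal = [0.7, 0.1, 0.4]

# Hypothetical classifier head over the 5-d fused vector, 2 identities.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 1.0],
]
biases = [0.0, 0.0]

fused = fuse_features(spatial, temporal)
pred = predict_identity(spatial, temporal, weights, biases)
```

Because the classifier sees both embeddings at once, it can weight spatial and temporal evidence jointly, which is the property that distinguishes this strategy from combining the two streams only through a shared loss.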

Title: HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

Authors: Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2502.20811
Pdf URL: https://arxiv.org/pdf/2502.20811
Copy Paste: [[2502.20811]] HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models(https://arxiv.org/abs/2502.20811)
Keywords: generation
Abstract: Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both HAICTrain and HAICBench are released at this https URL.
摘要：最近的多模式大型语言模型（MLLM）在视频理解方面取得了长足的进步。但是，他们在涉及人类行为的视频上的表现仍然受到缺乏高质量数据的限制。为了解决这个问题，我们引入了两阶段数据注释管道。首先，我们设计策略来积累以互联网中清晰的人为行动为特色的视频。其次，以标准化的标题格式注释视频，该格式使用人类属性来区分个人，并按时间顺序详细介绍其行为和互动。通过这条管道，我们策划了两个数据集，即Haictrain和Haicbench。 \ textbf {haictrain}包括Gemini-Pro生成的126K视频捕获对，并用于培训目的。同时，\ textbf {haicbench}包括500个手动注释的视频捕获对和1,400 QA对，以全面评估人类的行动理解。实验结果表明，海学培训不仅可以显着提高4个基准的人类理解能力，而且还可以改善文本到视频的生成结果。 HACTRAIN和HAICBENCH均在此HTTPS URL上释放。

Title: MFSR-GAN: Multi-Frame Super-Resolution with Handheld Motion Modeling

Authors: Fadeel Sher Khan, Joshua Ebenezer, Hamid Sheikh, Seok-Jun Lee
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2502.20824
Pdf URL: https://arxiv.org/pdf/2502.20824
Copy Paste: [[2502.20824]] MFSR-GAN: Multi-Frame Super-Resolution with Handheld Motion Modeling(https://arxiv.org/abs/2502.20824)
Keywords: super-resolution
Abstract: Smartphone cameras have become ubiquitous imaging tools, yet their small sensors and compact optics often limit spatial resolution and introduce distortions. Combining information from multiple low-resolution (LR) frames to produce a high-resolution (HR) image has been explored to overcome the inherent limitations of smartphone cameras. Despite the promise of multi-frame super-resolution (MFSR), current approaches are hindered by datasets that fail to capture the characteristic noise and motion patterns found in real-world handheld burst images. In this work, we address this gap by introducing a novel synthetic data engine that uses multi-exposure static images to synthesize LR-HR training pairs while preserving sensor-specific noise characteristics and image motion found during handheld burst photography. We also propose MFSR-GAN: a multi-scale RAW-to-RGB network for MFSR. Compared to prior approaches, MFSR-GAN emphasizes a "base frame" throughout its architecture to mitigate artifacts. Experimental results on both synthetic and real data demonstrate that MFSR-GAN trained with our synthetic engine yields sharper, more realistic reconstructions than existing methods for real-world MFSR.
摘要：智能手机摄像机已成为无处不在的成像工具，但是它们的小传感器和紧凑的光学元件通常会限制空间分辨率并引入失真。已经探索了来自多个低分辨率（LR）帧的信息来产生高分辨率（HR）图像，以克服智能手机摄像机的固有局限性。尽管有多帧超分辨率（MFSR）的承诺，但当前方法仍被数据集阻碍，这些数据集无法捕获现实世界手持式爆发图像中发现的特征噪声和运动模式。在这项工作中，我们通过引入一种新型的合成数据引擎来解决这一差距，该数据引擎使用多曝光静态图像合成LR-HR训练对，同时保留传感器特异性的噪声特性和在手持式爆发摄影过程中发现的图像运动。我们还提出了MFSR-GAN：用于MFSR的多尺度原始-RGB网络。与先前的方法相比，MFSR-GAN强调了整个架构的“基础框架”，以减轻人工制品。合成数据和实际数据的实验结果表明，接受我们合成发动机训练的MFSR-GAN比现有MFSR的现有方法更清晰，更现实的重建。

Title: LADs: Leveraging LLMs for AI-Driven DevOps

Authors: Ahmad Faraz Khan, Azal Ahmad Khan, Anas Mohamed, Haider Ali, Suchithra Moolinti, Sabaat Haroon, Usman Tahir, Mattia Fazzini, Ali R. Butt, Ali Anwar
Subjects: cs.LG, cs.AI, cs.DC, cs.SE
Abstract URL: https://arxiv.org/abs/2502.20825
Pdf URL: https://arxiv.org/pdf/2502.20825
Copy Paste: [[2502.20825]] LADs: Leveraging LLMs for AI-Driven DevOps(https://arxiv.org/abs/2502.20825)
Keywords: generation
Abstract: Automating cloud configuration and deployment remains a critical challenge due to evolving infrastructures, heterogeneous hardware, and fluctuating workloads. Existing solutions lack adaptability and require extensive manual tuning, leading to inefficiencies and misconfigurations. We introduce LADs, the first LLM-driven framework designed to tackle these challenges by ensuring robustness, adaptability, and efficiency in automated cloud management. Instead of merely applying existing techniques, LADs provides a principled approach to configuration optimization through in-depth analysis of what optimization works under which conditions. By leveraging Retrieval-Augmented Generation, Few-Shot Learning, Chain-of-Thought, and Feedback-Based Prompt Chaining, LADs generates accurate configurations and learns from deployment failures to iteratively refine system settings. Our findings reveal key insights into the trade-offs between performance, cost, and scalability, helping practitioners determine the right strategies for different deployment scenarios. For instance, we demonstrate how prompt chaining-based adaptive feedback loops enhance fault tolerance in multi-tenant environments and how structured log analysis with example shots improves configuration accuracy. Through extensive evaluations, LADs reduces manual effort, optimizes resource utilization, and improves system reliability. By open-sourcing LADs, we aim to drive further innovation in AI-powered DevOps automation.
摘要：由于不断发展的基础架构，异质硬件和波动的工作负载，自动化云配置和部署仍然是一个关键的挑战。现有的解决方案缺乏适应性，需要大量的手动调整，从而导致效率低下和配置错误。我们介绍了LADS，这是第一个以LLM驱动的框架，旨在通过确保自动化云管理的鲁棒性，适应性和效率来应对这些挑战。 LADS不仅应用现有技术，还通过深入分析哪些条件下的优化起作用，提供了一种原则性的配置优化方法。通过利用检索效果的生成，很少的学习，经过思考链和基于反馈的及时链接，LADS生成了准确的配置，并从部署失败中学习到迭代地完善系统设置。我们的发现揭示了对绩效，成本和可扩展性之间权衡取舍的关键见解，帮助从业人员确定了不同部署方案的正确策略。例如，我们演示了迅速基于链接的自适应反馈回路如何在多租户环境中增强容错性以及示例拍摄的结构化日志分析如何提高配置精度。通过广泛的评估，LADS减少了手动努力，优化资源利用并提高系统可靠性。通过开源小伙子，我们旨在以AI驱动的DevOps自动化来推动进一步的创新。

Title: Oscillation-Reduced MXFP4 Training for Vision Transformers

Authors: Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.20853
Pdf URL: https://arxiv.org/pdf/2502.20853
Copy Paste: [[2502.20853]] Oscillation-Reduced MXFP4 Training for Vision Transformers(https://arxiv.org/abs/2502.20853)
Keywords: generation
Abstract: Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We reduce the accuracy degradation by more than 50% compared to the baseline, and can even achieve performance competitive with full-precision training. The codes are available at this https URL
摘要：FP4精度中的训练前变压器正在成为一种有前途的方法，以获得大幅加速，但准确性丧失。显微镜（MX）数据格式提供了一种细粒度的每组量化方法，以提高FP4格式的表示能力，并由下一代Blackwell GPU体系结构支持。但是，使用MXFP4数据格式的培训仍然会导致大量降级，并且缺乏对原因的系统研究。在这项工作中，我们提出了一种新型的训练方法，以进行更准确的FP4培训。我们全面评估了培训中涉及的所有量化器，并确定向前通行证中的重量振荡问题是MXFP4培训中退化的主要来源。因此，我们介绍了两种新型方法，即EMA量化器（Q-EMA）和自适应渐变优化器（Q-Ramping），以解决振荡问题。关于视觉变压器的广泛实验表明，四吉特始终胜过现有的4位训练方法，而Q-EMA和Q-Ramping可以通过有效降低振荡来提供额外的增强。与基线相比，我们将准确性降解量降低了$ 50 \％$，甚至可以与完整的精确培训相比，甚至可以实现竞争性能。这些代码可在此HTTPS URL上找到
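The weight oscillation problem named in the abstract above can be illustrated with a toy example: a latent weight that hovers near a rounding threshold flips between two quantized values at every step, while quantizing an exponential moving average (EMA) of the weight does not. This is only a sketch of the general idea. The integer grid standing in for FP4, the decay `beta`, and the update rule are assumptions, not the paper's Q-EMA definition.

```python
import math


def quantize(w, step=1.0):
    """Round to the nearest multiple of `step` (a stand-in for an FP4 grid)."""
    return step * math.floor(w / step + 0.5)


def count_flips(values):
    """Number of times consecutive quantized values differ."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


# Toy latent weight: mean 0.53 with a +/-0.05 wobble across the 0.5 threshold.
latent = [0.53 + (0.05 if t % 2 == 0 else -0.05) for t in range(20)]

# Quantizing the raw weight flips between 1.0 and 0.0 on every step.
direct = [quantize(w) for w in latent]

# Quantizing an EMA of the weight instead (assumed beta = 0.9, EMA
# initialized to the first sample) stays on one side of the threshold.
beta = 0.9
ema = latent[0]
smoothed = []
for w in latent:
    ema = beta * ema + (1.0 - beta) * w
    smoothed.append(quantize(ema))

direct_flips = count_flips(direct)
smoothed_flips = count_flips(smoothed)
```

The smoothing only helps when the weight's mean is clearly on one side of the threshold. If the mean sits exactly on it, the averaged value can still cross back and forth.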

Title: Adaptive Identification of Blurred Regions for Accurate Image Deblurring

Authors: Hu Gao, Depeng Dang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20880
Pdf URL: https://arxiv.org/pdf/2502.20880
Copy Paste: [[2502.20880]] Adaptive Identification of Blurred Regions for Accurate Image Deblurring(https://arxiv.org/abs/2502.20880)
Keywords: restoration
Abstract: Image deblurring aims to restore high-quality images from blurred ones. While existing deblurring methods have made significant progress, most overlook the fact that the degradation degree varies across different regions. In this paper, we propose AIBNet, a network that adaptively identifies the blurred regions, enabling differential restoration of these regions. Specifically, we design a spatial feature differential handling block (SFDHBlock), with the core being the spatial domain feature enhancement module (SFEM). Through the feature difference operation, SFEM not only helps the model focus on the key information in the blurred regions but also eliminates the interference of implicit noise. Additionally, based on the fact that the difference between sharp and blurred images primarily lies in the high-frequency components, we propose a high-frequency feature selection block (HFSBlock). The HFSBlock first uses learnable filters to extract high-frequency features and then selectively retains the most important ones. To fully leverage the decoder's potential, we use a pre-trained model as the encoder and incorporate the above modules only in the decoder. Finally, to alleviate the resource burden during training, we introduce a progressive training strategy. Extensive experiments demonstrate that our AIBNet achieves superior performance in image deblurring.
摘要：图像DeBlurring旨在恢复模糊图像的高质量图像。尽管现有的脱张方法取得了重大进展，但大多数人忽略了降解程度在不同地区各不相同的事实。在本文中，我们提出了AIBNET，该网络是一种适应性地识别模糊区域的网络，从而实现了这些区域的差异恢复。具体而言，我们设计了一个空间特征差分处理块（SFDHBLOCK），其中核心是空间域特征增强模块（SFEM）。通过特征差异操作，SFEM不仅可以帮助模型关注模糊区域中的关键信息，而且还消除了隐式噪声的干扰。此外，基于尖锐图像和模糊图像之间的差异主要在于高频组件之间，我们提出了一个高频特征选择块（HFSBlock）。 HFSBlock首先使用可学习的过滤器来提取高频功能，然后选择性保留最重要的滤镜。为了充分利用解码器的潜力，我们使用预先训练的模型作为编码器，并仅将上述模块合并到解码器中。最后，为了减轻培训期间的资源负担，我们引入了一种进步的培训策略。广泛的实验表明，我们的AIBNET在图像脱毛上实现了卓越的性能。

Title: DiffBrush:Just Painting the Art by Your Hands

Authors: Jiaming Chu, Lei Jin, Tao Wang, Junliang Xing, Jian Zhao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2502.20904
Pdf URL: https://arxiv.org/pdf/2502.20904
Copy Paste: [[2502.20904]] DiffBrush:Just Painting the Art by Your Hands(https://arxiv.org/abs/2502.20904)
Keywords: generation
Abstract: The rapid development of image generation and editing algorithms in recent years has enabled ordinary users to produce realistic images. However, the current AI painting ecosystem predominantly relies on text-driven diffusion models (T2I), which pose challenges in accurately capturing user requirements. Furthermore, achieving compatibility with other modalities incurs substantial training costs. To this end, we introduce DiffBrush, which is compatible with T2I models and allows users to draw and edit images. By manipulating and adapting the internal representation of the diffusion model, DiffBrush guides the model-generated images to converge towards the user's hand-drawn sketches, meeting the user's specific needs without additional training. DiffBrush achieves control over the color, semantics, and instances of objects in images by continuously guiding the latent and instance-level attention map during the denoising process of the diffusion model. In addition, we propose latent regeneration, which refines the randomly sampled noise in the diffusion model to obtain a better image generation layout. Finally, users only need to roughly draw the mask of the instance (acceptable colors) on the canvas, and DiffBrush can naturally generate the corresponding instance at the corresponding location.
摘要：近年来，图像生成和编辑算法的快速开发使普通用户能够生成逼真的图像。但是，当前的AI绘画生态系统主要依赖于文本驱动的扩散模型（T2I），该模型在准确捕获用户需求方面构成了挑战。此外，实现与其他方式的兼容性会带来实质性的培训成本。为此，我们介绍了Diffbrush，它与T2I模型兼容，并允许用户绘制和编辑图像。通过操纵和调整扩散模型的内部表示形式，Diffrush指导模型生成的图像将用户的手绘草图收敛于用户的特定需求，而无需其他培训。 Diffbrush通过在扩散模型的转化过程中连续引导潜在和实例级别的注意力图来控制图像中对象的颜色，语义和实例。此外，我们提出了一个潜在再生，该再生在扩散模型中完善了随机采样的噪声，从而获得了更好的图像生成布局。最后，用户只需要在画布上大致绘制实例（可接受的颜色）的掩码，Diffbrush自然可以在相应的位置生成相应的实例。

Title: BadRefSR: Backdoor Attacks Against Reference-based Image Super Resolution

Authors: Xue Yang, Tao Chen, Lei Guo, Wenbo Jiang, Ji Guo, Yongming Li, Jiaming He
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2502.20943
Pdf URL: https://arxiv.org/pdf/2502.20943
Copy Paste: [[2502.20943]] BadRefSR: Backdoor Attacks Against Reference-based Image Super Resolution(https://arxiv.org/abs/2502.20943)
Keywords: super-resolution
Abstract: Reference-based image super-resolution (RefSR) represents a promising advancement in super-resolution (SR). In contrast to single-image super-resolution (SISR), RefSR leverages an additional reference image to help recover high-frequency details, yet its vulnerability to backdoor attacks has not been explored. To fill this research gap, we propose a novel attack framework called BadRefSR, which embeds backdoors in the RefSR model by adding triggers to the reference images and training with a mixed loss function. Extensive experiments across various backdoor attack settings demonstrate the effectiveness of BadRefSR. The compromised RefSR network performs normally on clean input images, while outputting attacker-specified target images on triggered input images. Our study aims to alert researchers to the potential backdoor risks in RefSR. Codes are available at this https URL.
摘要：基于参考的图像超分辨率（REFSR）代表了超分辨率（SR）的有希望的进步。与单像超分辨率（SISR）相反，REFSR利用了附加的参考图像来帮助恢复高频细节，但尚未探索其对后门攻击的脆弱性。为了填补这一研究差距，我们提出了一个名为BadRefsr的新型攻击框架，该框架通过将触发器添加到参考图像和具有混合损耗功能的训练中，将后门嵌入REFSR模型中。在各种后门攻击环境中进行的广泛实验证明了BadRefsr的有效性。折衷的REFSR网络在干净的输入图像上正常执行，同时输出触发器指定的目标图像在触发的输入图像上。我们的研究旨在提醒研究人员对REFSR的潜在后门风险。代码可在此HTTPS URL上找到。

Title: Generative Uncertainty in Diffusion Models

Authors: Metod Jazbec, Eliot Wong-Toi, Guoxuan Xia, Dan Zhang, Eric Nalisnick, Stephan Mandt
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20946
Pdf URL: https://arxiv.org/pdf/2502.20946
Copy Paste: [[2502.20946]] Generative Uncertainty in Diffusion Models(https://arxiv.org/abs/2502.20946)
Keywords: generative
Abstract: Diffusion models have recently driven significant breakthroughs in generative modeling. While state-of-the-art models produce high-quality samples on average, individual samples can still be low quality. Detecting such samples without human inspection remains a challenging task. To address this, we propose a Bayesian framework for estimating generative uncertainty of synthetic samples. We outline how to make Bayesian inference practical for large, modern generative models and introduce a new semantic likelihood (evaluated in the latent space of a feature extractor) to address the challenges posed by high-dimensional sample spaces. Through our experiments, we demonstrate that the proposed generative uncertainty effectively identifies poor-quality samples and significantly outperforms existing uncertainty-based methods. Notably, our Bayesian framework can be applied post-hoc to any pretrained diffusion or flow matching model (via the Laplace approximation), and we propose simple yet effective techniques to minimize its computational overhead during sampling.
摘要：扩散模型最近在生成建模方面取得了显着突破。尽管最先进的模型平均生产高质量的样本，但单个样品仍然可以是低质量的。在未经人类检查的情况下检测此类样本仍然是一项具有挑战性的任务。为了解决这个问题，我们提出了一个贝叶斯框架，用于估计合成样品的生成不确定性。我们概述了如何使大型现代生成模型的贝叶斯推论实用，并引入了新的语义可能性（在功能提取器的潜在空间中进行评估），以应对高维样品空间带来的挑战。通过我们的实验，我们证明了所提出的生成不确定性有效地识别出质量较差的样本，并显着胜过现有的基于不确定性的方法。值得注意的是，我们的贝叶斯框架可以在事后将其应用于任何预处理的扩散或流匹配模型（通过拉普拉斯近似），我们提出了简单但有效的技术，以最大程度地减少其在采样过程中的计算开销。

Title: Retrieval Augmented Generation for Topic Modeling in Organizational Research: An Introduction with Empirical Demonstration

Authors: Gerion Spielberger, Florian Artinger, Jochen Reb, Rudolf Kerschreiter
Subjects: cs.LG, cs.AI, econ.GN
Abstract URL: https://arxiv.org/abs/2502.20963
Pdf URL: https://arxiv.org/pdf/2502.20963
Copy Paste: [[2502.20963]] Retrieval Augmented Generation for Topic Modeling in Organizational Research: An Introduction with Empirical Demonstration(https://arxiv.org/abs/2502.20963)
Keywords: generation
Abstract: Analyzing textual data is the cornerstone of qualitative research. While traditional methods such as grounded theory and content analysis are widely used, they are labor-intensive and time-consuming. Topic modeling offers an automated complement. Yet, existing approaches, including LLM-based topic modeling, still struggle with issues such as high data preprocessing requirements, interpretability, and reliability. This paper introduces Agentic Retrieval-Augmented Generation (Agentic RAG) as a method for topic modeling with LLMs. It integrates three key components: (1) retrieval, enabling automated access to external data beyond an LLM's pre-trained knowledge; (2) generation, leveraging LLM capabilities for text synthesis; and (3) agent-driven learning, iteratively refining retrieval and query formulation processes. To empirically validate Agentic RAG for topic modeling, we reanalyze a Twitter/X dataset, previously examined by Mu et al. (2024a). Our findings demonstrate that the approach is more efficient and interpretable, and at the same time achieves higher reliability and validity, than both the standard machine learning approach and LLM prompting for topic modeling. These results highlight Agentic RAG's ability to generate semantically relevant and reproducible topics, positioning it as a robust, scalable, and transparent alternative for AI-driven qualitative research in leadership, managerial, and organizational research.
摘要：分析文本数据是定性研究的基石。尽管传统方法（例如接地理论和内容分析）被广泛使用，但它们是劳动密集型且耗时的。主题建模提供自动补充。然而，现有的方法，包括基于LLM的主题建模，仍然在高数据预处理要求，可解释性和可靠性等问题上挣扎。本文介绍了代理检索增强的生成（AgentIc rag），作为使用LLM的主题建模的方法。它集成了三个关键组成部分：（1）检索，使自动访问LLM的预训练知识以外的外部数据访问；（2）生成，利用LLM的文本合成功能；（3）代理驱动的学习，迭代完善的检索和查询配方过程。为了验证主题建模的代理抹布，我们重新分析了一个Twitter/x数据集，Mu等人先前对此进行了检查。（2024a）。我们的发现表明，与标准机器学习方法相比，该方法更有效，更容易解释，并且同时达到了更高的可靠性和有效性，但与LLM相比，提示主题建模。这些结果突出了Agesic Rag生成语义相关和可再现的主题的能力，将其定位为在领导，管理和组织研究方面的AI驱动定性研究中，将其定位为可靠，可扩展和透明的替代方案。

Title: Fine-Grained Retrieval-Augmented Generation for Visual Question Answering

Authors: Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20964
Pdf URL: https://arxiv.org/pdf/2502.20964
Copy Paste: [[2502.20964]] Fine-Grained Retrieval-Augmented Generation for Visual Question Answering(https://arxiv.org/abs/2502.20964)
Keywords: generation
Abstract: Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases. Furthermore, we introduce a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. The proposed KU-RAG framework ensures precise retrieval of relevant knowledge and enhances reasoning capabilities through a knowledge correction chain. Experimental findings demonstrate that our approach significantly boosts the performance of leading KB-VQA methods, achieving improvements of up to 10%.
摘要：视觉问题回答（VQA）专注于通过利用图像中的信息来提供自然语言问题的答案。尽管GPT-4O等尖端的多模式大型语言模型（MLLM）在VQA任务上实现了强劲的性能，但它们在访问域特异性或最新知识方面经常缺乏。为了减轻此问题，检索型发电（RAG）利用外部知识库（KBS）（称为KB-VQA）是一种有希望的方法。然而，将图像转化为文本描述的常规单峰检索技术通常会导致关键视觉细节的丧失。这项研究提出了细粒度的知识单元，该单元将文本片段与存储在矢量数据库中的实体图像合并。此外，我们引入了一个知识单元检索授权的生成框架（KU-rag），该框架（KU-rag）与MLLM集成了细粒度的检索。提出的KU-RAG框架确保了相关知识的精确检索，并通过知识校正链增强了推理能力。实验发现表明，我们的方法显着提高了领先的KB-VQA方法的性能，可提高高达10％。

Title: Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks

Authors: Youran Zhou, Jianzhong Qi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.21034
Pdf URL: https://arxiv.org/pdf/2502.21034
Copy Paste: [[2502.21034]] Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks(https://arxiv.org/abs/2502.21034)
Keywords: generative
Abstract: As E-commerce platforms face surging transactions during major shopping events like Black Friday, stress testing with synthesized data is crucial for resource planning. Most recent studies use Generative Adversarial Networks (GANs) to generate tabular data while ensuring privacy and machine learning utility. However, these methods overlook the computational demands of processing GAN-generated data, making them unsuitable for E-commerce stress testing. This thesis introduces a novel GAN-based approach incorporating query selectivity constraints, a key factor in database transaction processing. We integrate a pre-trained deep neural network to maintain selectivity consistency between real and synthetic data. Our method, tested on five real-world datasets, outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20pct and machine learning utility by up to 6 pct.
摘要：由于电子商务平台在黑色星期五（黑色星期五）等重大购物活动中面临激增的交易，因此使用合成数据进行压力测试对于资源计划至关重要。最近的研究使用生成的对抗网络（GAN）来生成表格数据，同时确保隐私和机器学习实用程序。但是，这些方法忽略了处理GAN生成数据的计算需求，这使它们不适合电子商务压力测试。本文介绍了一种新型基于GAN的方法，其中包含了查询选择性约束，这是数据库事务处理中的关键因素。我们集成了预先训练的深神经网络，以保持真实数据和合成数据之间的选择性一致性。我们的方法在五个现实世界数据集上进行了测试，优于三个最先进的gans和一个VAE模型，最多可提高20PCT和机器学习实用程序的选择性估计精度，最多可提高6 pct。
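Query selectivity, which the abstract above treats as the constraint to preserve between real and synthetic data, is the fraction of rows that satisfy a query predicate. A minimal sketch of comparing it across two tables follows. The rows, predicates, and function names are hypothetical, and the abstract's pre-trained selectivity network is not modeled here.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate (0.0 for an empty table)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


def selectivity_gaps(real_rows, synth_rows, predicates):
    """Absolute selectivity difference between the tables, per predicate."""
    return [
        abs(selectivity(real_rows, p) - selectivity(synth_rows, p))
        for p in predicates
    ]


# Hypothetical real and synthetic transaction tables.
real = [{"price": 10}, {"price": 25}, {"price": 40}, {"price": 55}]
synth = [{"price": 12}, {"price": 18}, {"price": 45}, {"price": 60}]

# Hypothetical range predicates, as a WHERE clause might express them.
predicates = [
    lambda row: row["price"] < 20,
    lambda row: row["price"] >= 40,
]

gaps = selectivity_gaps(real, synth, predicates)
```

A gap near zero for a predicate means a query using it would touch a similar share of rows in both tables, which is what makes synthetic data usable for stress testing.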

Title: Synthesizing Individualized Aging Brains in Health and Disease with Generative Models and Parallel Transport

Authors: Jingru Fu, Yuqi Zheng, Neel Dey, Daniel Ferreira, Rodrigo Moreno
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2502.21049
Pdf URL: https://arxiv.org/pdf/2502.21049
Copy Paste: [[2502.21049]] Synthesizing Individualized Aging Brains in Health and Disease with Generative Models and Parallel Transport(https://arxiv.org/abs/2502.21049)
Keywords: generation, generative
Abstract: Simulating prospective magnetic resonance imaging (MRI) scans from a given individual brain image is challenging, as it requires accounting for canonical changes in aging and/or disease progression while also considering the individual brain's current status and unique characteristics. While current deep generative models can produce high-resolution anatomically accurate templates for population-wide studies, their ability to predict future aging trajectories for individuals remains limited, particularly in capturing subject-specific neuroanatomical variations over time. In this study, we introduce Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution subject-specific longitudinal MRI scans that simulate neurodegeneration in both Alzheimer's disease (AD) and normal aging. InBrainSyn uses a parallel transport algorithm to adapt the population-level aging trajectories learned by a generative deep template network, enabling individualized aging synthesis. As InBrainSyn uses diffeomorphic transformations to simulate aging, the synthesized images are topologically consistent with the original anatomy by design. We evaluated InBrainSyn both quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies - version 3 dataset. Experimentally, InBrainSyn can also model neuroanatomical transitions between normal aging and AD. An evaluation of an external set supports its generalizability. Overall, with only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans, producing personalized longitudinal aging trajectories. The code for InBrainSyn is available at: this https URL.
摘要：从给定的个体大脑形象中模拟前瞻性磁共振成像（MRI）扫描很具有挑战性，因为它需要考虑衰老和/或疾病进展的规范变化，同时还考虑了单个大脑的当前状态和独特的特征。尽管当前的深层生成模型可以为整个人群研究产生高分辨率的解剖学精确模板，但它们预测个人的未来衰老轨迹的能力仍然有限，尤其是在捕获特定主体特定的神经解剖学变异时。在这项研究中，我们引入了个性化的脑合成（Inbrainsnn），这是合成高分辨率受试者特异性纵向MRI扫描的框架，该术语模拟了阿尔茨海默氏病（AD）和正常衰老的神经变性。 Inbrainsns使用平行的传输算法来适应通过生成深层模板网络学到的种群水平的衰老轨迹，从而实现个性化的衰老合成。由于脑类人体使用差异变换来模拟衰老，因此合成图像在拓扑上与设计的原始解剖结构一致。我们从开放式成像研究系列的AD和健康对照组上进行了定量和定性评估Inbrainsyn -版本3数据集。在实验上，Inbrainsn还可以模拟正常衰老和AD之间的神经解剖学转变。对外部集合的评估支持其普遍性。总体而言，仅进行一次基线扫描，Inbrainsn合成了现实的3D时空T1W MRI扫描，产生个性化的纵向老化轨迹。 INBRAINSN的代码可在以下网址提供：此HTTPS URL。

Title: Spatial Reasoning with Denoising Models

Authors: Christopher Wewer, Bart Pogodzinski, Bernt Schiele, Jan Eric Lenssen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.21075
Pdf URL: https://arxiv.org/pdf/2502.21075
Copy Paste: [[2502.21075]] Spatial Reasoning with Denoising Models(https://arxiv.org/abs/2502.21075)
Keywords: generation, generative
Abstract: We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations on a set of unobserved variables, given observations on observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in the case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%.
摘要：我们引入了空间推理模型（SRMS），这是一个框架，可以通过denoing生成模型对连续变量进行推理。鉴于观察到的变量，SRMS推断一组未观察到的变量的连续表示。在空间域（例如扩散和流匹配模型）上的当前生成模型通常会在复杂分布的情况下崩溃为幻觉。为了衡量这一点，我们引入了一组基准任务，以测试生成模型中复杂推理的质量并可以量化幻觉。 SRM框架允许报告有关生成序列化重要性的关键发现，相关顺序以及培训期间的采样策略。它首次证明了发电的订单可以通过denoing网络本身成功预测。使用这些发现，我们可以将特定推理任务的准确性从<1％提高到> 50％。

Title: Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

Authors: Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.21079
Pdf URL: https://arxiv.org/pdf/2502.21079
Copy Paste: [[2502.21079]] Training-free and Adaptive Sparse Attention for Efficient Long Video Generation(https://arxiv.org/abs/2502.21079)
Keywords: generation
Abstract: Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. Firstly, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Secondly, to enable Online Precise Search, we propose the Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that DiTs' sparse pattern and LSE vary w.r.t. inputs, layers, and heads, but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows for precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs, requiring neither additional fine-tuning nor dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing itself as a robust and scalable approach to efficient video generation.
摘要：具有扩散变压器（DIT）的高保真长视频通常受到大量延迟的阻碍，这主要是由于注意机制的计算需求。例如，使用Hunyuanvideo生成8秒的720p视频（110k令牌），大约需要600个Pflops，而注意力计算消耗了约500个Pflops。为了解决这个问题，我们建议ADASPA，第一个动态模式和在线精确搜索稀疏注意方法。首先，为了实现动态模式，我们引入了一个阻止模式，以有效地捕获DITS固有的层次稀疏性。这是基于我们的观察结果，即DIT的稀疏特征在不同方式之间和内部表现出层次结构和阻塞结构。这种阻止的方法大大降低了注意力计算的复杂性，同时保持了生成的视频中的高保真度。其次，为了启用在线精确搜索，我们提出了融合的LSE搜索搜索，并具有头部自适应层次块稀疏的注意力。我们发现DITS稀疏模式和LSE不同的W.R.T.的发现是激励此方法的。输入，层和头部，但在DeNoising步骤中保持不变。通过在跨剥离步骤中利用这种不变性，它适应了DIT的动态性质，并可以精确，实时识别稀疏开销的稀疏指数。 ADASPA被用作自适应，插件解决方案，可以与现有DIT无缝集成，不需要其他微调，也不需要数据集依赖于数据集。广泛的实验验证了ADASPA在各种模型的同时提供了大量加速度，同时保留视频质量，将自己确立为有效的视频生成的强大而可扩展的方法。
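
The FLOP figures quoted in the abstract (about 600 PFLOPs in total, around 500 of them in attention) put a ceiling on what any attention-only acceleration can achieve end to end. The sketch below is a rough Amdahl-style estimate based only on those two numbers. The function name and the attention speedup factors are hypothetical, and the estimate ignores search overhead, memory traffic, and anything else that FLOP counts do not capture; it is not AdaSpa's measured speedup.

```python
def end_to_end_speedup(total_pflops: float, attention_pflops: float,
                       attention_speedup: float) -> float:
    """Amdahl-style estimate: only the attention share of the work gets faster."""
    other_pflops = total_pflops - attention_pflops  # non-attention work, unchanged
    return total_pflops / (other_pflops + attention_pflops / attention_speedup)


if __name__ == "__main__":
    total, attention = 600.0, 500.0  # approximate figures quoted in the abstract
    print(f"attention share of FLOPs: {attention / total:.1%}")
    for factor in (2, 4, 10):  # hypothetical attention speedup factors
        speedup = end_to_end_speedup(total, attention, factor)
        print(f"attention x{factor}: end-to-end x{speedup:.2f}")
    # Limit as the attention cost goes to zero.
    print(f"upper bound: x{total / (total - attention):.1f}")
```

Because attention accounts for roughly five sixths of the work in this example, even a perfect attention kernel could not push the FLOP-based end-to-end speedup beyond the ratio of total to non-attention work.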

Title: Rare event modeling with self-regularized normalizing flows: what can we learn from a single failure?

Authors: Charles Dawson, Van Tran, Max Z. Li, Chuchu Fan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.21110
Pdf URL: https://arxiv.org/pdf/2502.21110
Copy Paste: [[2502.21110]] Rare event modeling with self-regularized normalizing flows: what can we learn from a single failure?(https://arxiv.org/abs/2502.21110)
Keywords: generative
Abstract: Increased deployment of autonomous systems in fields like transportation and robotics has seen a corresponding increase in safety-critical failures. These failures can be difficult to model and debug due to the relative lack of data: compared to tens of thousands of examples from normal operations, we may have only seconds of data leading up to the failure. This scarcity makes it challenging to train generative models of rare failure events, as existing methods risk either overfitting to noise in the limited failure dataset or underfitting due to an overly strong prior. We address this challenge with CalNF, or calibrated normalizing flows, a self-regularized framework for posterior learning from limited data. CalNF achieves state-of-the-art performance on data-limited failure modeling and inverse problems and enables a first-of-a-kind case study into the root causes of the 2022 Southwest Airlines scheduling crisis.
摘要：在运输和机器人技术等领域中自治系统的部署增加了，安全至关重要的故障的增加。由于数据相对缺乏，这些故障可能很难建模和调试：与正常操作中成千上万的示例相比，我们可能只有几秒钟的数据导致故障。这种稀缺性使训练罕见故障事件的生成模型的生成模型具有挑战性，因为现有方法可能会因有限的故障数据集中的噪声过度拟合，或者由于过度强大的先验而导致的噪声不足。我们通过CALNF解决了这一挑战，或校准正常化的流量，这是从有限数据中进行后验学习的自我调节框架。 CALNF在数据限制的故障建模和反问题上实现了最先进的绩效，并可以将第一个案例研究纳入2022年西南航空安排危机的根本原因。

Title: A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images

Authors: Zineb Sordo, Eric Chagnon, Daniela Ushizima
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.21151
Pdf URL: https://arxiv.org/pdf/2502.21151
Copy Paste: [[2502.21151]] A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images(https://arxiv.org/abs/2502.21151)
Keywords: generation, generative
Abstract: This review surveys the state of the art in text-to-image and image-to-image generation within the scope of generative AI. We provide a comparative analysis of three prominent architectures: Variational Autoencoders, Generative Adversarial Networks, and Diffusion Models. For each, we elucidate core concepts, architectural innovations, and practical strengths and limitations, particularly for scientific image understanding. Finally, we discuss critical open challenges and potential future research directions in this rapidly evolving field.
摘要：这篇评论在生成AI的范围内调查了文本对图像和图像对图像生成的最先进。我们提供了三个突出体系结构的比较分析：变化自动编码器，生成对抗网络和扩散模型。对于每个人，我们都阐明了核心概念，建筑创新以及实践优势和局限性，尤其是对于科学形象的理解。最后，我们讨论了这个快速发展的领域的关键开放挑战和潜在的未来研究方向。

Title: Autonomous Curriculum Design via Relative Entropy Based Task Modifications

Authors: Muhammed Yusuf Satici, Jianxun Wang, David L. Roberts
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.21166
Pdf URL: https://arxiv.org/pdf/2502.21166
Copy Paste: [[2502.21166]] Autonomous Curriculum Design via Relative Entropy Based Task Modifications(https://arxiv.org/abs/2502.21166)
Keywords: generation
Abstract: Curriculum learning is a training method in which an agent is first trained on a curriculum of relatively simple tasks related to a target task in an effort to shorten the time required to train on the target task. Autonomous curriculum design involves designing such curricula with no reliance on human knowledge and/or expertise. Finding an efficient and effective way of autonomously designing curricula remains an open problem. We propose a novel approach for automatically designing curricula by leveraging the learner's uncertainty to select curriculum tasks. Our approach measures the uncertainty in the learner's policy using relative entropy, and guides the agent to states of high uncertainty to facilitate learning. Our algorithm supports the generation of autonomous curricula in a self-assessed manner by leveraging the learner's past and current policies, but it also allows the use of teacher-guided design in an instructive setting. We provide theoretical guarantees for the convergence of our algorithm using two time-scale optimization processes. Results show that our algorithm outperforms randomly generated curricula, learning directly on the target task, and the curriculum-learning criteria existing in the literature. We also present two additional heuristic distance measures that could be combined with our relative-entropy approach for further performance improvements.
摘要：课程学习是一种培训方法，其中首先对代理进行了与目标任务相关任务的相对简单任务的课程培训，以缩短培训目标任务所需的时间。自主课程设计涉及此类课程的设计，而无需依赖人类知识和/或专业知识。寻找一种自主设计课程的有效方法仍然是一个空旷的问题。我们通过利用学习者的不确定性来选择课程任务来自动设计课程的新方法。我们的方法使用相对熵来衡量学习者政策的不确定性，并指导代理人对高度不确定性的状态进行促进学习。我们的算法通过利用学习者的过去和当前政策来支持自主课程的产生，但它也允许在有指导的环境中使用教师指导的设计。我们使用两个时间尺度的优化过程为我们的算法收敛提供了理论保证。结果表明，我们的算法的表现优于随机生成的课程，直接学习目标任务以及文献中存在的课程学习标准。我们还提出了另外的两项启发式距离测量，可以将它们与我们的相对渗透方法结合使用，以进一步改善性能。
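
The abstract describes measuring uncertainty in the learner's policy with relative entropy, using the learner's past and current policies, and steering the agent toward states of high uncertainty. A minimal sketch of that idea follows. The abstract does not say which direction of the divergence is used or how states are scored, so the per-state KL(current || past) score, the tabular policies, and the function names here are all assumptions made for illustration, not the authors' algorithm.

```python
import math


def relative_entropy(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero mass in p contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


def most_uncertain_state(past_policy, current_policy):
    """Return the state whose action distribution changed most between policies.

    Both arguments map a state name to a list of action probabilities
    (a hypothetical tabular representation).
    """
    return max(
        current_policy,
        key=lambda s: relative_entropy(current_policy[s], past_policy[s]),
    )


if __name__ == "__main__":
    past = {"A": [0.5, 0.5], "B": [0.9, 0.1]}
    current = {"A": [0.5, 0.5], "B": [0.5, 0.5]}
    for state in current:
        print(state, relative_entropy(current[state], past[state]))
    print("next curriculum state:", most_uncertain_state(past, current))
```

A state where the policy has not changed scores zero, so under this assumed scoring the curriculum would concentrate on states where the learner's behaviour is still shifting.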

Title: QFAL: Quantum Federated Adversarial Learning

Authors: Walid El Maouaki, Nouhaila Innan, Alberto Marchisio, Taoufik Said, Mohamed Bennai, Muhammad Shafique
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2502.21171
Pdf URL: https://arxiv.org/pdf/2502.21171
Copy Paste: [[2502.21171]] QFAL: Quantum Federated Adversarial Learning(https://arxiv.org/abs/2502.21171)
Keywords: generation
Abstract: Quantum federated learning (QFL) merges the privacy advantages of federated systems with the computational potential of quantum neural networks (QNNs), yet its vulnerability to adversarial attacks remains poorly understood. This work pioneers the integration of adversarial training into QFL, proposing a robust framework, quantum federated adversarial learning (QFAL), where clients collaboratively defend against perturbations by combining local adversarial example generation with federated averaging (FedAvg). We systematically evaluate the interplay between three critical factors: client count (5, 10, 15), adversarial training coverage (0-100%), and adversarial attack perturbation strength (epsilon = 0.01-0.5), using the MNIST dataset. Our experimental results show that while fewer clients often yield higher clean-data accuracy, larger federations can more effectively balance accuracy and robustness when partially adversarially trained. Notably, even limited adversarial coverage (e.g., 20%-50%) can significantly improve resilience to moderate perturbations, though at the cost of reduced baseline performance. Conversely, full adversarial training (100%) may regain high clean accuracy but is vulnerable under stronger attacks. These findings underscore an inherent trade-off between robust and standard objectives, which is further complicated by quantum-specific factors. We conclude that a carefully chosen combination of client count and adversarial coverage is critical for mitigating adversarial vulnerabilities in QFL. Moreover, we highlight opportunities for future research, including adaptive adversarial training schedules, more diverse quantum encoding schemes, and personalized defense strategies to further enhance the robustness-accuracy trade-off in real-world quantum federated environments.
摘要：量子联盟学习（QFL）将联合系统的隐私优势与量子神经网络（QNN）的计算潜力合并，但其对对抗性攻击的脆弱性仍然很众所周知。这项工作是将对抗性培训纳入QFL，提出了一个强大的框架，量子联盟的对抗学习（QFAL），客户通过将当地对抗性示例生成与联邦平均相结合（FedAvg）来协作防御扰动（FedAvg）。我们使用MNIST数据集，系统地评估了三个关键因素之间的相互作用：客户计数（5、10、15），对抗训练覆盖率（0-100％）和对抗攻击扰动强度（Epsilon = 0.01-0.5）。我们的实验结果表明，尽管较少的客户通常会产生更高的清洁数据准确性，但在经过部分对手训练时，更大的联合会可以更有效地平衡准确性和鲁棒性。值得注意的是，即使是有限的对抗覆盖率（例如20％-50％）也可以显着提高对中度扰动的弹性，尽管以降低基线性能为代价。相反，全面的对抗训练（100％）可能会恢复高清洁精度，但在更强的攻击下很容易受到伤害。这些发现强调了稳健目标和标准目标之间固有的权衡，这是量子特异性因素更加复杂的。我们得出的结论是，客户数量和对抗性覆盖范围的精心选择的组合对于减轻QFL中的对抗脆弱性至关重要。此外，我们重点介绍了未来研究的机会，包括自适应对抗训练时间表，更多样化的量子编码方案以及个性化的防御策略，以进一步增强现实世界中量子联盟环境中的稳健性 - 准确性权衡。
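
The abstract combines local adversarial example generation with federated averaging (FedAvg). The aggregation step alone can be sketched as below, assuming each client reports a flat list of trainable parameters and the number of local samples it trained on. Plain floats stand in for quantum neural network parameters, and weighting by sample count is the usual FedAvg convention rather than something the abstract states; the local adversarial training step is not modeled.

```python
def fed_avg(client_params, client_sizes):
    """Sample-weighted average of per-client parameter vectors.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes:  number of local training samples for each client.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical clients; the second holds three times as much data.
    params = [[1.0, 2.0], [3.0, 4.0]]
    sizes = [1, 3]
    print(fed_avg(params, sizes))
```

With equal client sizes this reduces to a plain mean, which is the relevant case if every client holds an equally sized shard of the data.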

Title: SYN-LUNGS: Towards Simulating Lung Nodules with Anatomy-Informed Digital Twins for AI Training

Authors: Fakrul Islam Tushar, Lavsen Dahal, Cindy McCabe, Fong Chi Ho, Paul Segars, Ehsan Abadi, Kyle J. Lafata, Ehsan Samei, Joseph Y. Lo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.21187
Pdf URL: https://arxiv.org/pdf/2502.21187
Copy Paste: [[2502.21187]] SYN-LUNGS: Towards Simulating Lung Nodules with Anatomy-Informed Digital Twins for AI Training(https://arxiv.org/abs/2502.21187)
Keywords: generation, generative
Abstract: AI models for lung cancer screening are limited by data scarcity, impacting generalizability and clinical applicability. Generative models address this issue but are constrained by training data variability. We introduce SYN-LUNGS, a framework for generating high-quality 3D CT images with detailed annotations. SYN-LUNGS integrates XCAT3 phantoms for digital twin generation, X-Lesions for nodule simulation (varying size, location, and appearance), and DukeSim for CT image formation with vendor and parameter variability. The dataset includes 3,072 nodule images from 1,044 simulated CT scans, with 512 lesions and 174 digital twins. Models trained on clinical + simulated data outperform clinical-only models, achieving 10% improvement in detection, 2-9% in segmentation and classification, and enhanced […]. By incorporating anatomy-informed simulations, SYN-LUNGS provides a scalable approach for AI model development, particularly in rare disease representation and improving model reliability.
摘要：肺癌筛查的AI模型受数据稀缺性的限制，从而影响普遍性和临床适用性。生成模型解决了这个问题，但受培训数据可变性的约束。我们介绍了Syn-Rungs，这是一种用于生成带有详细注释的高质量3D CT图像的框架。 Syn-Lungs集成了用于数字双胞胎生成的XCAT3幻象，用于结节模拟的X元素（尺寸，位置和外观不同），以及用于CT图像形成的DUKESIM与供应商和参数变异性。该数据集包括来自1,044张模拟CT扫描的3,072个结节图像，其中512个病变和174个数字双胞胎。接受临床 +模拟数据培训的模型仅优于临床模型，在检测中提高了10％，分割和分类的2-9％，并增强了该HTTP URL，融合了解剖学信息模拟的模拟，SynLungs为AI模型开发提供了可扩展的方法，尤其是在稀有疾病表示和改善模型可靠性中。

Title: BAnG: Bidirectional Anchored Generation for Conditional RNA Design

Authors: Roman Klypa, Alberto Bietti, Sergei Grudinin
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2502.21274
Pdf URL: https://arxiv.org/pdf/2502.21274
Copy Paste: [[2502.21274]] BAnG: Bidirectional Anchored Generation for Conditional RNA Design(https://arxiv.org/abs/2502.21274)
Keywords: generation, generative
Abstract: Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of experimentally determined RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNA-BAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.
摘要：设计与特定蛋白质相互作用的RNA分子是实验和计算生物学中的关键挑战。现有的计算方法需要大量的每个特定蛋白质或RNA结构的详细知识实验确定的RNA序列，从而在实践中限制了它们的效用。为了解决这一限制，我们开发了RNA-bang，这是一种基于深度学习的模型，旨在生成无需这些要求的蛋白质相互作用的RNA序列。我们方法的核心是一种新颖的生成方法，即双向锚定生成（BANG），它利用了这样一种观察，即蛋白质结合RNA序列通常包含嵌入更广泛序列环境中的功能结合基序。我们首先验证了与RNA中出现的局部局部图案的通用合成任务的方法，证明了其对现有生成方法的好处。然后，我们在生物序列上评估了我们的模型，显示了其对结合蛋白的条件RNA序列设计的有效性。

Title: Does Generation Require Memorization? Creative Diffusion Models using Ambient Diffusion

Authors: Kulin Shah, Alkis Kalavasis, Adam R. Klivans, Giannis Daras
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.21278
Pdf URL: https://arxiv.org/pdf/2502.21278
Copy Paste: [[2502.21278]] Does Generation Require Memorization? Creative Diffusion Models using Ambient Diffusion(https://arxiv.org/abs/2502.21278)
Keywords: generation, generative
Abstract: There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train the diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing the image quality, for both text-conditional and unconditional models and for a variety of data availability settings.
摘要：有强有力的经验证据表明，最先进的扩散建模范式会导致记忆训练集的模型，尤其是在训练集很小的时候。减轻记忆问题的先前方法通常会导致图像质量下降。是否有可能获得强大而有创造力的生成模型，即实现高生代质量和低记忆的模型？尽管当前的结果是结果的景观，但我们在推动忠诚度和记忆之间的权衡方面取得了重大进展。我们首先提供了理论上的证据，表明在扩散模型中的记忆仅对于在低噪声尺度下的问题（通常用于生成高频细节）所必需。使用这种理论洞察力，我们提出了一种简单的原则方法，使用大噪声尺度上的嘈杂数据训练扩散模型。我们表明，对于文本条件和无条件模型以及各种数据可用性设置，我们的方法大大降低了记忆，而不会降低图像质量。

Title: MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing

Authors: Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.21291
Pdf URL: https://arxiv.org/pdf/2502.21291
Copy Paste: [[2502.21291]] MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing(https://arxiv.org/abs/2502.21291)
Keywords: generation
Abstract: Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Therefore, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. MIGE introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion […]. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: By leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: Learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the new task of instruction-based subject-driven editing. Code and model are publicly available.
摘要：尽管基于扩散的图像产生取得了重大进展，但受试者驱动的生成和基于教学的编辑仍然具有挑战性。现有方法通常分别对其进行处理，在有限的高质量数据和概括不佳的情况下挣扎。但是，这两个任务都需要捕获复杂的视觉变化，同时保持输入和输出之间的一致性。因此，我们提出了MIGE，这是一个统一的框架，使用多模式指令标准化任务表示。它将主题驱动的生成视为在空白画布上的创建和基于指令的编辑，以修改现有图像，建立共享的输入输出公式。 Mige引入了一种新型的多式模式编码器，该编码器将自由形式的多模式指令映射到统一的视觉语言空间中，通过功能融合来整合视觉和语义特征。此HTTP URL统一能够对这两个任务进行联合培训，从而提供两个关键的优势，从而提供两个关键的优势：（1）通过相互培训，促进了视觉和语义的指导性，并促进了视觉效果，并促进了视觉效果，并促进了视觉效果，并促进了视觉效果，并促进了视觉上的进化，并实现了范围，并提供了范围，并促进了视觉上的进步，并提供了范围。基于教学的编辑。（2）概括：以统一格式学习有助于交叉任务知识转移，从而使Mige推广到新颖的组成任务，包括基于教学的主题驱动的编辑。实验表明，MIGE在主题驱动的生成和基于教学的编辑中都表现出色，同时在基于教学的主题驱动编辑的新任务中设置最新任务。代码和模型已在此HTTPS URL上公开可用。

Title: Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos

Authors: Zhiyu Tan, Junyan Wang, Hao Yang, Luozheng Qin, Hesen Chen, Qiang Zhou, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.21314
Pdf URL: https://arxiv.org/pdf/2502.21314
Copy Paste: [[2502.21314]] Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos(https://arxiv.org/abs/2502.21314)
Keywords: generation
Abstract: Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
摘要：文本到视频的生成通过扩散模型的出现表明了有希望的进步，但是现有方法受数据集质量和计算资源的限制。为了解决这些局限性，本文提出了一种全面的方法，可以提高数据策展和模型设计。我们介绍了CFC-VIDS-1M，这是一种通过系统的粗到精细策划管道构建的高质量视频数据集。该管道首先评估了跨多个维度的视频质量，然后是一个细粒度的阶段，该阶段利用视觉模型来增强文本视频对齐和语义丰富度。在策划的数据集对视觉质量和时间连贯性的强调基础上，我们开发了浣熊，浣熊是一种基于变压器的建筑，具有脱钩的时空注意机制。该模型通过旨在有效处理视频生成的复杂性的渐进式四阶段策略进行培训。广泛的实验表明，我们的高质量数据策划和有效培训策略的综合方法会产生视觉吸引力和时间相干的视频，同时保持计算效率。我们将发布我们的数据集，代码和模型。

Title: How far can we go with ImageNet for Text-to-Image generation?

Authors: L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.21318
Pdf URL: https://arxiv.org/pdf/2502.21318
Copy Paste: [[2502.21318]] How far can we go with ImageNet for Text-to-Image generation?(https://arxiv.org/abs/2502.21318)
Keywords: generation
Abstract: Recent text-to-image (T2I) generation models have achieved remarkable results by training on billion-scale datasets, following a "bigger is better" paradigm that prioritizes data quantity over quality. We challenge this established paradigm by demonstrating that strategic data augmentation of small, well-curated datasets can match or outperform models trained on massive web-scraped collections. Using only ImageNet enhanced with well-designed text and image augmentations, we achieve a +2 overall score over SD-XL on GenEval and +5 on DPGBench while using just 1/10th the parameters and 1/1000th the training images. Our results suggest that strategic data augmentation, rather than massive datasets, could offer a more sustainable path forward for T2I generation.
摘要：最近的文本到图像（T2I）生成模型通过对数十亿个数据集进行培训，取得了显着的结果，遵循“更大的IS IS范围”范式，将数据数量优先于质量而优先。我们通过证明小型，精心策划的数据集的战略数据扩大可以匹配或胜过接受大型Web式收藏培训的模型来挑战这种既定的范式。仅使用精心设计的文本和图像增强功能来增强Imagenet，我们仅使用1/10的参数和训练图像的1/10，在Geneval上获得+2的总分，而DPGBench上的SD-XL和+5的总分。我们的结果表明，战略数据增强而不是大量数据集可以为T2I生成提供更可持续的途径。