2025-04-17

Title: Flux Already Knows - Activating Subject-Driven Image Generation without Training

Authors: Hao Kang, Stathi Fotiadis, Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Min Jin Chong, Xin Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11478
Pdf URL: https://arxiv.org/pdf/2504.11478
Copy Paste: [[2504.11478]] Flux Already Knows - Activating Subject-Driven Image Generation without Training(https://arxiv.org/abs/2504.11478)
Keywords: generation
Abstract: We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and simply replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning. This "free lunch" approach is further strengthened by a novel cascade attention design and meta prompting technique, boosting fidelity and versatility. Experimental results show that our method outperforms baselines across multiple key metrics in benchmarks and human preference studies, with trade-offs in certain aspects. Additionally, it supports diverse edits, including logo insertion, virtual try-on, and subject replacement or insertion. These results demonstrate that a pre-trained foundational text-to-image model can enable high-quality, resource-efficient subject-driven generation, opening new possibilities for lightweight customization in downstream applications.
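
Note: the grid-based completion setup in the abstract above can be pictured with a minimal sketch. It is an illustration, not the authors' pipeline. The 2x2 grid, the choice of the bottom-right cell as the region to complete, and the helper name `build_mosaic` are all assumptions, and the cascade attention and meta prompting components are left out. The subject image is replicated into every cell except the target, and a mask marks the cell that an inpainting-style model would fill in.

```python
def build_mosaic(subject, rows=2, cols=2, target=(1, 1), fill=0):
    """Tile `subject` into a rows x cols grid, leaving the `target` cell blank.

    Returns (canvas, mask); mask is 1 where the model should complete the image.
    `subject` is a 2D list of pixel values standing in for a real image.
    """
    h, w = len(subject), len(subject[0])
    canvas = [[fill] * (w * cols) for _ in range(h * rows)]
    mask = [[0] * (w * cols) for _ in range(h * rows)]
    for r in range(rows):
        for c in range(cols):
            for y in range(h):
                for x in range(w):
                    yy, xx = r * h + y, c * w + x
                    if (r, c) == target:
                        mask[yy][xx] = 1  # region left for the model to generate
                    else:
                        canvas[yy][xx] = subject[y][x]  # replicated subject
    return canvas, mask


subject = [[1, 2], [3, 4]]  # toy 2x2 "image" (hypothetical data)
canvas, mask = build_mosaic(subject)
for row in canvas:
    print(row)
```

In a real setting the canvas and mask would be passed to the image model together with a text prompt; how that is done is not specified in the abstract.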

Title: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Authors: Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, Anshumali Shrivastava
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2504.11651
Pdf URL: https://arxiv.org/pdf/2504.11651
Copy Paste: [[2504.11651]] 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float(https://arxiv.org/abs/2504.11651)
Keywords: generation
Abstract: Large Language Models (LLMs) have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at this https URL.
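
Note: the entropy-coding idea in the abstract above can be illustrated with a small sketch. This is not the DFloat11 implementation or its GPU kernel. It assumes Huffman coding as the entropy coder, uses synthetic Gaussian values in place of real model weights, assumes that only the 8-bit exponent field is compressed while sign and mantissa stay fixed-width, and introduces hypothetical helper names (`bf16_exponent`, `huffman_code_lengths`, `average_exponent_bits`). It estimates how many bits the exponent needs on average once frequent exponent values receive shorter codes.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x as a float32.

    BFloat16 uses the same exponent field (rounding carries are ignored here).
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF


def huffman_code_lengths(freqs: dict) -> dict:
    """Map each symbol to its Huffman code length in bits."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def average_exponent_bits(weights) -> float:
    """Average Huffman code length over the exponent fields of `weights`."""
    freqs = Counter(bf16_exponent(w) for w in weights)
    lengths = huffman_code_lengths(freqs)
    total = sum(freqs.values())
    return sum(freqs[s] * lengths[s] for s in freqs) / total


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]  # synthetic stand-in
avg_bits = average_exponent_bits(weights)
print(f"average exponent bits: {avg_bits:.2f} (fixed-width: 8)")
print(f"estimated bits per weight: {1 + 7 + avg_bits:.2f} (BFloat16: 16)")
```

Because Huffman codes are prefix-free, the original exponents can be recovered exactly from the bit stream, which is what makes this kind of compression lossless. The numbers printed for synthetic data say nothing about the savings on real model weights.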

Title: Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation

Authors: Amirhossein Dadashzadeh, Parsa Esmati, Majid Mirmehdi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11669
Pdf URL: https://arxiv.org/pdf/2504.11669
Copy Paste: [[2504.11669]] Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation(https://arxiv.org/abs/2504.11669)
Keywords: generation
Abstract: Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: this https URL

Title: Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics

Authors: Yiran He, Yun Cao, Bowen Yang, Zeyu Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11686
Pdf URL: https://arxiv.org/pdf/2504.11686
Copy Paste: [[2504.11686]] Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics(https://arxiv.org/abs/2504.11686)
Keywords: generation, generative
Abstract: The rapid development of generative AI facilitates content creation and makes image manipulation easier to perform and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.

Title: Learning What NOT to Count

Authors: Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11705
Pdf URL: https://arxiv.org/pdf/2504.11705
Copy Paste: [[2504.11705]] Learning What NOT to Count(https://arxiv.org/abs/2504.11705)
Keywords: generative
Abstract: Few/zero-shot object counting methods reduce the need for extensive annotations but often struggle to distinguish between fine-grained categories, especially when multiple similar objects appear in the same scene. To address this limitation, we propose an annotation-free approach that enables the seamless integration of new fine-grained categories into existing few/zero-shot counting models. By leveraging latent generative models, we synthesize high-quality, category-specific crowded scenes, providing a rich training source for adapting to new categories without manual labeling. Our approach introduces an attention prediction network that identifies fine-grained category boundaries trained using only synthetic pseudo-annotated data. At inference, these fine-grained attention estimates refine the output of existing few/zero-shot counting networks. To benchmark our method, we further introduce the FGTC dataset, a taxonomy-specific fine-grained object counting dataset for natural images. Our method substantially enhances pre-trained state-of-the-art models on fine-grained taxon counting tasks, while using only synthetic data. Code and data to be released upon acceptance.

Title: Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset

Authors: Muhammad Shahid Muneer, Simon S. Woo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11707
Pdf URL: https://arxiv.org/pdf/2504.11707
Copy Paste: [[2504.11707]] Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset(https://arxiv.org/abs/2504.11707)
Keywords: generation
Abstract: In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting the web society. To help prevent misuse of T2I models and create a safer web environment for users, features like NSFW filters and post-hoc security checks are used in these models. However, recent work unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on text and image modalities can easily outplay defensive measures. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. First, this work proposes a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well against existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: this https URL.

Title: Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching

Authors: Aaron Havens, Benjamin Kurt Miller, Bing Yan, Carles Domingo-Enrich, Anuroop Sriram, Brandon Wood, Daniel Levine, Bin Hu, Brandon Amos, Brian Karrer, Xiang Fu, Guan-Horng Liu, Ricky T. Q. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11713
Pdf URL: https://arxiv.org/pdf/2504.11713
Copy Paste: [[2504.11713]] Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching(https://arxiv.org/abs/2504.11713)
Keywords: generation
Abstract: We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both Cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.

Title: EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11732
Pdf URL: https://arxiv.org/pdf/2504.11732
Copy Paste: [[2504.11732]] EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos(https://arxiv.org/abs/2504.11732)
Keywords: generation
Abstract: Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.

Title: DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment

Authors: Li Yu, Situo Wang, Wei Zhou, Moncef Gabbouj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11733
Pdf URL: https://arxiv.org/pdf/2504.11733
Copy Paste: [[2504.11733]] DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment(https://arxiv.org/abs/2504.11733)
Keywords: quality assessment
Abstract: Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework are proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis in the ventral stream, as well as spatial relationship analysis in the dorsal stream. However, CLIP is originally designed for images and lacks the ability to capture temporal and motion information inherent in videos. To address this limitation, this paper proposes a Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components, and integrates them into different stages of the NR-VQA pipeline.

Title: The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

Authors: Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.11739
Pdf URL: https://arxiv.org/pdf/2504.11739
Copy Paste: [[2504.11739]] The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation(https://arxiv.org/abs/2504.11739)
Keywords: generation, generative
Abstract: The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. To address potential inaccuracies and ambiguous details in LLM-generated prompts, RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts. Project website: this https URL (GitHub).

Title: SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation

Authors: Zongye Zhang, Wenrui Cai, Qingjie Liu, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11749
Pdf URL: https://arxiv.org/pdf/2504.11749
Copy Paste: [[2504.11749]] SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation(https://arxiv.org/abs/2504.11749)
Keywords: generation
Abstract: While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degeneration. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and much fewer FLOPs. The code and data are available at: this https URL

Title: GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision

Authors: Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11754
Pdf URL: https://arxiv.org/pdf/2504.11754
Copy Paste: [[2504.11754]] GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision(https://arxiv.org/abs/2504.11754)
Keywords: generative
Abstract: We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars, or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.

Title: DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation

Authors: Sang-Jun Park, Keun-Soo Heo, Dong-Hee Shin, Young-Han Son, Ji-Hye Oh, Tae-Eui Kam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11786
Pdf URL: https://arxiv.org/pdf/2504.11786
Copy Paste: [[2504.11786]] DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation(https://arxiv.org/abs/2504.11786)
Keywords: generation
Abstract: The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.

Title: Real-World Depth Recovery via Structure Uncertainty Modeling and Inaccurate GT Depth Fitting

Authors: Delong Suzhang, Meng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11820
Pdf URL: https://arxiv.org/pdf/2504.11820
Copy Paste: [[2504.11820]] Real-World Depth Recovery via Structure Uncertainty Modeling and Inaccurate GT Depth Fitting(https://arxiv.org/abs/2504.11820)
Keywords: generation
Abstract: The low-quality structure in raw depth maps is prevalent in real-world RGB-D datasets, which makes real-world depth recovery a critical task in recent years. However, the lack of paired raw-ground truth (raw-GT) data in the real world poses challenges for generalized depth recovery. Existing methods insufficiently consider the diversity of structure misalignment in raw depth maps, which leads to poor generalization in real-world depth recovery. Notably, random structure misalignments are not limited to raw depth data but also affect GT depth in real-world datasets. In the proposed method, we tackle the generalization problem from both input and output perspectives. For input, we enrich the diversity of structure misalignment in raw depth maps by designing a new raw depth generation pipeline, which helps the network avoid overfitting to a specific condition. Furthermore, a structure uncertainty module is designed to explicitly identify the misaligned structure for input raw depth maps to better generalize in unseen scenarios. Notably, the well-trained depth foundation model (DFM) can help the structure uncertainty module estimate the structure uncertainty better. For output, a robust feature alignment module is designed to precisely align with the accurate structure of RGB images, avoiding the interference of inaccurate GT depth. Extensive experiments on multiple datasets demonstrate that the proposed method achieves competitive accuracy and generalization capabilities across various challenging raw depth maps.

Title: A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification

Authors: Bianca Lamm, Janis Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11838
Pdf URL: https://arxiv.org/pdf/2504.11838
Copy Paste: [[2504.11838]] A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification(https://arxiv.org/abs/2504.11838)
Keywords: generation
Abstract: Despite the rapid evolution of learning and computer vision algorithms, Fine-Grained Classification (FGC) still poses an open problem in many practically relevant applications. In the retail domain, for example, the identification of fast-changing and visually highly similar products and their properties is key to automated price-monitoring and product recommendation. This paper presents a novel Visual RAG pipeline that combines the Retrieval Augmented Generation (RAG) approach and Vision Language Models (VLMs) for few-shot FGC. This Visual RAG pipeline extracts product and promotion data in advertisement leaflets from various retailers and simultaneously predicts fine-grained product ids along with price and discount information. Compared to previous approaches, the key characteristic of the Visual RAG pipeline is that it allows the prediction of novel products without re-training, simply by adding a few class samples to the RAG database. Comparing several VLM back-ends like GPT-4o [23], GPT-4o-mini [24], and Gemini 2.0 Flash [10], our approach achieves 86.8% accuracy on a diverse dataset.
摘要：尽管学习和计算机视觉算法的快速发展，但在许多实际相关的应用中，细粒度的分类（FGC）仍然构成了一个开放的问题。例如，在零售领域中，识别快速变化和视觉高度相似的产品及其性能是自动化价格监视和产品建议的关键。本文提出了一种新颖的视觉抹布管道，结合了少量发射FGC的检索增强生成（RAG）方法和视觉语言模型（VLMS）。这款视觉RAG管道从各种零售商的广告传单中提取产品和促销数据，同时预测了细粒度的产品ID以及价格和折扣信息。与以前的方法相比，视觉抹布管道的关键特征在于，它可以通过在抹布数据库中添加一些类样本就可以预测新产品而不会重新训练。比较了几个VLM后端，例如GPT-4O [23]，GPT-4O-MINI [24]和GEMINI 2.0 Flash [10]，我们的方法在多样化的数据集上实现了86.8％的精度。
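
Editor's sketch: the abstract's central claim is that a new product can be recognized without re-training, simply by adding a few samples to the retrieval database. The Python below illustrates only that property under loud assumptions: the embeddings are hand-written toy vectors, `FewShotIndex` and the product ids are hypothetical names, and similarity-weighted voting stands in for the VLM step that the paper's pipeline actually uses. It is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class FewShotIndex:
    """Hypothetical retrieval database: a flat list of (embedding, product_id) samples."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, product_id):
        # Adding a class sample is the only "training" step in this sketch.
        self.entries.append((list(embedding), product_id))

    def retrieve(self, query, k=3):
        # Rank all stored samples by similarity to the query and keep the top k.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [(pid, cosine(query, emb)) for emb, pid in ranked[:k]]

    def classify(self, query, k=3):
        # Assumption: similarity-weighted voting replaces the VLM that would
        # normally read the retrieved samples and produce the final product id.
        votes = {}
        for pid, score in self.retrieve(query, k):
            votes[pid] = votes.get(pid, 0.0) + score
        return max(votes, key=votes.get)


index = FewShotIndex()
index.add([1.0, 0.0, 0.0], "cola-330ml")
index.add([0.9, 0.1, 0.0], "cola-330ml")
index.add([0.0, 1.0, 0.0], "lemonade-1l")
known = index.classify([0.95, 0.05, 0.0], k=2)

# A novel product becomes predictable after one added sample, with no re-training.
index.add([0.0, 0.0, 1.0], "iced-tea-500ml")
novel = index.classify([0.0, 0.1, 0.9], k=1)
```

In a real system the vectors would come from an image or text encoder, and the retrieved samples would be passed to the VLM as context rather than voted on.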

Title: ACE: Attentional Concept Erasure in Diffusion Models

Authors: Finn Carter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11850
Pdf URL: https://arxiv.org/pdf/2504.11850
Copy Paste: [[2504.11850]] ACE: Attentional Concept Erasure in Diffusion Models(https://arxiv.org/abs/2504.11850)
Keywords: generation
Abstract: Large text-to-image diffusion models have demonstrated remarkable image synthesis capabilities, but their indiscriminate training on Internet-scale data has led to learned concepts that enable harmful, copyrighted, or otherwise undesirable content generation. We address the task of concept erasure in diffusion models, i.e., removing a specified concept from a pre-trained model such that prompting the concept (or related synonyms) no longer yields its depiction, while preserving the model's ability to generate other content. We propose a novel method, Attentional Concept Erasure (ACE), that integrates a closed-form attention manipulation with lightweight fine-tuning. Theoretically, we formulate concept erasure as aligning the model's conditional distribution on the target concept with a neutral distribution. Our approach identifies and nullifies concept-specific latent directions in the cross-attention modules via a gated low-rank adaptation, followed by adversarially augmented fine-tuning to ensure thorough erasure of the concept and its synonyms. Empirically, we demonstrate on multiple benchmarks, including object classes, celebrity faces, explicit content, and artistic styles, that ACE achieves state-of-the-art concept removal efficacy and robustness. Compared to prior methods, ACE better balances generality (erasing concept and related terms) and specificity (preserving unrelated content), scales to dozens of concepts, and is efficient, requiring only a few seconds of adaptation per concept. We will release our code to facilitate safer deployment of diffusion models.
摘要：大型的文本到图像扩散模型已经表现出了显着的图像综合功能，但是它们在互联网规模的数据上进行不加区分的培训导致了学到的概念，这些概念使有害，版权或其他不良的内容产生。我们在扩散模型中解决了概念擦除的任务，即从预训练的模型中删除指定的概念，以促使概念（或相关同义词）不再产生其描写，同时保留模型生成其他内容的能力。我们提出了一种新颖的方法，即注意概念擦除（ACE），该方法将封闭形式的注意操纵与轻巧的微调整合在一起。从理论上讲，我们将概念擦除与中性分布相结合，以使模型在目标概念上的条件分布对齐。我们的方法通过封闭式的低级适应来确定和无效的概念特定的潜在模块中的特定于概念的潜在方向，然后进行对抗增强的微调，以确保彻底删除该概念及其同义词。从经验上讲，我们在多个基准上演示，包括对象类，名人面孔，明确的内容和艺术风格，这些王牌实现了最先进的概念删除功效和鲁棒性。与先前的方法相比，ACE更好地平衡了一般性（擦除概念和相关的术语）和特异性（保留无关的内容），对数十个概念的比例，并且是有效的，每个概念只需要几秒钟的适应性。我们将发布我们的代码，以促进扩散模型的更安全部署。

Title: Synthetic Data for Blood Vessel Network Extraction

Authors: Joël Mathys, Andreas Plesner, Jorel Elmiger, Roger Wattenhofer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11858
Pdf URL: https://arxiv.org/pdf/2504.11858
Copy Paste: [[2504.11858]] Synthetic Data for Blood Vessel Network Extraction(https://arxiv.org/abs/2504.11858)
Keywords: generation
Abstract: Blood vessel networks in the brain play a crucial role in stroke research, where understanding their topology is essential for analyzing blood flow dynamics. However, extracting detailed topological vessel network information from microscopy data remains a significant challenge, mainly due to the scarcity of labeled training data and the need for high topological accuracy. This work combines synthetic data generation with deep learning to automatically extract vessel networks as graphs from volumetric microscopy data. To combat data scarcity, we introduce a comprehensive pipeline for generating large-scale synthetic datasets that mirror the characteristics of real vessel networks. Our three-stage approach progresses from abstract graph generation through vessel mask creation to realistic medical image synthesis, incorporating biological constraints and imaging artifacts at each stage. Using this synthetic data, we develop a two-stage deep learning pipeline of 3D U-Net-based models for node detection and edge prediction. Fine-tuning on real microscopy data shows promising adaptation, improving edge prediction F1 scores from 0.496 to 0.626 by training on merely 5 manually labeled samples. These results suggest that automated vessel network extraction is becoming practically feasible, opening new possibilities for large-scale vascular analysis in stroke research.
摘要：大脑中的血管网络在中风研究中起着至关重要的作用，在这种研究中，了解其拓扑对于分析血流动态至关重要。但是，从显微镜数据中提取详细的拓扑血管网络信息仍然是一个重大挑战，这主要是由于标记的训练数据缺乏，并且需要高拓扑准确性。这项工作将合成数据的生成与深度学习结合在一起，从体积显微镜数据中自动提取血管网络作为图形。为了打击数据稀缺性，我们引入了一条全面的管道，用于生成大规模合成数据集，以反映实际船舶网络的特征。我们的三阶段方法从通过船上遮罩创建产生抽象的图表到现实的医学图像合成，在每个阶段都结合了生物学约束和成像伪影。使用此合成数据，我们开发了一个基于3D U-NET模型的两阶段深度学习管道，以进行节点检测和边缘预测。对实际显微镜数据的微调显示显示出令人鼓舞的适应性，通过仅手动标记的样品进行训练，将边缘预测的F1得分从0.496提高到0.626。这些结果表明，自动化的血管网络提取实际上变得可行，为中风研究中的大规模血管分析开辟了新的可能性。

Title: AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection

Authors: Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11914
Pdf URL: https://arxiv.org/pdf/2504.11914
Copy Paste: [[2504.11914]] AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection(https://arxiv.org/abs/2504.11914)
Keywords: generation
Abstract: Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
摘要：工业异常检测（IAD）由于稀缺性样本而构成了巨大的挑战，因此必须必须部署能够鲁棒性概括以有效地检测出不见异常的模型。传统的方法通常受到手工制作的功能或特定领域的专家模型的约束，难以解决此限制，强调了对范式转变的需求。我们介绍了Anomalyr1，这是一个利用VLM-R1的开创性框架，该框架是一种以其出色的概括和可解释性为众所周知的多模式大型语言模型（MLLM），以彻底改变IAD。通过将MLLM与组相对策略优化（GRPO）集成，通过我们新颖的合理结果对准度量标准（ROAM）增强，AnomalyRR1实现了完全端到端的解决方案，该解决方案可以自主处理图像知识的输入，通过分析来处理分析，并生成精确的盘位定位和掩膜。基于最新的多模式IAD基准测试，我们紧凑的30亿参数模型优于现有方法，建立了最新的结果。随着MLLM能力继续提高，这项研究是第一个提供基于端到VLM的IAD解决方案的研究，该解决方案证明了漫游增强的GRPO的变革性潜力，将我们的框架定位为下一代智能智能异常检测系统在工业应用中具有有限缺陷数据的前瞻性基础。

Title: Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

Authors: Hairui Ren, Fan Tang, He Zhao, Zixuan Wang, Dandan Guo, Yi Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11930
Pdf URL: https://arxiv.org/pdf/2504.11930
Copy Paste: [[2504.11930]] Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning(https://arxiv.org/abs/2504.11930)
Keywords: generation
Abstract: Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), toward learning a richer discriminating way to represent the class comprehensively and thus facilitate classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL), demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
摘要：具有大量未标记数据的微调视觉模型（VLM）最近引起了人们的重大兴趣。但是，一个关键的挑战仍然缺乏高质量的伪标记数据。当前的伪标记策略通常在语义和视觉信息之间的不匹配方面遇到困难，从而导致不受监督的及时学习（UPL）方法的次优性能。在本文中，我们引入了一种简单而有效的方法，称为\ textbf {a} d \ textbf {i}通过扩散（air）进行了精致的\ textbf {r}，以学习一种更丰富的歧视方式，以全面地表示分类，从而有助于分类。具体而言，我们的方法包括一个伪标记的生成模块，该模块利用高保真合成样本来创建辅助分类器，该样品捕获了更丰富的视觉变化，桥接文本图像对分类，以进行更强大的图像图像图分类。此外，我们利用基于扩散的合成样本的多样性来增强及时的学习，为语义 - 视觉对齐提供了更多信息。在包括resisc45和Flowers102在内的五个公共基准以及三个学习范式UL，SSL和TRZSL示出的五个公共基准测试的广泛实验，可以实现空气对最不受欢迎的及时及时学习方法的大量和一致的绩效改进。

Title: R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors

Authors: Haoyang Wang, Liming Liu, Peiheng Wang, Junlin Hao, Jiangkai Wu, Xinggong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11946
Pdf URL: https://arxiv.org/pdf/2504.11946
Copy Paste: [[2504.11946]] R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors(https://arxiv.org/abs/2504.11946)
Keywords: generation
Abstract: Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
摘要：来自多视图图像的网格重建是计算机视觉中的一个基本问题，但是在稀疏视图条件下，其性能会大大降低，尤其是在看不见的区域中，没有可用的地面真相观察。尽管扩散模型的最新进展表现出在综合有限输入的新型观点方面表现出很强的能力，但它们的输出通常会遭受视觉伪像且缺乏3D的一致性，因此对可靠的网格优化构成了挑战。在本文中，我们提出了一个新颖的框架，该框架利用扩散模型以原则可靠的方式增强稀疏视图的重建。为了解决扩散输出的不稳定性，我们提出了一个共识扩散模块，该模块通过四分位间范围（IQR）分析过滤了不可靠的世代，并执行方差感知图像融合以产生可靠的伪符号。在此基础上，我们设计了一种基于上限置信度（UCB）的在线增强学习策略，以适应以扩散损失为指导的最有用的观点以增强。最后，融合图像用于与稀疏视野真理共同监督基于NERF的模型，从而确保几何和外观的一致性。广泛的实验表明，我们的方法在几何质量和渲染质量方面都取得了重大改进。
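
Editor's sketch: the abstract names two standard building blocks, interquartile range (IQR) filtering of unreliable generations and Upper Confidence Bound (UCB) selection of viewpoints. The Python below shows the textbook form of each. The per-sample scores, the reward definition, the 1.5×IQR fence, and the function names `iqr_filter` and `ucb_select` are assumptions for illustration; the paper's exact scoring and its diffusion-loss-based reward are not given in the abstract.

```python
import math
import statistics


def iqr_filter(scores, k=1.5):
    """Return the indices of scores inside [Q1 - k*IQR, Q3 + k*IQR].

    `scores` is assumed to hold one scalar quality measure per generated image.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, s in enumerate(scores) if lo <= s <= hi]


def ucb_select(mean_reward, counts, c=1.0):
    """UCB1 rule: try every arm once, then maximize mean + c * sqrt(2 ln t / n).

    Each arm stands for a candidate viewpoint; the reward is a placeholder.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    bounds = [m + c * math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(mean_reward, counts)]
    return max(range(len(bounds)), key=bounds.__getitem__)


# One generation scores far from the rest and is dropped before fusion.
kept = iqr_filter([0.10, 0.12, 0.11, 0.13, 0.95])

# A rarely visited viewpoint receives a large exploration bonus.
explore = ucb_select([0.5, 0.4], [100, 1])
# With equal visit counts the bonus cancels and the better mean wins.
exploit = ucb_select([0.9, 0.1], [50, 50])
```

The exploration constant `c` trades off revisiting viewpoints that already look informative against sampling ones that have been enhanced only a few times.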

Title: Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions

Authors: Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11967
Pdf URL: https://arxiv.org/pdf/2504.11967
Copy Paste: [[2504.11967]] Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions(https://arxiv.org/abs/2504.11967)
Keywords: generation
Abstract: Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives (classification, detection, and tracking) while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
摘要：无人驾驶汽车（UAV）对于基础设施检查，监视和相关任务是必不可少的，但它们也引入了关键的安全挑战。这项调查提供了对抗UAV领域的广泛检查，以三个核心目标分类，检测和跟踪 - 详细介绍了新兴方法，例如基于扩散的数据综合，多模式融合，视觉模型，自我挑选的学习，自我选择的学习和强化学习。我们系统地评估了单模式和多传感器管道（跨越RGB，红外，音频，雷达和RF）的最新解决方案，并讨论大规模以及面向对立的基准测试。我们的分析揭示了实时性能，隐形检测和基于群体的场景的持续差距，从而强调了强大的自适应反UAV系统的紧迫需求。通过强调开放的研究指示，我们旨在在以广泛使用无人机标志的时代来促进和指导下一代防御策略的制定。

Title: Instruction-augmented Multimodal Alignment for Image-Text and Element Matching

Authors: Xinli Yue, JianHui Sun, Junda Lu, Liangchao Yao, Fan Xia, Tianyi Wang, Fengyun Rao, Jing Lyu, Yuetang Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12018
Pdf URL: https://arxiv.org/pdf/2504.12018
Copy Paste: [[2504.12018]] Instruction-augmented Multimodal Alignment for Image-Text and Element Matching(https://arxiv.org/abs/2504.12018)
Keywords: generation, quality assessment
Abstract: With the rapid advancement of text-to-image (T2I) generation models, assessing the semantic alignment between generated images and text descriptions has become a significant research challenge. Current methods, including those based on Visual Question Answering (VQA), still struggle with fine-grained assessments and precise quantification of image-text alignment. This paper presents an improved evaluation method named Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (iMatch), which evaluates image-text semantic alignment by fine-tuning multimodal large language models. We introduce four innovative augmentation strategies: First, the QAlign strategy creates a precise probabilistic mapping to convert discrete scores from multimodal large language models into continuous matching scores. Second, a validation set augmentation strategy uses pseudo-labels from model predictions to expand training data, boosting the model's generalization performance. Third, an element augmentation strategy integrates element category labels to refine the model's understanding of image-text matching. Fourth, an image augmentation strategy employs techniques like random lighting to increase the model's robustness. Additionally, we propose prompt type augmentation and score perturbation strategies to further enhance the accuracy of element assessments. Our experimental results show that the iMatch method significantly surpasses existing methods, confirming its effectiveness and practical value. Furthermore, our iMatch won first place in the CVPR NTIRE 2025 Text to Image Generation Model Quality Assessment - Track 1 Image-Text Alignment.
摘要：随着文本到图像（T2I）生成模型的快速发展，评估生成的图像和文本描述之间的语义一致性已成为一个重大的研究挑战。当前的方法，包括基于视觉问题回答的方法（VQA），仍然在精细粒度评估和图像文本对齐的精确量化方面挣扎。本文提出了一种改进的评估方法，称为图像文本和元素匹配（iMatch）的指定启动的多模式对齐（iMatch），该方法通过微调多模式大语言模型来评估图像文本语义对齐。我们介绍了四种创新的增强策略：首先，Qalign策略创建了精确的概率映射，以将多模式大型模型的离散分数转换为连续匹配的分数。其次，验证集的增强策略使用伪标记，从模型预测来扩展训练数据，从而提高了模型的概括性能。第三，元素增强策略集成了元素类别标签，以完善模型对图像文本匹配的理解。第四，图像增强策略采用随机照明之类的技术来提高模型的鲁棒性。此外，我们提出了及时的增强类型和评分扰动策略，以进一步提高元素评估的准确性。我们的实验结果表明，iMatch方法显着超过了现有方法，证实了其有效性和实践价值。此外，我们的iMatch赢得了CVPR NTIRE 2025文本的第一名，用于图像生成型号质量评估 - 曲目1图像文本对齐。
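
Editor's sketch: the abstract says the QAlign strategy builds a probabilistic mapping from discrete model scores to a continuous matching score, but does not give the formula. One common way to do this, assumed here, is to take a softmax over the model's logits for each discrete score level and report the probability-weighted mean. The function name `expected_score` and the five levels 1 to 5 are illustrative choices, not details from the paper.

```python
import math


def expected_score(logits, levels=(1, 2, 3, 4, 5)):
    """Map per-level logits to a continuous score via a softmax-weighted mean.

    `logits[i]` is assumed to be the model's logit for the discrete score `levels[i]`.
    """
    if len(logits) != len(levels):
        raise ValueError("need exactly one logit per score level")
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return sum((e / z) * v for e, v in zip(exps, levels))


uniform = expected_score([0.0, 0.0, 0.0, 0.0, 0.0])    # no preference: middle of the scale
confident = expected_score([0.0, 0.0, 0.0, 0.0, 50.0])  # mass concentrated on level 5
```

Unlike an argmax over levels, this score changes smoothly as probability mass shifts between neighboring levels, so two images the model rates "mostly 4" can still be ranked against each other.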

Title: Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM

Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12048
Pdf URL: https://arxiv.org/pdf/2504.12048
Copy Paste: [[2504.12048]] Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM(https://arxiv.org/abs/2504.12048)
Keywords: generation
Abstract: Text-to-Video generation, which uses a provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success owing to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide video generation. However, for complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods cannot decompose the overall information into separate scenes and fail to change scenes smoothly according to the corresponding camera-views. To solve these problems, we propose a novel method, Modular-Cam. Specifically, to better understand a given complex prompt, we use a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely used temporal transformer into the diffusion model to ensure continuity within a single scene and propose CamOperator, a modular network-based module that controls the camera movements well. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments demonstrate Modular-Cam's strong capability of generating multi-scene videos together with its ability to achieve fine-grained control of camera movements. Generated results are available at this https URL.
摘要：文本到视频的生成利用提供的文本提示来生成高质量的视频，由于最近开发扩散模型，引起了人们的关注，并取得了巨大的成功。现有方法主要依靠预先训练的文本编码器来捕获语义信息，并使用编码的文本提示来指导视频的生成。但是，当涉及包含动态场景和多个摄像头视图转换的复杂提示时，这些方法无法将整体信息分解为单独的场景，并且无法根据相应的摄像头视图平稳更改场景。为了解决这些问题，我们提出了一种新颖的方法，即模块化摄像头。具体来说，为了更好地理解给定的复杂提示，我们利用大型语言模型分析用户说明并将其与过渡操作一起将其分解为多个场景。为了生成包含与给定摄像头视图的动态场景的视频，我们将广泛使用的颞变压器合并到扩散模型中，以确保单个场景中的连续性并提出了基于模块化网络的模块，该模块可以很好地控制摄像机运动。此外，我们提出了AdaControlnet，该andtrolnet利用ControlNet来确保场景之间的一致性并自适应地调整了生成的视频的色调。广泛的定性和定量实验证明了我们提出的模块化摄像头能够生成多场景视频的强大能力，以及其实现对摄像机运动的细粒度控制的能力。生成的结果可在此HTTPS URL上获得。

Title: Generative Deep Learning Framework for Inverse Design of Fuels

Authors: Kiran K. Yalamanchi, Pinaki Pal, Balaji Mohan, Abdullah S. AlRamadan, Jihad A. Badra, Yuanjiang Pei
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2504.12075
Pdf URL: https://arxiv.org/pdf/2504.12075
Copy Paste: [[2504.12075]] Generative Deep Learning Framework for Inverse Design of Fuels(https://arxiv.org/abs/2504.12075)
Keywords: generative
Abstract: In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model provides a flexible tool for systematically exploring vast chemical spaces, paving the way for discovering fuels with superior anti-knock properties. The demonstrated approach can be readily extended to incorporate additional fuel properties and synthesizability criteria to enhance applicability and reliability for de novo design of new fuels.
摘要：在目前的工作中，开发了与定量结构 - 特质关系（QSPR）技术结合了合作的变分自动编码器（Co-VAE）结构的生成深度学习框架，以实现加速燃料的逆设计。 Co-Vae整合了一个与VAE潜在空间相结合的属性预测成分，增强了分子重建和研究辛烷值（RON）（RON）的准确估计（选择作为感兴趣的燃料特性）。富含策划的RON数据库的GDB-13数据库的子集用于模型培训。进一步利用高参数调整来优化重建保真度，化学有效性和RON预测之间的平衡。然后，使用独立的回归模型来完善RON预测，而差异进化算法则用于有效地导航VAE潜在空间并确定具有高RON的有希望的燃料分子候选物。该方法通过在全面的潜在代表中捕获复杂的结构 - 特性关系来解决传统燃料筛选方法的局限性。生成模型为系统地探索巨大的化学空间提供了一种灵活的工具，为发现具有出色抗旋转特性的燃料铺平了道路。可以很容易地扩展演示的方法，以纳入其他燃料特性和综合性标准，以增强新燃料从头设计的适用性和可靠性。
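
Editor's sketch: the abstract describes a differential evolution algorithm searching the VAE latent space for candidates with high predicted RON. The Python below is a generic DE/rand/1/bin loop maximizing a toy surrogate. Everything specific is an assumption: `toy_ron` is a made-up quadratic with a hidden optimum standing in for the paper's property predictor, the latent space is three-dimensional, and the bounds, population size, `f`, and `cr` are arbitrary defaults.

```python
import random


def differential_evolution(score, dim, bounds=(-3.0, 3.0), pop_size=20,
                           f=0.8, cr=0.9, generations=200, seed=0):
    """Maximize `score` over a box-bounded latent space with DE/rand/1/bin."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fitness = [score(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, none equal to the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)      # this coordinate always takes the mutant
            trial = []
            for d in range(dim):
                if d == j_rand or rng.random() < cr:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    v = min(hi, max(lo, v))  # clip back into the search box
                else:
                    v = pop[i][d]
                trial.append(v)
            s = score(trial)
            if s >= fitness[i]:              # greedy one-to-one replacement
                pop[i], fitness[i] = trial, s
    best = max(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


TARGET = [1.0, -2.0, 0.5]  # hypothetical latent point with the highest toy score


def toy_ron(z):
    """Stand-in for a learned RON predictor: peaks at 100 when z equals TARGET."""
    return 100.0 - sum((zi - ti) ** 2 for zi, ti in zip(z, TARGET))


best_z, best_score = differential_evolution(toy_ron, dim=3)
```

In the setting the abstract describes, the winning latent vector would then be decoded back into a molecule and checked for chemical validity; that decoding step is outside this sketch.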

Title: Generalized Visual Relation Detection with Diffusion Models

Authors: Kaifeng Gao, Siqi Chen, Hanwang Zhang, Jun Xiao, Yueting Zhuang, Qianru Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12100
Pdf URL: https://arxiv.org/pdf/2504.12100
Copy Paste: [[2504.12100]] Generalized Visual Relation Detection with Diffusion Models(https://arxiv.org/abs/2504.12100)
Keywords: generation, generative
Abstract: Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., "ride" can be depicted as "race" and "sit on", from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
摘要：视觉关系检测（VRD）旨在识别图像中对象对之间的关系（或相互作用）。尽管最近的VRD模型取得了令人印象深刻的性能，但它们都仅限于预定义的关系类别，同时未能考虑视觉关系的语义歧义特征。与物体不同，视觉关系的出现总是微妙的，可以从不同角度通过多个谓词单词来描述，例如，``骑行''可以分别从运动和空间位置视图中描绘为``race''和``坐在''和``坐在''上''。为此，我们建议将视觉关系作为连续嵌入，并设计扩散模型，以以条件生成方式（称为diff-vrd）实现广义VRD。我们在潜在空间中建模扩散过程，并将图像中的所有可能关系作为嵌入序列产生。在一代中，受试者对象对的视觉和文本嵌入是有条件的信号，并通过交叉注意进行注入。一代结束后，我们设计了一个随后的匹配阶段，以通过考虑其语义相似性来将关系单词分配给主体对象对。从基于扩散的生成过程中受益，我们的DIFF-VRD能够生成除数据集的预定义类别标签之外的视觉关系。为了正确评估这项通用的VRD任务，我们介绍了两个评估指标，即受图像字幕启发的文本到图像检索和香料PR曲线。人类对象相互作用（HOI）检测和场景图生成（SGG）基准的广泛实验证明了DIFF-VRD的优势和有效性。

Title: Anti-Aesthetics: Protecting Facial Privacy against Customized Text-to-Image Synthesis

Authors: Songping Wang, Yueming Lyu, Shiqi Liu, Ning Li, Tong Tong, Hao Sun, Caifeng Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12129
Pdf URL: https://arxiv.org/pdf/2504.12129
Copy Paste: [[2504.12129]] Anti-Aesthetics: Protecting Facial Privacy against Customized Text-to-Image Synthesis(https://arxiv.org/abs/2504.12129)
Keywords: generation
Abstract: The rise of customized diffusion models has spurred a boom in personalized visual content creation, but also poses risks of malicious misuse, severely threatening personal privacy and copyright protection. Some studies show that the aesthetic properties of images are highly positively correlated with human perception of image quality. Inspired by this, we approach the problem from a novel and intriguing aesthetic perspective to degrade the generation quality of maliciously customized models, thereby achieving better protection of facial identity. Specifically, we propose a Hierarchical Anti-Aesthetic (HAA) framework to fully explore aesthetic cues, which consists of two key branches: 1) Global Anti-Aesthetics: By establishing a global anti-aesthetic reward mechanism and a global anti-aesthetic loss, it can degrade the overall aesthetics of the generated content; 2) Local Anti-Aesthetics: A local anti-aesthetic reward mechanism and a local anti-aesthetic loss are designed to guide adversarial perturbations to disrupt local facial identity. By seamlessly integrating both branches, our HAA effectively achieves the goal of anti-aesthetics from a global to a local level during customized generation. Extensive experiments show that HAA substantially outperforms existing SOTA methods in identity removal, providing a powerful tool for protecting facial privacy and copyright.
摘要：定制扩散模型的兴起激发了个性化的视觉内容创建，但也带来了恶意滥用的风险，严重威胁了个人隐私和版权保护。一些研究表明，图像的美学特性与人类对图像质量的看法高度相关。受到这一点的启发，我们从新颖而有趣的美学观点解决了问题，以降低恶意定制模型的发电质量，从而更好地保护面部身份。具体而言，我们提出了一个分层的抗审美（HAA）框架来充分探索美学提示，该框架由两个关键分支组成：1）全球抗审美剂：通过建立全球抗审美奖励机制和全球抗审美奖励机制和一种全球抗审美损失，它可以脱离生成的审美内容的整体美学； 2）局部反审美学：局部抗审美奖励机制和局部抗审美损失旨在指导对抗性扰动以破坏局部面部身份。通过无缝整合两个分支，我们的HAA有效地实现了从全球到定制生成期间从全球到本地一级的抗习得的目标。广泛的实验表明，HAA的表现胜过现有的SOTA方法，主要是删除身份，为保护面部隐私和版权提供了有力的工具。

Title: Coding-Prior Guided Diffusion Network for Video Deblurring

Authors: Yike Liu, Jianhui Zhang, Haipeng Li, Shuaicheng Liu, Bing Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12222
Pdf URL: https://arxiv.org/pdf/2504.12222
Copy Paste: [[2504.12222]] Coding-Prior Guided Diffusion Network for Video Deblurring(https://arxiv.org/abs/2504.12222)
Keywords: generation, generative
Abstract: While recent video deblurring methods have advanced significantly, they often overlook two valuable sources of prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGDNet, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module network integrates coding priors into a pretrained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate that our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. Both the code and the coding-prior-augmented dataset will be open-sourced.
摘要：虽然最近的视频灭绝方法已经大大提高了，但它们经常忽略两个有价值的先验信息：（1）运动向量（MV）和视频编解码器的编码残差（CR），它们提供了有效的框架间比对线索，以及（2）在预感染的扩散生成生成的模型中嵌入了丰富的现实世界知识。我们提出了CPGDNET，这是一种新型的两阶段框架，可有效利用编码先验和生成扩散先验的高质量脱毛。首先，我们的编码优先特征传播（CPFP）模块利用MVS来实现有效的框架对准，并且CRS生成注意力面罩，解决运动不准确和纹理变化。其次，编码优先控制（CPC）模块网络将编码先验集成到预处理的扩散模型中，从而指导其增强关键区域并综合现实的细节。实验证明我们的方法可实现最先进的感知质量，而IQA指标提高了30％。代码和编码优先的数据集都将是开源的。

Title: Cobra: Efficient Line Art COlorization with BRoAder References

Authors: Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12240
Pdf URL: https://arxiv.org/pdf/2504.12240
Copy Paste: [[2504.12240]] Cobra: Efficient Line Art COlorization with BRoAder References(https://arxiv.org/abs/2504.12240)
Keywords: generation
Abstract: The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: this https URL.
摘要：漫画生产行业需要具有高精度，效率，上下文一致性和灵活控制的基于参考的系列艺术色彩。漫画页面通常涉及各种字符，对象和背景，这会使着色过程复杂化。尽管在图像生成的扩散模型中取得了进步，但其在线艺术色彩仍然有限，面临着与处理广泛的参考图像，耗时的推断和灵活控制有关的挑战。我们研究了有关线条艺术色彩质量的广泛上下文图像指导的必要性。为了应对这些挑战，我们引入了眼镜蛇，这是一种高效且多功能的方法，该方法支持颜色提示并利用200多个参考图像，同时保持低潜伏期。 Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency.结果表明，眼镜蛇通过广泛的上下文参考，可显着提高推理速度和互动性，从而达到关键的工业需求，从而实现准确的线条艺术色彩。我们在项目页面上发布代码和模型：此HTTPS URL。

Title: SIDME: Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction

Authors: Xia Wang, Haiyang Sun, Tiantian Cao, Yueying Sun, Min Feng
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.12245
Pdf URL: https://arxiv.org/pdf/2504.12245
Copy Paste: [[2504.12245]] SIDME: Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction(https://arxiv.org/abs/2504.12245)
Keywords: generation
Abstract: Moiré patterns, resulting from aliasing between object light signals and camera sampling frequencies, often degrade image quality during capture. Traditional demoiréing methods have generally treated images as a whole for processing and training, neglecting the unique signal characteristics of different color channels. Moreover, the randomness and variability of moiré pattern generation pose challenges to the robustness of existing methods when applied to real-world data. To address these issues, this paper presents SIDME (Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction), a novel model designed to generate high-quality visual images by effectively processing moiré patterns. SIDME combines a masked encoder-decoder architecture with self-supervised learning, allowing the model to reconstruct images using the inherent properties of camera sampling frequencies. A key innovation is the random masked image reconstructor, which utilizes an encoder-decoder structure to handle the reconstruction task. Furthermore, since the green channel in camera sampling has a higher sampling frequency compared to red and blue channels, a specialized self-supervised loss function is designed to improve the training efficiency and effectiveness. To ensure the generalization ability of the model, a self-supervised moiré image generation method has been developed to produce a dataset that closely mimics real-world conditions. Extensive experiments demonstrate that SIDME outperforms existing methods in processing real moiré pattern data, showing its superior generalization performance and robustness.
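Summary: Moiré patterns, caused by aliasing between object light signals and camera sampling frequencies, often degrade image quality during capture. Traditional demoiréing methods generally process and train on images as a whole, ignoring the distinct signal characteristics of the individual color channels. In addition, the randomness and variability of moiré pattern generation challenge the robustness of existing methods on real-world data. To address these issues, this paper presents SIDME (Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction), a novel model designed to generate high-quality images by effectively processing moiré patterns. SIDME combines a masked encoder-decoder architecture with self-supervised learning, allowing the model to reconstruct images using the inherent properties of camera sampling frequencies. A key innovation is the random masked image reconstructor, which uses an encoder-decoder structure to handle the reconstruction task. Furthermore, because the green channel has a higher sampling frequency than the red and blue channels in camera sampling, a specialized self-supervised loss function is designed to improve training efficiency and effectiveness. To ensure that the model generalizes, a self-supervised moiré image generation method was developed to produce a dataset that closely mimics real-world conditions. Extensive experiments show that SIDME outperforms existing methods on real moiré pattern data, demonstrating superior generalization and robustness.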
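
The claim that the green channel is sampled more densely than red and blue can be illustrated with a small sketch. It assumes a standard RGGB Bayer color filter array, which the abstract does not name; the function `bayer_channel_fractions` and its parameters are hypothetical and are not taken from the SIDME code.

```python
def bayer_channel_fractions(height, width, pattern=("R", "G", "G", "B")):
    """Fraction of sensor sites per color for a 2x2 filter pattern tiled over the sensor.

    `pattern` lists the 2x2 tile in row-major order. RGGB is an assumption here;
    the abstract only says green is sampled more often than red and blue.
    """
    counts = {"R": 0, "G": 0, "B": 0}
    for row in range(height):
        for col in range(width):
            # Pick the filter color at this site from the repeating 2x2 tile.
            color = pattern[(row % 2) * 2 + (col % 2)]
            counts[color] += 1
    total = height * width
    return {color: n / total for color, n in counts.items()}


fractions = bayer_channel_fractions(4, 4)
print(fractions)
```

Under this assumed layout, green covers half of the sensor sites and red and blue a quarter each, which is the kind of per-channel asymmetry a channel-aware loss could exploit.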

Title: FLIP Reasoning Challenge

Authors: Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12256
Pdf URL: https://arxiv.org/pdf/2504.12256
Copy Paste: [[2504.12256]] FLIP Reasoning Challenge(https://arxiv.org/abs/2504.12256)
Keywords: generation
Abstract: Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-sourced and closed-sourced models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than when using the raw images directly, 69.6% vs. 75.2% for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at this https URL.
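Summary: In recent years, advances in artificial intelligence (AI) have shown that AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images and require them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, using both vision-language models (VLMs) and large language models (LLMs). Results show that even the best open-source and closed-source models reach maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of the images, yielding better results than using the raw images directly: 69.6% vs. 75.2% for Gemini 1.5 Pro. Combining the predictions of 15 models in an ensemble raises accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at this https URL.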
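
The ensemble result above can be illustrated with a minimal sketch. The abstract does not say how the 15 models' predictions are combined, so plain majority voting is an assumption here; the function names and the three-model toy data are hypothetical and are not results from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common answer among the models' votes for one task."""
    # With an odd number of voters and two possible answers, no ties can occur.
    return Counter(votes).most_common(1)[0][0]


def ensemble_accuracy(model_predictions, labels):
    """Accuracy of the majority-vote ensemble over all tasks.

    `model_predictions` holds one list per model, each with one answer per task.
    """
    correct = 0
    for task_index, label in enumerate(labels):
        votes = [predictions[task_index] for predictions in model_predictions]
        if majority_vote(votes) == label:
            correct += 1
    return correct / len(labels)


# Toy data: each task asks which ordering ("L" or "R") is the coherent one.
labels = ["L", "R", "L", "R"]
model_predictions = [
    ["L", "R", "R", "R"],  # hypothetical model A, wrong on the third task
    ["L", "L", "L", "R"],  # hypothetical model B, wrong on the second task
    ["R", "R", "L", "L"],  # hypothetical model C, wrong on the first and last
]
print(ensemble_accuracy(model_predictions, labels))
```

In this toy case every model makes at least one mistake, but the mistakes fall on different tasks, so the vote recovers the correct answer each time. Real ensembles gain less when the models' errors are correlated.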

Title: VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate

Authors: Zhihang Yuan, Rui Xie, Yuzhang Shang, Hanling Zhang, Siyuan Wang, Shengen Yan, Guohao Dai, Yu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12259
Pdf URL: https://arxiv.org/pdf/2504.12259
Copy Paste: [[2504.12259]] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate(https://arxiv.org/abs/2504.12259)
Keywords: generation
Abstract: Diffusion Transformer(DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. (2) A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. (3) A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that VGDFR can achieve a speedup up to 3x for video generation with minimal quality degradation.
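Summary: Diffusion Transformer (DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space according to the motion frequency of the latent content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) a dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates to video segments; (2) a novel latent-space frame merging method that aligns latent representations with their denoised counterparts before merging the redundant ones in low-resolution space; (3) a preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for capturing semantic and local information. Experiments show that VGDFR can achieve a speedup of up to 3x for video generation with minimal quality degradation.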