2025-04-07

Title: Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives

Authors: Matheus Martins Batista
Subjects: cs.CV, cs.LG, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2504.02900
Pdf URL: https://arxiv.org/pdf/2504.02900
Copy Paste: [[2504.02900]] Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives(https://arxiv.org/abs/2504.02900)
Keywords: generative
Abstract: The growing threat posed by deepfake videos, capable of manipulating realities and disseminating misinformation, drives the urgent need for effective detection methods. This work investigates and compares different approaches for identifying deepfakes, focusing on the GenConViT model and its performance relative to other architectures present in the DeepfakeBenchmark. To contextualize the research, the social and legal impacts of deepfakes are addressed, as well as the technical fundamentals of their creation and detection, including digital image processing, machine learning, and artificial neural networks, with emphasis on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. The performance evaluation of the models was conducted using relevant metrics and new datasets established in the literature, such as WildDeepfake and DeepSpeak, aiming to identify the most effective tools in the battle against misinformation and media manipulation. The results indicated that GenConViT, after fine-tuning, exhibited superior performance in terms of accuracy (93.82%) and generalization capacity, surpassing other architectures in the DeepfakeBenchmark on the DeepSpeak dataset. This study contributes to the advancement of deepfake detection techniques and supports the development of more robust and effective solutions against the dissemination of false information.

Title: Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

Authors: Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck W. E. Prinzhorn, Mark Bodracska, Nicu Sebe, Efstratios Gavves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.02918
Pdf URL: https://arxiv.org/pdf/2504.02918
Copy Paste: [[2504.02918]] Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments(https://arxiv.org/abs/2504.02918)
Keywords: generation, generative
Abstract: Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics-informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open-sourced at our project page.

Title: VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

Authors: Xianwei Zhuang, Yuxin Xie, Yufan Deng, Dongchao Yang, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02949
Pdf URL: https://arxiv.org/pdf/2504.02949
Copy Paste: [[2504.02949]] VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning(https://arxiv.org/abs/2504.02949)
Keywords: generation, generative
Abstract: In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available at this https URL.

Title: Localized Definitions and Distributed Reasoning: A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching

Authors: Nooshin Bahador
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.02976
Pdf URL: https://arxiv.org/pdf/2504.02976
Copy Paste: [[2504.02976]] Localized Definitions and Distributed Reasoning: A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching(https://arxiv.org/abs/2504.02976)
Keywords: generation
Abstract: This study investigates the localization of knowledge representation in fine-tuned GPT-2 models using Causal Layer Attribution via Activation Patching (CLAP), a method that identifies critical neural layers responsible for correct answer generation. The model was fine-tuned on 9,958 PubMed abstracts (epilepsy: 20,595 mentions, EEG: 11,674 mentions, seizure: 13,921 mentions) using two configurations with validation loss monitoring for early stopping. CLAP involved (1) caching clean (correct answer) and corrupted (incorrect answer) activations, (2) computing logit difference to quantify model preference, and (3) patching corrupted activations with clean ones to assess recovery. Results revealed three findings: First, patching the first feedforward layer recovered 56% of correct preference, demonstrating that associative knowledge is distributed across multiple layers. Second, patching the final output layer completely restored accuracy (100% recovery), indicating that definitional knowledge is localized. The stronger clean logit difference for definitional questions further supports this localized representation. Third, minimal recovery from convolutional layer patching (13.6%) suggests low-level features contribute marginally to high-level reasoning. Statistical analysis confirmed significant layer-specific effects (p<0.01). These findings demonstrate that factual knowledge is more localized and associative knowledge depends on distributed representations. We also showed that editing efficacy depends on task type. Our findings not only reconcile conflicting observations about localization in model editing but also emphasize using task-adaptive techniques for reliable, interpretable updates.
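Editor's sketch (not from the paper): the abstract reports recovery percentages but does not state how they are normalized. A common convention in activation-patching work, assumed here, is to express the patched logit difference as a fraction of the gap between the clean and corrupted runs. The function names and the toy numbers are hypothetical.

```python
def logit_difference(logits, correct_id, incorrect_id):
    """Model preference: logit of the correct token minus logit of the incorrect one."""
    return logits[correct_id] - logits[incorrect_id]


def recovery_fraction(clean_diff, corrupted_diff, patched_diff):
    """Fraction of the clean-vs-corrupted gap restored by patching.

    Assumed normalization: 0.0 means patching changed nothing,
    1.0 means the clean preference was fully restored.
    """
    gap = clean_diff - corrupted_diff
    if gap == 0:
        raise ValueError("clean and corrupted runs show the same preference")
    return (patched_diff - corrupted_diff) / gap


# Toy logits over a three-token vocabulary (illustrative values only).
clean = logit_difference([0.5, 4.5, 0.0], correct_id=1, incorrect_id=0)      # 4.0
corrupted = logit_difference([2.0, 1.0, 0.0], correct_id=1, incorrect_id=0)  # -1.0
patched = logit_difference([1.0, 2.8, 0.0], correct_id=1, incorrect_id=0)    # about 1.8
print(round(recovery_fraction(clean, corrupted, patched), 2))
```

Under this convention a layer whose patch restores the full clean preference scores 1.0, and a layer with no causal role scores close to 0.0.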

Title: VIP: Video Inpainting Pipeline for Real World Human Removal

Authors: Huiming Sun, Yikang Li, Kangning Yang, Ruineng Li, Daitao Xing, Yangbo Xie, Lan Fu, Kaiyu Zhang, Ming Chen, Jiaming Ding, Jiang Geng, Jie Cai, Zibo Meng, Chiuman Ho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03041
Pdf URL: https://arxiv.org/pdf/2504.03041
Copy Paste: [[2504.03041]] VIP: Video Inpainting Pipeline for Real World Human Removal(https://arxiv.org/abs/2504.03041)
Keywords: generation
Abstract: Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation for precise mask generation. Extensive experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.

Title: How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models

Authors: Pascal Chang, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.03072
Pdf URL: https://arxiv.org/pdf/2504.03072
Copy Paste: [[2504.03072]] How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models(https://arxiv.org/abs/2504.03072)
Keywords: restoration, generation
Abstract: Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed $\int$-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses $\int$-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed $\int$-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation. See this https URL for video results.
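Editor's sketch (not the authors' code): the abstract describes each pixel value as the integral of an underlying infinite-resolution noise field over the pixel area. A minimal way to illustrate the idea, assuming a finite upsampling factor `k` as a stand-in for infinite resolution, is to draw white noise on a grid `k` times finer, sum the `k * k` sub-pixel samples in each pixel, and rescale so the result is still unit-variance Gaussian noise. The function name and parameters are hypothetical, and the noise transport (advection) step from the abstract is not shown.

```python
import numpy as np


def integral_noise(height, width, k, rng):
    """Sample (height, width) noise by integrating a k-times finer white-noise field."""
    fine = rng.standard_normal((height * k, width * k))
    # Sum the k*k sub-pixel samples that cover each pixel area.
    summed = fine.reshape(height, k, width, k).sum(axis=(1, 3))
    # A sum of k*k independent unit-variance samples has variance k*k,
    # so dividing by k restores unit variance.
    return summed / k


rng = np.random.default_rng(0)
noise = integral_noise(64, 64, k=4, rng=rng)
print(noise.shape, float(noise.mean()), float(noise.std()))
```

Because the coarse pixels are defined through the fine field, the fine samples can in principle be moved between pixels before summing, which is what makes a representation of this kind suitable for warping noise across frames.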

Title: SLACK: Attacking LiDAR-based SLAM with Adversarial Point Injections

Authors: Prashant Kumar, Dheeraj Vattikonda, Kshitij Madhav Bhat, Kunal Dargan, Prem Kalra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03089
Pdf URL: https://arxiv.org/pdf/2504.03089
Copy Paste: [[2504.03089]] SLACK: Attacking LiDAR-based SLAM with Adversarial Point Injections(https://arxiv.org/abs/2504.03089)
Keywords: generation, generative
Abstract: The widespread adoption of learning-based methods for LiDAR makes autonomous vehicles vulnerable to adversarial attacks through adversarial point injections (PiJ). It poses serious security challenges for navigation and map generation. Despite its critical nature, no major work exists that studies learning-based attacks on LiDAR-based SLAM. Our work proposes SLACK, an end-to-end deep generative adversarial model to attack LiDAR scans with several point injections without deteriorating LiDAR quality. To facilitate SLACK, we design a novel yet simple autoencoder that augments contrastive learning with segmentation-based attention for precise reconstructions. SLACK demonstrates superior performance on the task of point injections (PiJ) compared to the best baselines on the KITTI and CARLA-64 datasets while maintaining accurate scan quality. We qualitatively and quantitatively demonstrate PiJ attacks using a fraction of LiDAR points. It severely degrades navigation and map quality without deteriorating the LiDAR scan quality.

Title: FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge

Authors: Kahim Wong, Jicheng Zhou, Kemou Li, Yain-Whar Si, Xiaowei Wu, Jiantao Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03128
Pdf URL: https://arxiv.org/pdf/2504.03128
Copy Paste: [[2504.03128]] FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge(https://arxiv.org/abs/2504.03128)
Keywords: generation
Abstract: The proliferation of AI-generated content raises significant concerns about forensic and security issues such as source tracing and copyright protection, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at this https URL.

Title: Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

Authors: Xuran Ma, Yexin Liu, Yaofu Liu, Xianfeng Wu, Mingzhe Zheng, Zihao Wang, Ser-Nam Lim, Harry Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03140
Pdf URL: https://arxiv.org/pdf/2504.03140
Copy Paste: [[2504.03140]] Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models(https://arxiv.org/abs/2504.03140)
Keywords: generation
Abstract: Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, the computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. We address this gap by introducing ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal two key observations: (1) most layers exhibit a consistent preference for either foreground or background regions; (2) predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. These findings inspire us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Our approach substantially reduces computational overhead while preserving visual fidelity. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., a 2.01x speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.

Title: NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

Authors: Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, Zhengzhong Tu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03164
Pdf URL: https://arxiv.org/pdf/2504.03164
Copy Paste: [[2504.03164]] NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving(https://arxiv.org/abs/2504.03164)
Keywords: generation
Abstract: Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.

Title: Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning

Authors: Lucas Choi, Ross Greer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03168
Pdf URL: https://arxiv.org/pdf/2504.03168
Copy Paste: [[2504.03168]] Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning(https://arxiv.org/abs/2504.03168)
Keywords: restoration
Abstract: In this paper, we address a novel image restoration problem relevant to machine learning dataset curation: the detection and removal of noisy mirrored padding artifacts. While data augmentation techniques like padding are necessary for standardizing image dimensions, they can introduce artifacts that degrade model evaluation when datasets are repurposed across domains. We propose a systematic algorithm to precisely delineate the reflection boundary through a minimum mean squared error approach with thresholding and remove reflective padding. Our method effectively identifies the transition between authentic content and its mirrored counterpart, even in the presence of compression or interpolation noise. We demonstrate our algorithm's efficacy on the SHEL5k dataset, showing significant performance improvements in zero-shot object detection tasks using OWLv2, with average precision increasing from 0.47 to 0.61 for hard hat detection and from 0.68 to 0.73 for person detection. By addressing annotation inconsistencies and distorted objects in padded regions, our approach enhances dataset integrity, enabling more reliable model evaluation across computer vision tasks.
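Editor's sketch (not the authors' algorithm): the abstract describes locating the reflection boundary with a minimum mean squared error criterion plus thresholding. A simplified illustration, assuming a 2-D grayscale float image with mirrored padding appended only on the right and with the edge column repeated, is to slide a candidate boundary across the image, compare a strip on its left with the flipped strip on its right, and keep the column with the lowest MSE if it falls below a threshold. The function name, `strip`, and `threshold` are hypothetical.

```python
import numpy as np


def find_reflection_column(img, strip=8, threshold=1e-3):
    """Return the column where mirrored right-hand padding starts, or None.

    Assumes the padding is at least `strip` columns wide and mirrors the
    content across the boundary (the edge column is repeated).
    """
    width = img.shape[1]
    best_col, best_mse = None, np.inf
    for col in range(strip, width - strip + 1):
        left = img[:, col - strip:col]
        right_mirrored = img[:, col:col + strip][:, ::-1]
        mse = float(np.mean((left - right_mirrored) ** 2))
        if mse < best_mse:
            best_col, best_mse = col, mse
    # The threshold tolerates small compression or interpolation noise
    # while rejecting images that contain no mirrored region at all.
    return best_col if best_mse <= threshold else None


rng = np.random.default_rng(0)
content = rng.random((32, 40))
padded = np.pad(content, ((0, 0), (0, 16)), mode="symmetric")
boundary = find_reflection_column(padded)
unpadded = padded[:, :boundary]
print(boundary, unpadded.shape)
```

Real images would need the same search along each edge and a threshold tuned to the compression level; the random test image here only demonstrates the mechanics.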

Title: REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval

Authors: Shabnam Choudhury, Yash Salunkhe, Sarthak Mehrotra, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03169
Pdf URL: https://arxiv.org/pdf/2504.03169
Copy Paste: [[2504.03169]] REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval(https://arxiv.org/abs/2504.03169)
Keywords: generative
Abstract: The rapid expansion of remote sensing image archives demands the development of strong and efficient techniques for content-based image retrieval (RS-CBIR). This paper presents REJEPA (Retrieval with Joint-Embedding Predictive Architecture), an innovative self-supervised framework designed for unimodal RS-CBIR. REJEPA utilises spatially distributed context token encoding to forecast abstract representations of target tokens, effectively capturing high-level semantic features and eliminating unnecessary pixel-level details. In contrast to generative methods that focus on pixel reconstruction or contrastive techniques that depend on negative pairs, REJEPA functions within feature space, achieving a reduction in computational complexity of 40-60% when compared to pixel-reconstruction baselines like Masked Autoencoders (MAE). To guarantee strong and varied representations, REJEPA incorporates Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder collapse by promoting feature diversity and reducing redundancy. The method demonstrates an estimated enhancement in retrieval accuracy of 5.1% on BEN-14K (S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel compared to prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE, ScaleMAE, and SatMAE++, on extensive RS benchmarks BEN-14K (multispectral and SAR data), FMoW-RGB and FMoW-Sentinel. Through effective generalisation across sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for efficient, scalable, and precise RS-CBIR, addressing challenges like varying resolutions, high object density, and complex backgrounds with computational efficiency.
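Editor's sketch (not REJEPA's implementation): the abstract credits Variance-Invariance-Covariance Regularisation (VICReg) with preventing encoder collapse. The three terms can be illustrated in NumPy under the usual VICReg formulation: an invariance term (mean squared error between two embeddings of the same input), a variance term (a hinge that keeps each dimension's standard deviation above 1), and a covariance term (a penalty on off-diagonal covariance entries). The weights below are illustrative assumptions, and how REJEPA combines this loss with its predictive objective is not shown.

```python
import numpy as np


def vicreg_loss(z_a, z_b, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss for two (batch, dim) embedding matrices."""
    n, d = z_a.shape

    # Invariance: the two embeddings of the same sample should agree.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation; penalizes collapse.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalize correlation between different embedding dimensions.
        centered = z - z.mean(axis=0)
        cov = centered.T @ centered / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return inv_w * invariance + var_w * variance + cov_w * covariance


rng = np.random.default_rng(0)
healthy = 2.0 * rng.standard_normal((256, 16))
collapsed = np.zeros((256, 16))
print(vicreg_loss(healthy, healthy), vicreg_loss(collapsed, collapsed))
```

A collapsed encoder that maps every input to the same vector satisfies the invariance term perfectly but is penalized heavily by the variance term, which is the behaviour the abstract refers to.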

Title: MIMRS: A Survey on Masked Image Modeling in Remote Sensing

Authors: Shabnam Choudhury, Akhil Vasim, Michael Schmitt, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03181
Pdf URL: https://arxiv.org/pdf/2504.03181
Copy Paste: [[2504.03181]] MIMRS: A Survey on Masked Image Modeling in Remote Sensing(https://arxiv.org/abs/2504.03181)
Keywords: super-resolution
Abstract: Masked Image Modeling (MIM) is a self-supervised learning technique that involves masking portions of an image, such as pixels, patches, or latent representations, and training models to predict the missing information using the visible context. This approach has emerged as a cornerstone in self-supervised learning, unlocking new possibilities in visual understanding by leveraging unannotated data for pre-training. In remote sensing, MIM addresses challenges such as incomplete data caused by cloud cover, occlusions, and sensor limitations, enabling applications like cloud removal, multi-modal data fusion, and super-resolution. By synthesizing and critically analyzing recent advancements, this survey (MIMRS) is a pioneering effort to chart the landscape of masked image modeling in remote sensing. We highlight state-of-the-art methodologies, applications, and future research directions, providing a foundational review to guide innovation in this rapidly evolving field.

Title: Steerable Anatomical Shape Synthesis with Implicit Neural Representations

Authors: Bram de Wilde, Max T. Rietberg, Guillaume Lajoinie, Jelmer M. Wolterink
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03313
Pdf URL: https://arxiv.org/pdf/2504.03313
Copy Paste: [[2504.03313]] Steerable Anatomical Shape Synthesis with Implicit Neural Representations(https://arxiv.org/abs/2504.03313)
Keywords: generation, generative
Abstract: Generative modeling of anatomical structures plays a crucial role in virtual imaging trials, which allow researchers to perform studies without the costs and constraints inherent to in vivo and phantom studies. For clinical relevance, generative models should allow targeted control to simulate specific patient populations rather than relying on purely random sampling. In this work, we propose a steerable generative model based on implicit neural representations. Implicit neural representations naturally support topology changes, making them well-suited for anatomical structures with varying topology, such as the thyroid. Our model learns a disentangled latent representation, enabling fine-grained control over shape variations. Evaluation includes reconstruction accuracy and anatomical plausibility. Our results demonstrate that the proposed model achieves high-quality shape generation while enabling targeted anatomical modifications.

Title: Optimal Embedding Guided Negative Sample Generation for Knowledge Graph Link Prediction

Authors: Makoto Takamoto, Daniel Oñoro-Rubio, Wiem Ben Rim, Takashi Maruyama, Bhushan Kotnis
Subjects: cs.LG, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.03327
Pdf URL: https://arxiv.org/pdf/2504.03327
Copy Paste: [[2504.03327]] Optimal Embedding Guided Negative Sample Generation for Knowledge Graph Link Prediction(https://arxiv.org/abs/2504.03327)
Keywords: generation
Abstract: Knowledge graph embedding (KGE) models encode the structural information of knowledge graphs to predict new links. Effective training of these models requires distinguishing between positive and negative samples with high precision. Although prior research has shown that improving the quality of negative samples can significantly enhance model accuracy, identifying high-quality negative samples remains a challenging problem. This paper theoretically investigates the condition under which negative samples lead to optimal KG embedding and identifies a sufficient condition for an effective negative sample distribution. Based on this theoretical foundation, we propose Embedding MUtation (EMU), a novel framework that generates negative samples satisfying this condition, in contrast to conventional methods that focus on identifying challenging negative samples within the training data. Importantly, the simplicity of EMU ensures seamless integration with existing KGE models and negative sampling methods. To evaluate its efficacy, we conducted comprehensive experiments across multiple datasets. The results consistently demonstrate significant improvements in link prediction performance across various KGE models and negative sampling methods. Notably, EMU enables performance improvements comparable to those achieved by models with an embedding dimension five times larger. An implementation of the method and experiments are available at this https URL.
摘要：知识图嵌入（KGE）模型编码知识图的结构信息，以预测新链接。对这些模型的有效训练需要高精度地区分正样本和负样本。尽管先前的研究表明，提高负样本的质量可以显着提高模型的准确性，但确定高质量的负样本仍然是一个具有挑战性的问题。本文从理论上研究了负样本导致最佳KG嵌入的条件，并确定了有效的负样本分布的充分条件。 Based on this theoretical foundation, we propose Embedding MUtation (EMU), a novel framework that generates negative samples satisfying this condition, in contrast to conventional methods that focus on identifying challenging negative samples within the training data. 重要的是，EMU的简单性可确保与现有的KGE模型和负抽样方法无缝集成。为了评估其功效，我们在多个数据集中进行了全面的实验。结果始终显示出各种KGE模型和负抽样方法的链路预测性能的显着改善。值得注意的是，EMU带来的性能改进可与嵌入维度大五倍的模型所取得的改进相当。该方法和实验的实现可在此HTTPS URL上获得。

Title: QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning

Authors: Quanxing Xu, Ling Zhou, Xian Zhong, Feifei Zhang, Rubing Huang, Chia-Wen Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03337
Pdf URL: https://arxiv.org/pdf/2504.03337
Copy Paste: [[2504.03337]] QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning(https://arxiv.org/abs/2504.03337)
Keywords: generation
Abstract: Existing debiasing approaches in Visual Question Answering (VQA) primarily focus on enhancing visual learning, integrating auxiliary models, or employing data augmentation strategies. However, these methods exhibit two major drawbacks. First, current debiasing techniques fail to capture the superior relation between images and texts because prevalent learning frameworks do not enable models to extract deeper correlations from highly contrasting samples. Second, they do not assess the relevance between the input question and image during inference, as no prior work has examined the degree of input relevance in debiasing studies. Motivated by these limitations, we propose a novel framework, Optimized Question-Image Relation Learning (QIRL), which employs a generation-based self-supervised learning strategy. Specifically, two modules are introduced to address the aforementioned issues. The Negative Image Generation (NIG) module automatically produces highly irrelevant question-image pairs during training to enhance correlation learning, while the Irrelevant Sample Identification (ISI) module improves model robustness by detecting and filtering irrelevant inputs, thereby reducing prediction errors. Furthermore, to validate our concept of reducing output errors through filtering unrelated question-image inputs, we propose a specialized metric to evaluate the performance of the ISI module. Notably, our approach is model-agnostic and can be integrated with various VQA models. Extensive experiments on VQA-CPv2 and VQA-v2 demonstrate the effectiveness and generalization ability of our method. Among data augmentation strategies, our approach achieves state-of-the-art results.
摘要：视觉问题回答（VQA）中的现有辩护方法主要集中于增强视觉学习，整合辅助模型或采用数据增强策略。但是，这些方法表现出两个主要缺点。首先，当前的偏见技术无法捕获图像和文本之间的较高关系，因为普遍的学习框架无法使模型从高度对比的样本中提取更深的相关性。其次，他们没有评估推理过程中输入问题和图像之间的相关性，因为没有先前的工作检查了偏见研究中的输入相关程度。在这些局限性的推动下，我们提出了一个新颖的框架，优化的问题图像关系学习（QIRL），该学习采用了基于一代的自我监督学习策略。具体来说，引入了两个模块以解决上述问题。负图像产生（NIG）模块会在训练过程中自动产生高度无关的问题图像对，以增强相关性学习，而无关的样品识别（ISI）模块可以通过检测和过滤无关的输入来提高模型鲁棒性，从而减少预测错误。此外，为了通过过滤无关的问题图像输入来验证我们减少输出误差的概念，我们提出了一个专门的指标来评估ISI模块的性能。值得注意的是，我们的方法是模型不合时宜的，可以与各种VQA模型集成。对VQA-CPV2和VQA-V2的广泛实验证明了我们方法的有效性和泛化能力。在数据增强策略中，我们的方法取得了最新的结果。

Title: BitHEP -- The Limits of Low-Precision ML in HEP

Authors: Claudius Krause, Daohan Wang, Ramon Winterhalder
Subjects: cs.LG, hep-ex, hep-ph
Abstract URL: https://arxiv.org/abs/2504.03387
Pdf URL: https://arxiv.org/pdf/2504.03387
Copy Paste: [[2504.03387]] BitHEP -- The Limits of Low-Precision ML in HEP(https://arxiv.org/abs/2504.03387)
Keywords: generation, generative
Abstract: The increasing complexity of modern neural network architectures demands fast and memory-efficient implementations to mitigate computational bottlenecks. In this work, we evaluate the recently proposed BitNet architecture in HEP applications, assessing its performance in classification, regression, and generative modeling tasks. Specifically, we investigate its suitability for quark-gluon discrimination, SMEFT parameter estimation, and detector simulation, comparing its efficiency and accuracy to state-of-the-art methods. Our results show that while BitNet consistently performs competitively in classification tasks, its performance in regression and generation varies with the size and type of the network, highlighting key limitations and potential areas for improvement.
摘要：现代神经网络体系结构的复杂性日益增加，需要快速，记忆效率的实现，以减轻计算瓶颈。在这项工作中，我们评估了HEP应用程序中最近提出的BITNET体系结构，评估其在分类，回归和生成建模任务方面的性能。具体而言，我们调查了其对夸克 - 杜松歧视，SMEFT参数估计和检测器仿真的适用性，将其效率和准确性与最新方法进行了比较。我们的结果表明，虽然Bitnet在分类任务中持续竞争性能，但其回归和发电的性能随网络的大小和类型而变化，突出了关键的局限性和潜在的改进领域。

Title: Autonomous state-space segmentation for Deep-RL sparse reward scenarios

Authors: Gianluca Maselli, Vieri Giuliano Santucci
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.03420
Pdf URL: https://arxiv.org/pdf/2504.03420
Copy Paste: [[2504.03420]] Autonomous state-space segmentation for Deep-RL sparse reward scenarios(https://arxiv.org/abs/2504.03420)
Keywords: generation
Abstract: Dealing with environments with sparse rewards has always been crucial for systems developed to operate in autonomous open-ended learning settings. Intrinsic Motivations could be an effective way to help Deep Reinforcement Learning algorithms learn in such scenarios. In fact, intrinsic reward signals, such as novelty or curiosity, are generally adopted to improve exploration when extrinsic rewards are delayed or absent. Building on previous works, we tackle the problem of learning policies in the presence of sparse rewards by proposing a two-level architecture that alternates an "intrinsically driven" phase of exploration and autonomous sub-goal generation with a phase of sparse-reward, goal-directed policy learning. The idea is to build several small networks, each one specialized in a particular sub-path, and use them as starting points for future exploration without the need to re-explore previously learnt paths from scratch. Two versions of the system have been trained and tested in the Gym SuperMarioBros environment without considering any additional extrinsic reward. The results show the validity of our approach and the importance of autonomously segmenting the environment to generate an efficient path towards the final goal.
摘要：处理稀疏奖励的环境对于在自动开放式学习设置中开发的系统一直至关重要。内在动机可能是帮助深度加强学习算法在这种情况下学习的有效方法。实际上，当延迟或不存在外部奖励时，通常会采用固有的奖励信号，例如新颖性或好奇心，以改善探索。在以前的作品的基础上，我们通过提出一种两级体系结构来解决学习政策的问题，该架构将“本质上驱动”的探索阶段和自主次目标生成阶段交替出现，以实现稀疏奖励，目标指导的政策学习的阶段。这个想法是建立多个小型网络，每个网络专门从事特定的子路径，并将其用作未来探索的起点，而无需从头开始探索以前学习的路径。该系统的两个版本已经在体育馆超级马里奥布罗斯环境中进行了培训和测试，而没有考虑任何其他外部奖励。结果表明，我们的方法的有效性以及自主分割环境的重要性，为实现最终目标的有效途径。

Title: D-Garment: Physics-Conditioned Latent Diffusion for Dynamic Garment Deformations

Authors: Antoine Dumoulin, Adnane Boukhayma, Laurence Boissieux, Bharath Bhushan Damodaran, Pierre Hellier, Stefanie Wuhrer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03468
Pdf URL: https://arxiv.org/pdf/2504.03468
Copy Paste: [[2504.03468]] D-Garment: Physics-Conditioned Latent Diffusion for Dynamic Garment Deformations(https://arxiv.org/abs/2504.03468)
Keywords: generative
Abstract: Adjusting and deforming 3D garments to body shapes, body motion, and cloth material is an important problem in virtual and augmented reality. Applications are numerous, ranging from virtual change rooms to the entertainment and gaming industry. This problem is challenging as garment dynamics influence geometric details such as wrinkling patterns, which depend on physical input including the wearer's body shape and motion, as well as cloth material features. Existing work studies learning-based modeling techniques to generate garment deformations from example data, and physics-inspired simulators to generate realistic garment dynamics. We propose here a learning-based approach trained on data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations for loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion and cloth material. Furthermore, the model can be efficiently fitted to observations captured using vision sensors. We propose to leverage the capability of diffusion models to learn fine-scale detail: we model the 3D garment in a 2D parameter space, and learn a latent diffusion model using this representation, which is independent of the mesh resolution. This makes it possible to condition global and local geometric information on body and material information. We quantitatively and qualitatively evaluate our method on both simulated data and data captured with a multi-view acquisition platform. Compared to strong baselines, our method is more accurate in terms of Chamfer distance.
摘要：在虚拟和增强现实中，调整和变形3D服装是身体形状，身体运动和布料材料是一个重要的问题。应用程序很多，从虚拟更改室到娱乐和游戏行业。这个问题具有挑战性，因为服装动态会影响几何细节，例如皱纹图案，这些细节取决于物理输入，包括佩戴者的身体形状和运动以及布料材料特征。现有的工作研究基于学习的建模技术，从示例数据中生成服装变形，并以物理启发的模拟器生成逼真的服装动力学。我们在这里提出了一种基于学习的方法，该方法对基于物理的模拟器生成的数据培训。与先前的工作相比，我们的3D生成模型学习了宽松的布几何形状的服装变形，尤其是对于大变形和由身体运动和布料材料驱动的动态皱纹。此外，该模型可以有效地适用于使用视觉传感器捕获的观测值。我们建议利用扩散模型的能力学习细节细节：我们在2D参数空间中对3D服装进行建模，并使用此表示的潜在扩散模型与网格分辨率独立学习潜在扩散模型。这允许通过身体和物质信息来调节全球和本地几何信息。我们对使用多视图采集平台捕获的模拟数据和数据进行定量和定性评估我们的方法。与强基础相比，在倒角距离方面，我们的方法更准确。
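
Editor's note on the metric: the abstract above reports accuracy in terms of Chamfer distance. The following is a minimal sketch of one common symmetric variant, assuming squared Euclidean distances averaged in both directions; the abstract does not say which variant the authors use, and the function name `chamfer_distance` is illustrative only.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumption: squared Euclidean distance, mean-reduced per direction.
    Other conventions (unsquared distances, sums) are also in use.
    """
    # Pairwise squared distances, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


# Two points, and the same two points shifted by one unit along z.
points_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points_b = points_a + np.array([0.0, 0.0, 1.0])
distance = chamfer_distance(points_a, points_b)
```

The dense (N, M) distance matrix keeps the sketch short but scales quadratically in memory; garment meshes with many vertices would normally call for a spatial index instead.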

Title: Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis

Authors: Xi Wang, Ziqi He, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03471
Pdf URL: https://arxiv.org/pdf/2504.03471
Copy Paste: [[2504.03471]] Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis(https://arxiv.org/abs/2504.03471)
Keywords: generation
Abstract: Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of the blocks' importance during the inference process, which hinders further exploitation of these blocks to improve image applications. In this study, we first prove theoretically that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during the sampling process. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of the inference process and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: this https URL
摘要：传统扩散模型通常采用U-NET体系结构。先前的研究已经揭示了U-NET中注意力块的作用。但是，他们忽略了推理过程中其重要性的动态演变，这阻碍了他们进一步的剥削以改善图像应用程序。在这项研究中，我们首先从理论上证明，重新对U-NET中变压器块的输出进行重新加权是“免费午餐”，用于在抽样过程中提高信噪比。接下来，我们提出了重要的探测，以揭示和量化在整个转换过程中变压器块重要性的动态变化。最后，我们设计了针对特定图像生成和编辑任务量身定制的基于自适应重要性的重新加权时间表。实验结果表明，我们的方法显着提高了推理过程的效率，并提高了具有身份一致性的样品的美学质量。我们的方法可以无缝集成到任何基于U-NET的架构中。代码：此HTTPS URL

Title: BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution

Authors: Zihao He, Shengchuan Zhang, Runze Hu, Yunhang Shen, Yan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03490
Pdf URL: https://arxiv.org/pdf/2504.03490
Copy Paste: [[2504.03490]] BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution(https://arxiv.org/abs/2504.03490)
Keywords: super-resolution, generation
Abstract: Super-resolution (SR) techniques are critical for enhancing image quality, particularly in scenarios where high-resolution imagery is essential yet limited by hardware constraints. Existing diffusion models for SR have relied predominantly on Gaussian models for noise generation, which often fall short when dealing with the complex and variable texture inherent in natural scenes. To address these deficiencies, we introduce the Bayesian Uncertainty Guided Diffusion Probabilistic Model (BUFF). BUFF distinguishes itself by incorporating a Bayesian network to generate high-resolution uncertainty masks. These masks guide the diffusion process, allowing for the adjustment of noise intensity in a manner that is both context-aware and adaptive. This novel approach not only enhances the fidelity of super-resolved images to their original high-resolution counterparts but also significantly mitigates artifacts and blurring in areas characterized by complex textures and fine details. The model demonstrates exceptional robustness against complex noise patterns and showcases superior adaptability in handling textures and edges within images. Empirical evidence, supported by visual results, illustrates the model's robustness, especially in challenging scenarios, and its effectiveness in addressing common SR issues such as blurring. Experimental evaluations conducted on the DIV2K dataset reveal that BUFF achieves a notable improvement, with a +0.61 increase compared to baseline in SSIM on BSD100, surpassing traditional diffusion approaches by an average additional +0.20dB PSNR gain. These findings underscore the potential of Bayesian methods in enhancing diffusion processes for SR, paving the way for future advancements in the field.
摘要：超分辨率（SR）技术对于增强图像质量至关重要，尤其是在高分辨率图像必不可少但受硬件约束限制的情况下。现有的SR扩散模型主要依赖于高斯的噪声产生模型，在处理自然场景中固有的复杂和可变纹理时，噪声的产生通常不足。为了解决这些缺陷，我们引入了贝叶斯不确定性引导的扩散概率模型（BUFF）。 Buff通过合并贝叶斯网络来产生高分辨率不确定性掩模来区分自己。这些掩模指导扩散过程，以既具有上下文感知和自适应的方式调节噪声强度。这种新颖的方法不仅增强了超级分辨图像的忠诚度，以使其原始的高分辨率对应物，而且显着减轻了人工制品并在以复杂纹理和细节为特征的区域中变得模糊。该模型证明了针对复杂噪声模式的出色鲁棒性，并在处理图像中的纹理和边缘时展示了出色的适应性。在视觉结果的支持下，经验证据说明了该模型的鲁棒性，尤其是在具有挑战性的情况下及其在解决常见的SR问题（例如模糊之类的问题）方面的有效性。在DIV2K数据集上进行的实验评估表明，Buff与BSD100上的SSIM相比，Buff取得了显着的改进，其平均 +0.20dB PSNR增益超过了传统扩散方法。这些发现强调了贝叶斯方法在增强SR的扩散过程中的潜力，为该领域的未来进步铺平了道路。
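
Editor's note on the metric: the abstract above quotes gains in PSNR (in dB). As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The function name `psnr` and the default peak value of 1.0 are assumptions for illustration; the abstract does not describe the authors' evaluation code.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (1.0 for float images
    in [0, 1], 255 for 8-bit images); this default is an assumption.
    """
    ref = reference.astype(np.float64)
    est = estimate.astype(np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
clean = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)
value_db = psnr(clean, noisy)
```

Because the scale is logarithmic, a fixed dB gain corresponds to a fixed ratio of mean squared errors, independent of the starting PSNR.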

Title: Diffusion Active Learning: Towards Data-Driven Experimental Design in Computed Tomography

Authors: Luis Barba, Johannes Kirschner, Tomas Aidukas, Manuel Guizar-Sicairos, Benjamín Béjar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.03491
Pdf URL: https://arxiv.org/pdf/2504.03491
Copy Paste: [[2504.03491]] Diffusion Active Learning: Towards Data-Driven Experimental Design in Computed Tomography(https://arxiv.org/abs/2504.03491)
Keywords: generative
Abstract: We introduce Diffusion Active Learning, a novel approach that combines generative diffusion modeling with data-driven sequential experimental design to adaptively acquire data for inverse problems. Although broadly applicable, we focus on scientific computed tomography (CT) for experimental validation, where structured prior datasets are available, and reducing data requirements directly translates to shorter measurement times and lower X-ray doses. We first pre-train an unconditional diffusion model on domain-specific CT reconstructions. The diffusion model acts as a learned prior that is data-dependent and captures the structure of the underlying data distribution, which is then used in two ways: It drives the active learning process and also improves the quality of the reconstructions. During the active learning loop, we employ a variant of diffusion posterior sampling to generate conditional data samples from the posterior distribution, ensuring consistency with the current measurements. Using these samples, we quantify the uncertainty in the current estimate to select the most informative next measurement. Our results show substantial reductions in data acquisition requirements, corresponding to lower X-ray doses, while simultaneously improving image reconstruction quality across multiple real-world tomography datasets.
摘要：我们引入了扩散活动学习，这是一种新颖的方法，将生成扩散建模与数据驱动的顺序实验设计结合在一起，以适应性地获取反向问题的数据。尽管广泛适用，但我们专注于用于实验验证的科学计算机断层扫描（CT），在此结构化的先验数据集可用，并且数据要求直接转化为较短的测量时间和较低的X射线剂量。我们首先在域特异性CT重建上预先培训无条件扩散模型。扩散模型是一个学先的事先，它与数据相关，并捕获了基础数据分布的结构，然后以两种方式使用该结构：它驱动主动学习过程并提高重建质量。在主动学习循环期间，我们采用了扩散后验采样的变体来从后分布中生成条件数据样本，从而确保与当前测量值保持一致。使用这些样品，我们量化了当前估计值中的不确定性，以选择最有用的下一个测量值。我们的结果表明，数据采集要求的大幅减少，对应于较低的X射线剂量，同时提高了多个现实世界中层析成像数据集的图像重建质量。
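
Editor's note on the selection step: the abstract above describes using posterior samples to quantify uncertainty and pick the most informative next measurement. The sketch below shows one simple way such a rule could look, assuming linear measurement operators and using the variance of the predicted measurements across samples as the score. Both the scoring rule and the name `select_next_measurement` are hypothetical; the abstract does not specify the authors' acquisition criterion.

```python
import numpy as np


def select_next_measurement(samples: np.ndarray, operators: np.ndarray):
    """Pick the candidate measurement whose outcome is most uncertain.

    samples:   (S, n) array of S posterior samples of the unknown image.
    operators: (K, m, n) array of K candidate linear measurement operators
               (hypothetical stand-ins for, e.g., projection angles).
    Returns the index of the chosen candidate and all K scores.
    """
    # Predicted measurement for every candidate and every sample: (K, S, m).
    predicted = np.einsum("kmn,sn->ksm", operators, samples)
    # Variance across samples, summed over the m measurement entries.
    scores = predicted.var(axis=1).sum(axis=-1)
    return int(np.argmax(scores)), scores


# Toy case: samples disagree only in the second coordinate, so the
# operator that observes that coordinate should be selected.
toy_samples = np.array([[0.0, 0.0], [0.0, 2.0]])
toy_operators = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
best_index, all_scores = select_next_measurement(toy_samples, toy_operators)
```

In the loop the abstract outlines, the samples would come from diffusion posterior sampling conditioned on the measurements acquired so far, and the selection would be repeated after each new measurement.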

Title: HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Authors: Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Guan Huang, Lihong Liu, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03536
Pdf URL: https://arxiv.org/pdf/2504.03536
Copy Paste: [[2504.03536]] HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration(https://arxiv.org/abs/2504.03536)
Keywords: restoration, generation, generative
Abstract: Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation to provide initial geometry and appearance priority. Building upon this foundation, HumanFixer is trained to restore 3DGS renderings, which guarantees photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances geometric detail and identity consistency across multiple views. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.
摘要：单像人类重建对于数字人类建模应用至关重要，但仍然是一项极具挑战性的任务。当前方法依靠生成模型来合成多视图图像，以进行后续的3D重建和动画。但是，直接从单个人类图像中产生多个视图遇到了几何不一致，从而导致重建模型中的肢体碎片或模糊的问题。为了应对这些局限性，我们引入了HumanDreamer-X，这是一个新颖的框架，将多视图的人类生成和重建整合到统一的管道中，从而显着增强了重建的3D模型的几何一致性和视觉保真度。在此框架中，3D高斯泼溅用作显式的3D表示，以提供初始的几何形状和外观优先级。在此基础上，HumanFixer经过训练可以恢复3DGS渲染，从而保证了逼真的结果。此外，我们深入研究了与多视图人类生成中的注意机制相关的固有挑战，并提出了一种注意调制策略，该策略有效地增强了多视图之间的几何细节和身份一致性。实验结果表明，我们的方法将生成和重建的PSNR质量指标分别提高了16.45％和12.65％，PSNR最高达到25.62 dB，同时还展示了在真实场景数据上的泛化能力以及对各种人体重建骨干模型的适用性。

Title: Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal

Authors: Yuyang Hu, Suhas Lohit, Ulugbek S. Kamilov, Tim K. Marks
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03607
Pdf URL: https://arxiv.org/pdf/2504.03607
Copy Paste: [[2504.03607]] Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal(https://arxiv.org/abs/2504.03607)
Keywords: restoration
Abstract: Deep learning has achieved some success in addressing the challenge of cloud removal in optical satellite images, by fusing with synthetic aperture radar (SAR) images. Recently, diffusion models have emerged as powerful tools for cloud removal, delivering higher-quality estimation by sampling from cloud-free distributions, compared to earlier methods. However, diffusion models initiate sampling from pure Gaussian noise, which complicates the sampling trajectory and results in suboptimal performance. Also, current methods fall short in effectively fusing SAR and optical data. To address these limitations, we propose Diffusion Bridges for Cloud Removal, DB-CR, which directly bridges between the cloudy and cloud-free image distributions. In addition, we propose a novel multimodal diffusion bridge architecture with a two-branch backbone for multimodal image restoration, incorporating an efficient backbone and dedicated cross-modality fusion blocks to effectively extract and fuse features from synthetic aperture radar (SAR) and optical images. By formulating cloud removal as a diffusion-bridge problem and leveraging this tailored architecture, DB-CR achieves high-fidelity results while being computationally efficient. We evaluated DB-CR on the SEN12MS-CR cloud-removal dataset, demonstrating that it achieves state-of-the-art results.
摘要：深度学习通过与合成孔径雷达（SAR）图像融合来解决光学卫星图像中云去除的挑战方面取得了成功。最近，与早期方法相比，通过从无云分布中进行采样，扩散模型已成为云拆除的强大工具，从而提供了更高质量的估计。但是，扩散模型从纯高斯噪声开始采样，这使采样轨迹复杂化并导致次优性能。同样，当前方法在有效地融合SAR和光学数据方面缺乏。为了解决这些局限性，我们提出了用于去除云的扩散桥，DB-CR，该桥直接在云和无云图像分布之间桥接。此外，我们提出了一种新型的多模式扩散桥体结构，该桥结构具有两个分支的主链，用于多模式图像恢复，并结合了有效的主链和专用的交叉模式融合块，以有效地提取并从Synthetic Aperture雷达（SAR）和光学图像中提取和融合。通过将云的去除作为扩散桥问题并利用这种量身定制的体系结构，DB-CR可以在计算上实现高保真性结果。我们评估了SEN12MS-CR云驱动数据集的DB-CR，表明它可以实现最新的结果。

Title: Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution

Authors: Aref Azizpour, Tai D. Nguyen, Matthew C. Stamm
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.03615
Pdf URL: https://arxiv.org/pdf/2504.03615
Copy Paste: [[2504.03615]] Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution(https://arxiv.org/abs/2504.03615)
Keywords: generative
Abstract: Rapid advances in generative AI have enabled the creation of highly realistic synthetic images, which, while beneficial in many domains, also pose serious risks in terms of disinformation, fraud, and other malicious applications. Current synthetic image identification systems are typically static, relying on feature representations learned from known generators; as new generative models emerge, these systems suffer from severe performance degradation. In this paper, we introduce the concept of an autonomous self-adaptive synthetic media identification system -- one that not only detects synthetic images and attributes them to known sources but also autonomously identifies and incorporates novel generators without human intervention. Our approach leverages an open-set identification strategy with an evolvable embedding space that distinguishes between known and unknown sources. By employing an unsupervised clustering method to aggregate unknown samples into high-confidence clusters and continuously refining its decision boundaries, our system maintains robust detection and attribution performance even as the generative landscape evolves. Extensive experiments demonstrate that our method significantly outperforms existing approaches, marking a crucial step toward universal, adaptable forensic systems in the era of rapidly advancing generative models.
摘要：生成AI的快速进步使创建高度逼真的合成图像，尽管在许多领域中有益，但在虚假信息，欺诈和其他恶意应用程序方面也构成了严重的风险。当前的合成图像识别系统通常是静态的，这取决于从已知发生器中学到的特征表示。随着新的生成模型的出现，这些系统遭受了严重的性能降解。在本文中，我们介绍了自主自适应合成媒体识别系统的概念，该系统不仅检测合成图像并将其归因于已知来源，而且自主鉴定并在没有人类干预的情况下结合了新颖的发电机。我们的方法利用开放式识别策略，具有可转化的嵌入空间，该空间区分已知和未知来源。通过采用一种无监督的聚类方法将未知样本汇总到高信心群集中并不断完善其决策边界，我们的系统即使随着生成景观的发展，我们的系统也保持了强大的检测和归因性能。广泛的实验表明，我们的方法在迅速发展的生成模型的时代迈出了至关重要的一步，标志着朝着通用，适应性法医系统迈出的关键步骤。

Title: VISTA-OCR: Towards generative and interactive end to end OCR models

Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03621
Pdf URL: https://arxiv.org/pdf/2504.03621
Copy Paste: [[2504.03621]] VISTA-OCR: Towards generative and interactive end to end OCR models(https://arxiv.org/abs/2504.03621)
Keywords: generation, generative
Abstract: We introduce VISTA-OCR (Vision and Spatially-aware Text Analysis OCR), a lightweight architecture that unifies text detection and recognition within a single generative model. Unlike conventional methods that require separate branches with dedicated parameters for text recognition and detection, our approach leverages a Transformer decoder to sequentially generate text transcriptions and their spatial coordinates in a unified branch. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase, followed by multitask learning with multimodal token generation. To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during this http URL enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. Although recent Vision Large Language Models (VLLMs) can efficiently perform these tasks, their high computational cost remains a barrier for practical deployment. In contrast, our VISTA$_{\text{omni}}$ variant processes both handwritten and printed documents with only 150M parameters, interactively, by prompting. Extensive experiments on multiple datasets demonstrate that VISTA-OCR achieves better performance compared to state-of-the-art specialized models on standard OCR tasks while showing strong potential for more sophisticated OCR applications, addressing the growing need for interactive OCR systems. All code and annotations for VISTA-OCR will be made publicly available upon acceptance.
摘要：我们介绍VISTA-OCR（视觉和空间感知的文本分析OCR），这是一种轻巧的体系结构，在单个生成模型中统一文本检测和识别。与需要具有专用参数的单独分支以进行文本识别和检测的传统方法不同，我们的方法利用变压器解码器在统一分支中顺序生成文本转录及其空间坐标。 VISTA-OCR构建在编码器-解码器架构上，逐步训练，从视觉特征提取阶段开始，然后是多模式代币生成的多任务学习。 To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during this http URL enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. 尽管最近的视觉大型语言模型（VLLM）可以有效执行这些任务，但它们的高计算成本仍然是实际部署的障碍。相比之下，我们的VISTA$_{\text{omni}}$变体仅用1.5亿参数即可通过提示以交互方式处理手写和打印文档。在多个数据集上进行的广泛实验表明，与标准OCR任务上最先进的专业模型相比，VISTA-OCR取得了更好的性能，同时在更复杂的OCR应用中展现出强大的潜力，满足了对交互式OCR系统不断增长的需求。 VISTA-OCR的所有代码和注释将在论文被接收后公开提供。

Title: Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions

Authors: Ting-Hsuan Liao, Yi Zhou, Yu Shen, Chun-Hao Paul Huang, Saayan Mitra, Jia-Bin Huang, Uttaran Bhattacharya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03639
Pdf URL: https://arxiv.org/pdf/2504.03639
Copy Paste: [[2504.03639]] Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions(https://arxiv.org/abs/2504.03639)
Keywords: generation
Abstract: We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
摘要：我们探索身体形状如何影响人类运动的综合，这是由于学习均质化的，规范的身体形状而经常在现有文本到动作生成方法中被忽略的方面。但是，这种匀浆会扭曲不同体形与其运动动力学之间的自然相关性。我们的方法通过从自然语言提示中产生身体形状可见的人体动作来解决这一差距。我们利用基于有限标量量化的变分自动编码器（FSQ-VAE）将运动量化为离散令牌，然后利用连续的身体形状信息来将这些令牌放回连续的，详细的运动中。此外，我们利用了预处理的语言模型的能力来预测连续形状参数和运动令牌，从而促进了文本对准运动的综合并将其解码为形状吸引的运动。我们在定量和质量上评估我们的方法，并进行全面的感知研究，以证明其在产生形状感知动作方面的功效。
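
Editor's note on the quantizer: the abstract above relies on finite scalar quantization (FSQ) to turn continuous motion features into discrete tokens. The sketch below illustrates the basic FSQ idea under simplifying assumptions: each latent dimension is bounded with `tanh`, rounded to one of a small odd number of integer levels, and the per-dimension codes are packed into a single token index. The function names are illustrative, and the shape-conditioned de-quantization the paper describes is not reproduced here.

```python
import numpy as np


def fsq_quantize(z: np.ndarray, levels) -> np.ndarray:
    """Round each latent dimension to one of a fixed number of levels.

    z:      (..., D) continuous latents.
    levels: D odd integers, the number of levels per dimension
            (odd-only is a simplification made for this sketch).
    Returns integer codes in [-(L - 1) / 2, (L - 1) / 2] per dimension.
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch handles odd level counts only"
    half = (levels - 1) // 2
    # tanh bounds each dimension to (-1, 1) before scaling and rounding.
    return np.rint(np.tanh(z) * half).astype(np.int64)


def codes_to_token(codes: np.ndarray, levels) -> np.ndarray:
    """Pack per-dimension codes into one token index (mixed-radix encoding)."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2  # shift each code to [0, L - 1]
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)


level_counts = [3, 5]  # implicit codebook of 3 * 5 = 15 tokens
latents = np.array([[0.0, 0.0], [10.0, -10.0]])
codes = fsq_quantize(latents, level_counts)
tokens = codes_to_token(codes, level_counts)
```

The codebook is implicit: its size is the product of the level counts, and nothing has to be learned for the quantizer itself. Training through the rounding step would additionally require a straight-through gradient estimator, which is outside the scope of this NumPy sketch.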

Title: MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Authors: Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.03641
Pdf URL: https://arxiv.org/pdf/2504.03641
Copy Paste: [[2504.03641]] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models(https://arxiv.org/abs/2504.03641)
Keywords: generation
Abstract: Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found in this https URL.
摘要：现有的MLLM基准在评估统一的MLLM（U-MLLM）方面面临重大挑战：1）缺乏用于传统任务的标准化基准，从而导致比较不一致； 2）缺乏用于混合模式产生的基准，这无法评估多模式推理能力。我们提出了一个全面的评估框架，旨在系统地评估U-MLLM。我们的基准包括：标准化的传统任务评估。我们从12个数据集中采样样品，涵盖了10项具有30个子任务的任务，确保了整个研究之间的一致和公平的比较。” 2。统一的任务评估。我们介绍了5项测试多模式推理的新任务，包括图像编辑，图像编辑质量编辑，与图像产生和几何推理和几何推理。3。综合模型Benchmark。 Gemini2-Flash与专门的理解（例如Claude-3.5-Sonnet）和生成模型（例如，DALL-E-3）。