2025-04-16

Title: GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA

Authors: Gabriel Bo, Marc Bernardino, Justin Gu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.10490
Pdf URL: https://arxiv.org/pdf/2504.10490
Copy Paste: [[2504.10490]] GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA(https://arxiv.org/abs/2504.10490)
Keywords: generation
Abstract: We explore the potential of integrating learnable and interpretable modules--specifically Kolmogorov-Arnold Networks (KAN) and graph-based representations--within a pre-trained GPT-2 model to enhance multi-task learning accuracy. Motivated by the recent surge in using KAN and graph attention (GAT) architectures in chain-of-thought (CoT) models and debates over their benefits compared to simpler architectures like MLPs, we begin by enhancing a standard self-attention transformer using Low-Rank Adaptation (LoRA), fine-tuning hyperparameters, and incorporating L2 regularization. This approach yields significant improvements. To further boost interpretability and richer representations, we develop two variants that attempt to improve the standard KAN and GAT: Graph LoRA and Hybrid-KAN LoRA (Learnable GPT). However, systematic evaluations reveal that neither variant outperforms the optimized LoRA-enhanced transformer, which achieves 55.249% accuracy on the SST test set, 99.18% on the CFIMDB dev set, and 89.9% paraphrase detection test accuracy. On sonnet generation, we get a CHRF score of 42.097. These findings highlight that efficient parameter adaptation via LoRA remains the most effective strategy for our tasks: sentiment analysis, paraphrase detection, and sonnet generation.
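Note: the abstract's central comparison rests on Low-Rank Adaptation (LoRA). As a rough illustration only, the sketch below shows the general LoRA idea of adding a scaled low-rank product B·A to a frozen weight matrix W. The dimensions, rank, and alpha are hypothetical values chosen for the example and are not taken from the paper, and this is not the authors' implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the effective weight under LoRA."""
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical sizes, chosen only to keep the example small.
d_out, d_in, rank, alpha = 6, 8, 2, 4.0
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen pre-trained weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

w_eff = lora_weight(w, a, b, alpha, rank)

full_params = d_out * d_in            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated by LoRA
```

Because B starts at zero, the adapted layer initially matches the pre-trained one, and only `lora_params` values are trained instead of `full_params`.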

Title: Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains

Authors: Marco Salmè, Lorenzo Tronchin, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10555
Pdf URL: https://arxiv.org/pdf/2504.10555
Copy Paste: [[2504.10555]] Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains(https://arxiv.org/abs/2504.10555)
Keywords: generative
Abstract: Data scarcity remains a critical bottleneck impeding technological advancements across various domains, including but not limited to medicine and precision agriculture. To address this challenge, we explore the potential of Deep Generative Models (DGMs) in producing synthetic data that satisfies the Generative Learning Trilemma: fidelity, diversity, and sampling efficiency. However, recognizing that these criteria alone are insufficient for practical applications, we extend the trilemma to include utility, robustness, and privacy, factors crucial for ensuring the applicability of DGMs in real-world scenarios. Evaluating these metrics becomes particularly challenging in data-scarce environments, as DGMs traditionally rely on large datasets to perform optimally. This limitation is especially pronounced in domains like medicine and precision agriculture, where ensuring acceptable model performance under data constraints is vital. To address these challenges, we assess the Generative Learning Trilemma in data-scarcity settings using state-of-the-art evaluation metrics, comparing three prominent DGMs: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). Furthermore, we propose a comprehensive framework to assess utility, robustness, and privacy in synthetic data generated by DGMs. Our findings demonstrate varying strengths among DGMs, with each model exhibiting unique advantages based on the application context. This study broadens the scope of the Generative Learning Trilemma, aligning it with real-world demands and providing actionable guidance for selecting DGMs tailored to specific applications.

Title: VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification

Authors: Lucas Heublein, Simon Kocher, Tobias Feigl, Alexander Rügamer, Christopher Mutschler, Felix Ott
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2504.10556
Pdf URL: https://arxiv.org/pdf/2504.10556
Copy Paste: [[2504.10556]] VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification(https://arxiv.org/abs/2504.10556)
Keywords: generative
Abstract: Distributed learning and Edge AI necessitate efficient data processing, low-latency communication, decentralized model training, and stringent data privacy to facilitate real-time intelligence on edge devices while reducing dependency on centralized infrastructure and ensuring high model performance. In the context of global navigation satellite system (GNSS) applications, the primary objective is to accurately monitor and classify interferences that degrade system performance in distributed environments, thereby enhancing situational awareness. To achieve this, machine learning (ML) models can be deployed on low-resource devices, ensuring minimal communication latency and preserving data privacy. The key challenge is to compress ML models while maintaining high classification accuracy. In this paper, we propose variational autoencoders (VAEs) for disentanglement to extract essential latent features that enable accurate classification of interferences. We demonstrate that the disentanglement approach can be leveraged for both data compression and data augmentation by interpolating the lower-dimensional latent representations of signal power. To validate our approach, we evaluate three VAE variants - vanilla, factorized, and conditional generative - on four distinct datasets, including two collected in controlled indoor environments and two real-world highway datasets. Additionally, we conduct extensive hyperparameter searches to optimize performance. Our proposed VAE achieves a data compression rate ranging from 512 to 8,192 and achieves an accuracy up to 99.92%.
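Note: the abstract mentions augmenting data by interpolating low-dimensional latent representations and reports a compression rate. The sketch below illustrates both ideas in the simplest possible form, assuming plain linear interpolation between two latent vectors and a compression rate defined as input size divided by latent size. The function names, vectors, and sizes are hypothetical, and the paper may define or implement these differently.

```python
def interpolate_latents(z_a, z_b, t):
    """Linearly interpolate between two latent vectors; t=0 gives z_a, t=1 gives z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def compression_rate(input_size, latent_size):
    """Ratio of input dimensionality to latent dimensionality (assumed definition)."""
    return input_size / latent_size


# Hypothetical latent codes of two encoded samples from the same class.
z_a = [0.0, 2.0]
z_b = [4.0, 6.0]

# A new latent point halfway between them; a decoder would map it back to
# a synthetic sample (the decoder is not shown here).
z_mid = interpolate_latents(z_a, z_b, 0.5)

# Hypothetical sizes: a 16384-value input compressed to 32 latent values.
rate = compression_rate(16384, 32)
```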

Title: Enhancing Image Restoration through Learning Context-Rich and Detail-Accurate Features

Authors: Hu Gao, Depeng Dang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10558
Pdf URL: https://arxiv.org/pdf/2504.10558
Copy Paste: [[2504.10558]] Enhancing Image Restoration through Learning Context-Rich and Detail-Accurate Features(https://arxiv.org/abs/2504.10558)
Keywords: restoration
Abstract: Image restoration involves recovering high-quality images from their corrupted versions, requiring a nuanced balance between spatial details and contextual information. While certain methods address this balance, they predominantly emphasize spatial aspects, neglecting frequency variation comprehension. In this paper, we present a multi-scale design that optimally balances these competing objectives, seamlessly integrating spatial and frequency domain knowledge to selectively recover the most informative content. Specifically, we develop a hybrid scale frequency selection block (HSFSBlock), which not only captures multi-scale information from the spatial domain, but also selects the most informative components for image restoration in the frequency domain. Furthermore, to mitigate the inherent noise introduced by skip connections employing only addition or concatenation, we introduce a skip connection attention mechanism (SCAM) that selectively determines the information that should propagate through skip connections. The resulting tightly interlinked architecture is named LCDNet. Extensive experiments conducted across diverse image restoration tasks showcase that our model attains performance levels that are either superior or comparable to those of state-of-the-art algorithms.

Title: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10567
Pdf URL: https://arxiv.org/pdf/2504.10567
Copy Paste: [[2504.10567]] H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models(https://arxiv.org/abs/2504.10567)
Keywords: generation
Abstract: Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time on mobile devices. We also unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single network. In addition, we find that the widely adopted discriminative losses, i.e., GAN, LPIPS, and DWT losses, provide no significant improvements when training AEs at scale. We propose a novel latent consistency loss that does not require complicated discriminator design or hyperparameter tuning, but provides stable improvements in reconstruction quality. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

Title: Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

Authors: Michal Balcerak, Tamaz Amiranashvili, Suprosanna Shit, Antonio Terpin, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10612
Pdf URL: https://arxiv.org/pdf/2504.10612
Copy Paste: [[2504.10612]] Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling(https://arxiv.org/abs/2504.10612)
Keywords: generation, generative
Abstract: Generative models often map noise to data by matching flows or scores, but these approaches become cumbersome for incorporating partial observations or additional priors. Inspired by recent advances in Wasserstein gradient flows, we propose Energy Matching, a framework that unifies flow-based approaches with the flexibility of energy-based models (EBMs). Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize this dynamic with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. Our method substantially outperforms existing EBMs on CIFAR-10 generation (FID 3.97 compared to 8.61), while retaining the simulation-free training of transport-based approaches away from the data manifold. Additionally, we exploit the flexibility of our method and introduce an interaction energy for diverse mode exploration. Our approach focuses on learning a static scalar potential energy -- without time conditioning, auxiliary generators, or additional networks -- marking a significant departure from recent EBM methods. We believe this simplified framework significantly advances EBM capabilities and paves the way for their broader adoption in generative modeling across diverse domains.

Title: Relation-Rich Visual Document Generator for Visual Information Extraction

Authors: Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10659
Pdf URL: https://arxiv.org/pdf/2504.10659
Copy Paste: [[2504.10659]] Relation-Rich Visual Document Generator for Visual Information Extraction(https://arxiv.org/abs/2504.10659)
Keywords: generation
Abstract: Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation efforts. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at this https URL.

Title: H-MoRe: Learning Human-centric Motion Representation for Action Analysis

Authors: Zhanbo Huang, Xiaoming Liu, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10676
Pdf URL: https://arxiv.org/pdf/2504.10676
Copy Paste: [[2504.10676]] H-MoRe: Learning Human-centric Motion Representation for Action Analysis(https://arxiv.org/abs/2504.10676)
Keywords: generation
Abstract: In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.

Title: The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, Haibo Lei, Qifang Gao, Yaqing Li, Weihua Luo, Tsing Li, Qing Wang, Yi Liu, Yang Wang, Hongyu An, Liou Zhang, Shijie Zhao, Lianhong Song, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Jing Wei, Mengyang Wang, Ruilong Guo, Qian Wang, Qingliang Liu, Yang Cheng, Davinci, Enxuan Gu, Pinxin Liu, Yongsheng Yu, Hang Hua, Yunlong Tang, Shihao Wang, Yukun Yang, Zhiyu Zhang, Yukun Yang, Jiyu Wu, Jiancheng Huang, Yifan Liu, Yi Huang, Shifeng Chen, Rui Chen, Yi Feng, Mingxi Li, Cailu Wan, Xiangji Wu, Zibin Liu, Jinyang Zhong, Kihwan Yoon, Ganzorig Gankhuyag, Shengyun Zhong, Mingyang Wu, Renjie Li, Yushen Zuo, Zhengzhong Tu, Zongang Gao, Guannan Chen, Yuan Tian, Wenhui Chen, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng, Yilin Zhang, Huan Zheng, Yanyan Wei, Wenxuan Zhao, Suiyi Zhao, Fei Wang, Kun Li, Yinggan Tang, Mengjie Su, Jae-hyeon Lee, Dong-Hyeop Son, Ui-Jin Choi, Tiancheng Shao, Yuqing Zhang, Mengcheng Ma
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10686
Pdf URL: https://arxiv.org/pdf/2504.10686
Copy Paste: [[2504.10686]] The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report(https://arxiv.org/abs/2504.10686)
Keywords: super-resolution
Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
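Note: the challenge thresholds above are stated in PSNR. For readers who want the arithmetic, the sketch below computes PSNR from the standard definition, 10·log10(MAX² / MSE), over flat lists of pixel values. It assumes 8-bit pixel values (MAX = 255) and does not reproduce the challenge's own evaluation protocol (for example any color-space conversion or border cropping), which the abstract does not specify.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel positions.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by exactly one intensity level.
reference = [10, 20, 30, 40]
estimate = [11, 21, 31, 41]
score = psnr(reference, estimate)
```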

Title: SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

Authors: Stathis Galanakis, Alexandros Lattas, Stylianos Moschoglou, Bernhard Kainz, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10716
Pdf URL: https://arxiv.org/pdf/2504.10716
Copy Paste: [[2504.10716]] SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models(https://arxiv.org/abs/2504.10716)
Keywords: generation
Abstract: Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although the recent emerging large-scale diffusion models have been proven robust in handling 3D scenes, they underperform on facial data, given their complex structure and the uncanny valley pitfalls. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model's generation capabilities in 360 head synthesis, while beating current state-of-the-art multiview diffusion models.

Title: ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models

Authors: Amirhosein Chahe, Lifeng Zhou
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2504.10757
Pdf URL: https://arxiv.org/pdf/2504.10757
Copy Paste: [[2504.10757]] ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models(https://arxiv.org/abs/2504.10757)
Keywords: generation
Abstract: Vision-language models (VLMs) show promise for autonomous driving but often lack transparent reasoning capabilities that are critical for safety. We investigate whether explicitly modeling reasoning during fine-tuning enhances VLM performance on driving decision tasks. Using GPT-4o, we generate structured reasoning chains for driving scenarios from the DriveLM benchmark with category-specific prompting strategies. We compare reasoning-based fine-tuning, answer-only fine-tuning, and baseline instruction-tuned models across multiple small VLM families (Llama 3.2, Llava 1.5, and Qwen 2.5VL). Our results demonstrate that reasoning-based fine-tuning consistently outperforms alternatives, with Llama3.2-11B-reason achieving the highest performance. Models fine-tuned with reasoning show substantial improvements in accuracy and text generation quality, suggesting explicit reasoning enhances internal representations for driving decisions. These findings highlight the importance of transparent decision processes in safety-critical domains and offer a promising direction for developing more interpretable autonomous driving systems.

Title: Power-scaled Bayesian Inference with Score-based Generative Models

Authors: Huseyin Tuna Erdinc, Yunlin Zeng, Abhinav Prakash Gahlot, Felix J. Herrmann
Subjects: cs.LG, cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2504.10807
Pdf URL: https://arxiv.org/pdf/2504.10807
Copy Paste: [[2504.10807]] Power-scaled Bayesian Inference with Score-based Generative Models(https://arxiv.org/abs/2504.10807)
Keywords: generative
Abstract: We propose a score-based generative algorithm for sampling from power-scaled priors and likelihoods within the Bayesian inference framework. Our algorithm enables flexible control over prior-likelihood influence without requiring retraining for different power-scaling configurations. Specifically, we focus on synthesizing seismic velocity models conditioned on imaged seismic. Our method enables sensitivity analysis by sampling from intermediate power posteriors, allowing us to assess the relative influence of the prior and likelihood on samples of the posterior distribution. Through a comprehensive set of experiments, we evaluate the effects of varying the power parameter in different settings: applying it solely to the prior, to the likelihood of a Bayesian formulation, and to both simultaneously. The results show that increasing the power of the likelihood up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while decreasing the prior power promotes greater structural diversity among samples. Moreover, we find that moderate scaling of the likelihood leads to a reduced shot data residual, confirming its utility in posterior refinement.
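Note: to make the effect of power scaling concrete, the sketch below works through a one-dimensional conjugate Gaussian toy case, where the posterior proportional to likelihood^λ · prior^γ has a closed form. This is only an illustration of the general idea under that Gaussian assumption. It is not the paper's score-based sampler, and the function name and numeric values are hypothetical.

```python
def power_scaled_gaussian_posterior(y, prior_mean, prior_var, noise_var,
                                    prior_power=1.0, likelihood_power=1.0):
    """Posterior mean and variance for p(x | y) ∝ N(y; x, noise_var)^λ · N(x; prior_mean, prior_var)^γ.

    Raising a Gaussian density to a power scales its precision by that power,
    so the power-scaled posterior is again Gaussian.
    """
    prior_precision = prior_power / prior_var
    likelihood_precision = likelihood_power / noise_var
    posterior_precision = prior_precision + likelihood_precision
    if posterior_precision <= 0:
        raise ValueError("combined precision must be positive")
    posterior_mean = (prior_precision * prior_mean
                      + likelihood_precision * y) / posterior_precision
    return posterior_mean, 1.0 / posterior_precision


# Toy setting: standard normal prior, unit noise, one observation y = 2.
base = power_scaled_gaussian_posterior(2.0, 0.0, 1.0, 1.0)
# Larger likelihood power: the posterior mean moves toward the observation.
sharp = power_scaled_gaussian_posterior(2.0, 0.0, 1.0, 1.0, likelihood_power=3.0)
# Smaller prior power: the posterior becomes wider (more diverse samples).
flat = power_scaled_gaussian_posterior(2.0, 0.0, 1.0, 1.0, prior_power=0.5)
```

In this toy case the trends match the abstract's qualitative findings: a stronger likelihood pulls samples toward the data, and a weaker prior increases spread.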

Title: IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

Authors: Janna Bruner, Amit Moryossef, Lior Wolf
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10822
Pdf URL: https://arxiv.org/pdf/2504.10822
Copy Paste: [[2504.10822]] IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism(https://arxiv.org/abs/2504.10822)
Keywords: generative
Abstract: Sign languages are dynamic visual languages that involve hand gestures, in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.

Title: OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Authors: Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10825
Pdf URL: https://arxiv.org/pdf/2504.10825
Copy Paste: [[2504.10825]] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding(https://arxiv.org/abs/2504.10825)
Keywords: generation
Abstract: In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
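The adaptive control strategy in this abstract can be pictured as a per-modality role switch. The following is a minimal sketch in plain Python, not the authors' implementation: the function `prepare_inputs`, the role strings, the `TASK_ROLES` table, and the linear noise blend are all illustrative assumptions. The abstract only states that each modality acts as either a generation modality or a conditioning modality.

```python
import random

# Hypothetical role assignments for the three functionalities in the abstract.
# Roles of modalities the abstract does not mention are assumed to be "generate".
TASK_ROLES = {
    "text_to_video": {"rgb": "generate", "depth": "generate",
                      "canny": "generate", "segmentation": "generate"},
    "video_understanding": {"rgb": "condition", "depth": "generate",
                            "canny": "generate", "segmentation": "generate"},
    "depth_conditioned": {"rgb": "generate", "depth": "condition",
                          "canny": "generate", "segmentation": "generate"},
}


def prepare_inputs(latents, roles, noise_level, rng):
    """Noise only the modalities being generated; pass conditions through clean.

    The linear blend is a stand-in for a real diffusion noise schedule.
    """
    prepared = {}
    for name, values in latents.items():
        if roles[name] == "generate":
            prepared[name] = [
                (1.0 - noise_level) * v + noise_level * rng.gauss(0.0, 1.0)
                for v in values
            ]
        else:
            prepared[name] = list(values)
    return prepared


rng = random.Random(0)
latents = {name: [0.1, 0.2, 0.3] for name in TASK_ROLES["text_to_video"]}
noisy = prepare_inputs(latents, TASK_ROLES["video_understanding"], 0.5, rng)
```

Under these assumptions, switching tasks only changes the role table, which is one way a single model could cover generation and understanding.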

Title: LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

Authors: Hengyu Shi, Junhao Su, Huansheng Ning, Xiaoming Wei, Jialin Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10829
Pdf URL: https://arxiv.org/pdf/2504.10829
Copy Paste: [[2504.10829]] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation(https://arxiv.org/abs/2504.10829)
Keywords: generation, generative
Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and to generate a coarse layout with the LLM. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as DeepSeek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.
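To make the idea of a "standardized serialized format" concrete, here is a minimal sketch of a round-trippable text encoding for a layout. The abstract does not specify the format LayoutCoT uses, so the `Element` fields, the line syntax, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    label: str  # assumed to contain no whitespace
    left: int
    top: int
    width: int
    height: int


def serialize_layout(elements, canvas_width, canvas_height):
    """Render a layout as one text line per element, in a fixed field order."""
    lines = [f"canvas width={canvas_width} height={canvas_height}"]
    for e in elements:
        lines.append(
            f"{e.label} left={e.left} top={e.top} "
            f"width={e.width} height={e.height}"
        )
    return "\n".join(lines)


def parse_layout(text):
    """Inverse of serialize_layout, e.g. for reading a layout back from model output."""
    header, *rows = text.strip().splitlines()
    canvas = dict(field.split("=") for field in header.split()[1:])
    elements = []
    for row in rows:
        label, *fields = row.split()
        values = {k: int(v) for k, v in (f.split("=") for f in fields)}
        elements.append(Element(label, **values))
    return elements, int(canvas["width"]), int(canvas["height"])
```

A fixed field order and integer coordinates keep the text predictable, which makes retrieved exemplars and generated layouts directly comparable.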

Title: Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task

Authors: Aviral Chharia, Tianyu Ren, Tomotake Furuhata, Kenji Shimada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10880
Pdf URL: https://arxiv.org/pdf/2504.10880
Copy Paste: [[2504.10880]] Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task(https://arxiv.org/abs/2504.10880)
Keywords: generation
Abstract: Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation treating violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) absence of standardized baselines, and (iv) limited scalability from the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings, incorporating four violations, four workers, 14 objects, and challenging conditions like occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding and synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries. Project Website: this https URL

Title: Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models

Authors: Karan Jain, Mohammad Nayeem Teli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10883
Pdf URL: https://arxiv.org/pdf/2504.10883
Copy Paste: [[2504.10883]] Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models(https://arxiv.org/abs/2504.10883)
Keywords: generation
Abstract: Diffusion models have recently achieved state-of-the-art performance on many image generation tasks. However, most models require significant computational resources to achieve this. This becomes apparent in medical image synthesis due to the 3D nature of medical datasets such as CT scans, MRIs, and electron microscopy. In this paper, we propose a novel architecture for memory-efficient, single-GPU training of diffusion models on high-dimensional medical datasets. The proposed model is built using an invertible UNet architecture with invertible attention modules. This leads to the following two contributions: 1. denoising diffusion models and thus enabling memory usage to be independent of the dimensionality of the dataset, and 2. reducing the energy usage during training. While this new model can be applied to a multitude of image generation tasks, we showcase its memory efficiency on the 3D BraTS2020 dataset, leading to up to a 15% decrease in peak memory consumption during training, with results comparable to SOTA while maintaining image quality.
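The memory argument for invertible architectures rests on recomputing a layer's input from its output instead of storing it for the backward pass. The sketch below shows an additive coupling block, a standard invertible construction. The abstract does not say which construction the authors use, so this is an illustrative assumption, and `f` and `g` are hypothetical stand-ins for learned sub-networks.

```python
import math


def f(x):
    # Hypothetical sub-network; it does not need to be invertible itself.
    return [math.tanh(v) for v in x]


def g(x):
    # Second hypothetical sub-network.
    return [0.5 * v * v for v in x]


def forward(x1, x2):
    """Additive coupling: each half is shifted by a function of the other half."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recover the inputs from the outputs by undoing the shifts in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

Because `inverse` reconstructs the inputs up to floating-point error, a training loop built from such blocks can discard intermediate activations and recompute them on demand, trading extra compute for lower peak memory.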

Title: PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving

Authors: Zeyu Zhang, Zijian Chen, Zicheng Zhang, Yuze Sun, Yuan Tian, Ziheng Jia, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10885
Pdf URL: https://arxiv.org/pdf/2504.10885
Copy Paste: [[2504.10885]] PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving(https://arxiv.org/abs/2504.10885)
Keywords: generation
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities across a wide range of multimodal tasks, achieving ever-increasing performance on various evaluation benchmarks. However, existing benchmarks are typically static and often overlap with pre-training datasets, leading to fixed complexity constraints and substantial data contamination issues. Meanwhile, manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, leading to reliability and reproducibility issues. To address these problems, we propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks. Specifically, the OVPG pipeline consists of a raw material sampling module, a visual content generation module, and a puzzle rule design module, which ensures that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of LMMs. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples. It features six carefully designed puzzle tasks targeting three core LMM competencies: visual recognition, logical reasoning, and context understanding. Unlike static benchmarks that quickly become outdated, PuzzleBench enables ongoing dataset refreshing through OVPG and a rich set of open-ended puzzle designs, allowing seamless adaptation to the evolving capabilities of LMMs.

Title: InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

Authors: Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Xiu Li
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10905
Pdf URL: https://arxiv.org/pdf/2504.10905
Copy Paste: [[2504.10905]] InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation(https://arxiv.org/abs/2504.10905)
Keywords: generation
Abstract: Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

Title: An Efficient and Mixed Heterogeneous Model for Image Restoration

Authors: Yubin Gu, Yuan Meng, Kaihang Zheng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10967
Pdf URL: https://arxiv.org/pdf/2504.10967
Copy Paste: [[2504.10967]] An Efficient and Mixed Heterogeneous Model for Image Restoration(https://arxiv.org/abs/2504.10967)
Keywords: restoration
Abstract: Image restoration (IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at this https URL.

Title: AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

Authors: Yihang Liu, Lianghua He, Ying Wen, Longzhen Yang, Hongzhou Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10972
Pdf URL: https://arxiv.org/pdf/2504.10972
Copy Paste: [[2504.10972]] AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images(https://arxiv.org/abs/2504.10972)
Keywords: restoration
Abstract: Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.

Title: ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2504.10983
Pdf URL: https://arxiv.org/pdf/2504.10983
Copy Paste: [[2504.10983]] ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings(https://arxiv.org/abs/2504.10983)
Keywords: generation, generative
Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from the semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the multichain protein design scenario. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
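Flow matching trains a velocity field that transports noise samples to data samples. The sketch below shows the regression target for a straight-line path between a noise vector `x0` and a data embedding `x1`, plus a single Euler step of the kind that straightened (reflowed) paths make viable. The straight-line path and all function names are assumptions for illustration; the abstract does not give ProtFlow's exact formulation.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Along a straight-line path the velocity is constant: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def single_step_sample(predict_velocity, x0):
    """One Euler step over the whole interval [0, 1]: x1 is approximated by x0 + v(x0, 0)."""
    v = predict_velocity(x0, 0.0)
    return [a + b for a, b in zip(x0, v)]
```

If the learned paths are close to straight, the velocity barely changes along a trajectory, so one Euler step lands near the endpoint. That is the intuition behind single-step generation after reflow.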

Title: Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation

Authors: Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.10987
Pdf URL: https://arxiv.org/pdf/2504.10987
Copy Paste: [[2504.10987]] Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation(https://arxiv.org/abs/2504.10987)
Keywords: generation
Abstract: Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.
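The "carefully calibrated statistical noise" in this abstract can be illustrated with the Laplace mechanism applied to a histogram of counts. This is a generic differential-privacy sketch, not the method proposed in the paper: the abstract does not name a mechanism, and the function names and the sensitivity assumption (each row falls in exactly one bin, neighbouring datasets differ by one row) are ours.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_histogram(counts, epsilon, rng):
    """Release noisy counts under epsilon-differential privacy.

    Assumes L1 sensitivity 1: adding or removing one row changes one bin by 1.
    """
    scale = 1.0 / epsilon  # noise scale = sensitivity / epsilon
    return {key: count + laplace_noise(scale, rng) for key, count in counts.items()}


rng = random.Random(0)
released = private_histogram({"public": 120, "private": 45}, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives a stronger privacy guarantee but a larger noise scale, which is the privacy-versus-quality trade-off the abstract refers to.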

Title: AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era

Authors: Chenyang Zhu, Xing Zhang, Yuyang Sun, Ching-Chun Chang, Isao Echizen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11015
Pdf URL: https://arxiv.org/pdf/2504.11015
Copy Paste: [[2504.11015]] AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era(https://arxiv.org/abs/2504.11015)
Keywords: generation
Abstract: Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in the anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found at this https URL.

Title: Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

Authors: Andrea Simonelli, Norman Müller, Peter Kontschieder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11024
Pdf URL: https://arxiv.org/pdf/2504.11024
Copy Paste: [[2504.11024]] Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation(https://arxiv.org/abs/2504.11024)
Keywords: generation
Abstract: The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting. The project web-page is available at this https URL.

Title: Defending Against Frequency-Based Attacks with Diffusion Models

Authors: Fatemeh Amerehi, Patrick Healy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11034
Pdf URL: https://arxiv.org/pdf/2504.11034
Copy Paste: [[2504.11034]] Defending Against Frequency-Based Attacks with Diffusion Models(https://arxiv.org/abs/2504.11034)
Keywords: generative
Abstract: Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.

Title: UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques

Authors: Pedro Diaz-Garcia, Felix Escalona, Miguel Cazorla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11063
Pdf URL: https://arxiv.org/pdf/2504.11063
Copy Paste: [[2504.11063]] UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques(https://arxiv.org/abs/2504.11063)
Keywords: generative
Abstract: The purpose of this paper is to explore the use of underwater image enhancement techniques to improve keypoint detection and matching. By applying advanced deep learning models, including generative adversarial networks and convolutional neural networks, we aim to find the best method which improves the accuracy of keypoint detection and the robustness of matching algorithms. We evaluate the performance of these techniques on various underwater datasets, demonstrating significant improvements over traditional methods.

Title: Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

Authors: Jiaxin Huang, Sheng Miao, BangBang Yang, Yuewen Ma, Yiyi Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11092
Pdf URL: https://arxiv.org/pdf/2504.11092
Copy Paste: [[2504.11092]] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting(https://arxiv.org/abs/2504.11092)
Keywords: generative
Abstract: Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views, that is, by synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.

Title: Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

Authors: Jiangtao Liu, Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.11106
Pdf URL: https://arxiv.org/pdf/2504.11106
Copy Paste: [[2504.11106]] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models(https://arxiv.org/abs/2504.11106)
Keywords: generation, generative
Abstract: Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45% and an ASR-1 of 21% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.
摘要：文本到图像（T2i）一代的最新进展显着增强了生成图像的现实主义和创造力。但是，如此强大的生成能力会带来与不适当或有害内容的产生有关的风险。现有的防御机制，包括及时的检查员和事后图像检查器，容易受到复杂的对抗性攻击。在这项工作中，我们提出了TCBS-Attack，这是一种基于查询的新型Black-Box越狱攻击，搜索位于文本和图像检查器定义的决策边界附近的令牌。通过迭代优化这些边界附近的令牌，TCBS-ITSACK生成语义相干的对抗提示，能够绕过T2I模型中多个防御层。广泛的实验表明，我们的方法始终超过各种T2I模型的最先进的越狱攻击，包括经过安全培训的开源模型和诸如DALL-E 3的商业在线服务。TCBS-ESTACK实现45 \％的ASR-4和ASR-1的ASR-1和ASR-1的ASR-1和21 \％的ASR-1 \％\％在越狱全chain T2I模型上，有明显的基础方法，该方法是基础的。

Title: Taming Consistency Distillation for Accelerated Human Image Animation

Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yujie Wei, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11143
Pdf URL: https://arxiv.org/pdf/2504.11143
Copy Paste: [[2504.11143]] Taming Consistency Distillation for Accelerated Human Image Animation(https://arxiv.org/abs/2504.11143)
Keywords: generation
Abstract: Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements to improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary light-weight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss to centre on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.
摘要：视频扩散模型推动了人类图像动画的最新进展，但它们依赖众多迭代的剥离步骤会导致高推理成本和缓慢的速度。直观的解决方案涉及采用一致性模型，该模型通过一致性蒸馏作为有效的加速度范式。但是，仅在人类图像动画中采用这种策略通常会导致质量下降，包括视觉模糊，运动降解和面部扭曲，尤其是在动态区域。在本文中，我们提出了DancelCM方法，并通过多种增强功能进行补充，以提高低步骤的视觉质量和运动连续性：（1）分段一致性蒸馏和辅助轻重量头部以纳入真实视频潜在的监督，从而减轻了来自单一的全面特殊生成的累积误差；（2）以运动为中心运动区域的运动损失，并明确注入面部忠诚特征，以提高面部真实性。广泛的定性和定量实验表明，DancelCM的结果与仅2-4个推理步骤的最新视频扩散模型相当，从而大大减轻了不影响视频质量的推理负担。代码和模型将公开可用。

Title: TerraMind: Large-Scale Generative Multimodality for Earth Observation

Authors: Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11171
Pdf URL: https://arxiv.org/pdf/2504.11171
Copy Paste: [[2504.11171]] TerraMind: Large-Scale Generative Multimodality for Earth Observation(https://arxiv.org/abs/2504.11171)
Keywords: generative
Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
摘要：我们提出了Terramind，这是第一个对地球观测（EO）的任何生成的多模式基础模型。与其他多模型不同，Terramind在跨模态的双尺度表示形式上介绍了令牌级别和像素级数据。在令牌层面上，Terramind编码高级上下文信息以学习跨模式关系，而在像素级别上，Terramind利用细粒度的表示来捕获关键的空间细微差别。我们以全球大规模数据集的九种地理空间方式鉴定了Terramind。 In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for像pangea一样。预训练的数据集，模型权重和我们的代码是在允许许可下开源的。

Title: Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

Authors: Shangyu Liu, Zhenzhe Zheng, Xiaoyao Huang, Fan Wu, Jie Wu
Subjects: cs.LG, cs.AI, cs.DC, cs.IR
Abstract URL: https://arxiv.org/abs/2504.11197
Pdf URL: https://arxiv.org/pdf/2504.11197
Copy Paste: [[2504.11197]] Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance(https://arxiv.org/abs/2504.11197)
Keywords: generation
Abstract: Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm to avoid frequent output synchronization between the cloud and device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement of DRAGON: up to 1.9x greater gains over the standalone SLM compared to centralized RAG, a substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.
摘要：小语言模型（SLM）支持在资源受限的边缘设备上有效部署，但其有限的容量损害了推理性能。检索增强的生成（RAG）是通过集成外部数据库来增强模型性能的有前途的解决方案，而无需大量的内部设备模型再培训。但是，大规模的公共数据库和特定于用户的私人上下文文档通常位于云上，并且分别位于设备上，而现有的RAG实现主要集中。为了弥合这一差距，我们提出了龙（Dragon），这是一个分布式的抹布框架，通过一般和个人知识都在没有泄漏文件隐私的风险的情况下增强了设备的SLM。具体而言，Dragon将多文件抹布分解为多个平行的令牌生成过程，在云和设备上独立执行，并采用了新设计的投机聚合，这是一种双侧投机算法，以避免云和设备之间频繁的输出同步。进一步介绍了一种新的调度算法，以根据实时网络条件识别最佳聚合端。对现实世界硬件测试床的评估表明，与集中式抹布相比，龙的增长比独立SLM的增长更大，大幅度降低了延迟的延迟，并且可以忽略不计的时间（TTFT）开销。

Title: Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution

Authors: Xinning Chai, Yao Zhang, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11271
Pdf URL: https://arxiv.org/pdf/2504.11271
Copy Paste: [[2504.11271]] Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution(https://arxiv.org/abs/2504.11271)
Keywords: super-resolution
Abstract: Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA's success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at this https URL.
摘要：卷积神经网络（CNN）已被广泛用于有效的图像超分辨率。但是，对于基于CNN的方法，性能提高通常需要更深层的网络和更大的特征图，从而增加复杂性和推理成本。受洛拉（Lora）在微调大语模型中的成功启发的启发，我们探索了其对轻量级模型的应用，并提出了蒸馏措施的卷积低级别适应性（DSCLORA），从而改善了模型性能而不会提高建筑复杂性或推荐成本。具体而言，我们通过用建议的SCONVLB模块替换平台模块并将库洛拉层整合到Pixel shuffle块及其先前的卷积层中，将弯曲曲线整合到有效的SR网络跨度中。 DSCLORA利用低名分分解来进行参数更新，并采用基于空间特征的知识蒸馏策略将二阶统计信息从教师模型（预训练的跨度）转移到学生模型（我们的模型）（我们的）。该方法保留了轻质模型的核心知识，并在某些条件下促进了最佳解决方案发现。基准数据集的实验表明，DSClora在跨度上改善了PSNR和SSIM，同时保持其效率和竞争图像质量。值得注意的是，DSClora在NTIRE 2025高效超分辨率挑战的总体表现中排名第一。我们的代码和模型可在此HTTPS URL上公开提供。
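
The low-rank parameter update that the abstract above refers to can be written out for a plain weight matrix. This is a minimal sketch of the generic LoRA formulation, W + (alpha / r) * B @ A, and not the DSCLoRA or ConvLoRA code: the matrix sizes, the values in `B_trained`, and the helper names are illustrative assumptions, and a convolutional variant would apply the same idea to a reshaped kernel.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 3x3 pretrained weight (identity, for readability).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rank-1 factors: A is (r x in_features), B is (out_features x r).
A = [[1.0, 2.0, 3.0]]
B_zero = [[0.0], [0.0], [0.0]]       # common initialisation: B = 0, so the update starts at zero
B_trained = [[0.5], [0.0], [-0.5]]   # hypothetical values after some training steps

W_init = lora_weight(W, A, B_zero, alpha=1.0)
W_adapted = lora_weight(W, A, B_trained, alpha=1.0)

full_params = len(W) * len(W[0])                                        # entries in W
lora_params = len(A) * len(A[0]) + len(B_trained) * len(B_trained[0])   # entries in A and B
print(W_adapted, full_params, lora_params)
```

Because the product B @ A can be added into W once training is done, the adapted layer has the same shape and inference cost as the original, which is consistent with the abstract's claim of no added inference cost. The parameter saving is small for this 3x3 toy and grows with layer size, since A and B scale with r * (in + out) rather than in * out.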

Title: UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

Authors: Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11289
Pdf URL: https://arxiv.org/pdf/2504.11289
Copy Paste: [[2504.11289]] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer(https://arxiv.org/abs/2504.11289)
Keywords: generative
Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at this https URL.
摘要：该报告介绍了Unianiame-Dit，这是一个高级项目，利用开源WAN2.1模型的尖端和强大功能来实现一致的人类图像动画。具体而言，为了保留原始WAN2.1模型的可靠生成能力，我们实施了低级适应性（LORA）技术来微调最小参数集，从而大大减少了训练记忆开销。由多个堆叠的3D卷积层组成的轻质姿势编码器旨在编码驾驶姿势的运动信息。此外，我们采用了一个简单的串联操作，将参考外观整合到模型中，并将参考图像的姿势信息整合为增强的姿势比对。实验结果表明，我们的方法在视觉上表现出来，并且在时间上具有一致的高保真动画。在推理期间，受过480p（832x480）视频的培训，表现出强大的概括能力，可以无缝向上向720p（1280x720）。培训和推理代码可在此HTTPS URL上公开获得。

Title: Autoregressive Distillation of Diffusion Transformers

Authors: Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11295
Pdf URL: https://arxiv.org/pdf/2504.11295
Copy Paste: [[2504.11295]] Autoregressive Distillation of Diffusion Transformers(https://arxiv.org/abs/2504.11295)
Keywords: generation
Abstract: Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256. Moreover, ARD reaches an FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: this https URL.
摘要：具有变压器体系结构的扩散模型表明，在产生高保真图像和可扩展性方面具有有希望的能力。但是，合成所需的迭代抽样过程非常密集。一项工作重点是将解决方案蒸馏到概率流量中，成几步的学生模型。然而，现有方法受到对最近的剥离样本的依赖的限制，使它们容易受到暴露偏见的影响。为了解决这一限制，我们提出了自回旋蒸馏（ARD），这是一种新型方法，利用了颂歌的历史轨迹来预测未来的步骤。 ARD提供了两个关键的好处：1）它通过利用一种预测的历史轨迹来减轻暴露偏见，该轨迹不易累积错误，而2）它利用了ODE轨迹的先前历史作为更有效的粗粒信息来源。 ARD通过添加以代币的时间嵌入来标记轨迹历史记录中的每个输入，并采用块良好的因果关注掩码来修改教师变压器的体系结构。此外，仅在较低的变压器层中纳入历史输入会提高性能和效率。我们验证了ARD在ImageNet和T2i合成上的类调节生成中的有效性。与基线方法相比，我们的模型可实现$ 5 \ times $减少FID降解，同时仅需要1.1 \％额外的拖鞋256。此外，ARD仅在4个步骤中达到Imagenet-256上的FID为1.84，并且比老师的FID相比，FID的下降最小，超过了公开可用的1024p文本到图像蒸馏模型。项目页面：此HTTPS URL。
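
The block-wise causal attention mask mentioned in the abstract above can be sketched in a few lines. This is a generic construction, not the ARD code: it assumes that tokens are grouped into equal-sized consecutive blocks (for example one block per trajectory step, which is a guess about the layout), and `block_causal_mask` is a hypothetical helper name.

```python
def block_causal_mask(num_blocks, block_size):
    """Build a boolean attention mask over num_blocks * block_size tokens.

    mask[q][k] is True when query token q may attend to key token k.
    A token sees every token in its own block and in all earlier blocks,
    but nothing from later blocks.
    """
    n = num_blocks * block_size
    return [[(k // block_size) <= (q // block_size) for k in range(n)]
            for q in range(n)]


mask = block_causal_mask(num_blocks=3, block_size=2)
for row in mask:
    # '1' = attention allowed, '.' = masked out
    print("".join("1" if allowed else "." for allowed in row))
```

Unlike a token-level causal mask, attention inside a block is bidirectional, and causality holds only between blocks.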

Title: Looking beyond the next token

Authors: Abitha Thankaraj, Yiding Jiang, J. Zico Kolter, Yonatan Bisk
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.11336
Pdf URL: https://arxiv.org/pdf/2504.11336
Copy Paste: [[2504.11336]] Looking beyond the next token(https://arxiv.org/abs/2504.11336)
Keywords: generation
Abstract: The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans' natural writing and reasoning process, where goals are typically known before the exact argument or phrasings. While this mismatch has been well studied in the literature, the working assumption has been that architectural changes are needed to address this mismatch. We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process, and does not require any other changes to the architecture or training infrastructure. We demonstrate that this technique, Trelawney, and the inference algorithms derived from it allow us to improve performance on several key benchmarks that span planning, algorithmic reasoning, and story generation tasks. Finally, our method naturally enables the generation of long-term goals at no additional cost. We investigate how using the model's goal-generation capability can further improve planning and reasoning. Additionally, we believe Trelawney could potentially open doors to new capabilities beyond the current language modeling paradigm.
摘要：因果语言模型培训的结构假设可以从先前的上下文中准确预测每个令牌。这与人类的自然写作和推理过程形成鲜明对比，该过程通常在确切的论点或短语之前就知道目标。尽管这种不匹配在文献中进行了很好的研究，但工作的假设是需要建筑变化来解决这一不匹配。我们认为，重新安排和处理培训数据序列可以使模型更准确地模仿真正的数据生成过程，并且不需要对体系结构或培训基础架构进行任何其他更改。我们证明了这种技术，Trelawney和从中得出的推理算法使我们能够在跨越计划，算法推理和故事生成任务的几个关键基准上的性能。最后，我们的方法自然可以使长期目标无需额外成本。我们研究使用模型的目标生成能力如何进一步改善计划和推理。此外，我们认为Trelawney有可能为超出当前语言建模范式的新功能打开大门。

Title: Seedream 3.0 Technical Report

Authors: Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11346
Pdf URL: https://arxiv.org/pdf/2504.11346
Copy Paste: [[2504.11346]] Seedream 3.0 Technical Report(https://arxiv.org/abs/2504.11346)
Keywords: generation
Abstract: We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that well align with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
摘要：我们提出了SeedReam 3.0，这是一种高性能的中文双语图像生成基础模型。我们开发了几项技术改进，以应对种子Ream 2.0中的现有挑战，包括与复杂的提示，细粒度的排版，次优的视觉美学和忠诚度以及有限的图像分辨率保持一致。具体而言，SeedReam 3.0的进步源于整个管道的改进，从数据构建到模型部署。在数据层上，我们使用缺陷感知的培训范式和双轴协作数据采样框架加倍数据集。此外，我们采用了几种有效的技术，例如在训练阶段中的混合分辨率训练，跨模式绳，表示对准损失和分辨率意识到的时间段采样。在训练后阶段，我们利用SFT中的多元化美学标题，以及具有缩放率的基于VLM的奖励模型，从而实现了与人类偏好很好的输出。此外，SeedReam 3.0先驱者是一种新型的加速度范式。通过采用一致的噪声期望和重要性感知的时间段采样，我们在保持图像质量的同时达到了4到8倍的速度。 SeedReam 3.0表现出比Seedream 2.0的显着改善：它增强了整体功能，尤其是对于复杂的汉字呈现文本介绍，这对于专业排版的生成很重要。此外，它提供了天然的高分辨率输出（最高2K），从而使其能够以高视觉质量生成图像。

Title: DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation

Authors: Soyoung Yoo, Namwoo Kang
Subjects: cs.CV, physics.app-ph
Abstract URL: https://arxiv.org/abs/2504.11347
Pdf URL: https://arxiv.org/pdf/2504.11347
Copy Paste: [[2504.11347]] DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation(https://arxiv.org/abs/2504.11347)
Keywords: generation, generative
Abstract: Data-driven design is emerging as a powerful strategy to accelerate engineering innovation. However, its application to vehicle wheel design remains limited due to the lack of large-scale, high-quality datasets that include 3D geometry and physical performance metrics. To address this gap, this study proposes a synthetic design-performance dataset generation framework using generative AI. The proposed framework first generates 2D rendered images using Stable Diffusion, and then reconstructs the 3D geometry through 2.5D depth estimation. Structural simulations are subsequently performed to extract engineering performance data. To further expand the design and performance space, topology optimization is applied, enabling the generation of a more diverse set of wheel designs. The final dataset, named DeepWheel, consists of over 6,000 photo-realistic images and 900 structurally analyzed 3D models. This multi-modal dataset serves as a valuable resource for surrogate model training, data-driven inverse design, and design space exploration. The proposed methodology is also applicable to other complex design domains. The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and is available at this https URL.
摘要：数据驱动的设计正在成为加速工程创新的强大策略。但是，由于缺乏包括3D几何形状和身体性能指标的大型高质量数据集，其在车轮设计中的应用仍然受到限制。为了解决这一差距，本研究提出了使用生成AI的合成设计 - 性能数据集生成框架。提出的框架首先使用稳定的扩散生成2D渲染图像，然后通过2.5D深度估计重建3D几何形状。随后进行结构模拟以提取工程性能数据。为了进一步扩大设计和性能空间，应用了拓扑优化，从而可以生成更多样化的车轮设计。最终数据集（名为DeepWheel）由6,000多个照片现实图像和900个结构分析的3D模型组成。该多模式数据集是替代模型培训，数据驱动的逆设计和设计空间探索的宝贵资源。所提出的方法也适用于其他复杂的设计域。该数据集以创意共享归因 - 非商业4.0国际（CC BY-NC 4.0）发布，可在此HTTPS URL上找到

Title: Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model

Authors: Liu Yang, Huiyu Duan, Yucheng Zhu, Xiaohong Liu, Lu Liu, Zitong Xu, Guangji Ma, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11379
Pdf URL: https://arxiv.org/pdf/2504.11379
Copy Paste: [[2504.11379]] Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model(https://arxiv.org/abs/2504.11379)
Keywords: generation
Abstract: $360^{\circ}$ omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad $360^{\circ}$ Field-of-View (FoV) of ODIs. To bridge this gap, we construct Any2Omni, the first comprehensive ODI generation-editing dataset, comprising 60,000+ training samples covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an Omni model for Omni-directional image generation and editing (Omni$^2$), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni$^2$ model for both the ODI generation and editing tasks.
摘要：$ 360^{\ circ} $ Omniredirectional Images（ODI）最近引起了广泛关注，并广泛用于各种虚拟现实（VR）和增强现实（AR）应用程序中。但是，捕获此类图像很昂贵，需要专业设备，从而使ODI合成越来越重要。尽管常见的2D图像生成和编辑方法正在迅速发展，但由于独特的格式和ODI的宽敞360 $^{\ circ} $ odis的宽360 $^{\ circ}，这些模型在生成或编辑ODI时努力产生令人满意的结果。为了弥合此差距，我们构建\ textbf {\ textit {yy2omni}}，第一个全面的ODI生成编辑数据集包含60,000多个培训数据，涵盖了多种输入条件以及最多9个ODI生成和编辑任务。 Built upon Any2Omni, we propose an \textbf{\underline{Omni}} model for \textbf{\underline{Omni}}-directional image generation and editing (\textbf{\textit{Omni$^2$}}), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model.广泛的实验证明了拟议的Omni $^2 $模型对ODI生成和编辑任务的优势和有效性。

Title: ADT: Tuning Diffusion Models with Adversarial Supervision

Authors: Dazhong Shen, Guanglu Song, Yi Zhang, Bingqi Ma, Lujundong Li, Dongzhi Jiang, Zhuofan Zong, Yu Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11423
Pdf URL: https://arxiv.org/pdf/2504.11423
Copy Paste: [[2504.11423]] ADT: Tuning Diffusion Models with Adversarial Supervision(https://arxiv.org/abs/2504.11423)
Keywords: generation
Abstract: Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergence hinders the alignment between inference and training data distributions, due to potential prediction biases and error accumulation. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), by simulating the inference process during optimization and aligning the final outputs with training data by adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3) demonstrate that ADT significantly improves both distribution alignment and image quality.
摘要：扩散模型通过逆转向前的no脉过程以近似真实的数据分布来实现出色的图像生成。在训练过程中，这些模型可以在单个正向通过中预测来自真实样品的噪声版本的扩散得分，而推理则需要从白噪声开始进行迭代降解。由于潜在的预测偏见和累积误差积累，这种训练推导差异阻碍了推理和训练数据分布之间的一致性。为了解决这个问题，我们提出了一个直观但有效的微调框架，称为对抗扩散调整（ADT），通过刺激优化过程中的推理过程，并通过对抗性监督将最终输出与培训数据对齐。具体来说，为了实现强大的对抗训练，ADT具有暹罗网络歧视器，具有固定的预训练的主链和轻巧的可训练参数，并结合了图像到图像的抽样策略，以平滑歧视性困难，并保留原始扩散损失，以防止歧视者黑客攻击。此外，我们仔细限制了沿推理路径向后传播梯度的向后流动路径，而不会产生内存过载或梯度爆炸。最后，对稳定扩散模型（V1.5，XL和V3）进行的广泛实验表明，ADT显着改善了分布比对和图像质量。

Title: Elucidating the Design Space of Multimodal Protein Language Models

Authors: Cheng-Yen (Wesley) Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.11454
Pdf URL: https://arxiv.org/pdf/2504.11454
Copy Paste: [[2504.11454]] Elucidating the Design Space of Multimodal Protein Language Models(https://arxiv.org/abs/2504.11454)
Keywords: generation, generative
Abstract: Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve the structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and performing on par with specialized folding models.
摘要：多模式蛋白质语言模型（PLM）整合了序列和基于令牌的结构信息，是蛋白质建模，生成和设计的强大基础。但是，将3D结构依赖于离散代币会导致对细粒结构细节和相关性的忠诚度造成的大幅损失。在本文中，我们系统地阐明了多模式PLM的设计空间以克服其局限性。我们将PLMS的令牌化损失和不准确的结构令牌预测视为主要瓶颈。为了解决这些问题，我们提出的设计空间涵盖了改进的生成建模，结构感知的体系结构和代表性学习以及数据探索。我们的进步方法方法是细粒度的监督，表明基于令牌的多模式PLM可以实现强大的结构建模。有效的设计方法显着提高了结构产生的多样性，尤其是在PDB测试集上将RMSD从5.52减少到2.36，甚至超过3B基准，并且与专门的折叠模型相同。

Title: SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

Authors: Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11455
Pdf URL: https://arxiv.org/pdf/2504.11455
Copy Paste: [[2504.11455]] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL(https://arxiv.org/abs/2504.11455)
Keywords: generation
Abstract: This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training could lead to significant improvements on generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image could be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at this https URL.
摘要：这项工作介绍了Simpleear，这是一种香草自回归视觉生成框架，没有复杂的体系硬化修改。通过仔细探索训练和推理优化，我们证明了：1）仅使用0.5b参数，我们的模型可以生成具有高忠诚度的1024x1024分辨率图像，并在挑战文本对图像基准的挑战，例如Geneval上的0.59和DPG上的79.66; 2）受监督的微调（SFT）和小组相对政策优化（GRPO）培训都可能导致对发电美学的显着改善和及时的一致性； 3）当使用VLLM等推理加速技术进行优化时，Simplear生成1024x1024图像的时间可以减少到14秒左右。通过分享这些发现并开源代码，我们希望揭示自回归视觉生成的潜力，并鼓励更多地参与该研究领域。代码可在此HTTPS URL上找到。
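
Group Relative Policy Optimization (GRPO), which the abstract above names as one of its training stages, scores each sampled output relative to the other samples drawn for the same prompt. The sketch below shows only that advantage normalisation step in its commonly described form and is not the SimpleAR training code: the reward values are made up, and the use of the population standard deviation with a small `eps` is an assumption, since implementations differ on this detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat the group average get a positive learning signal and the rest
    get a negative one. No separate value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption, see lead-in
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-model scores for four images generated from one prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)

# When every sample scores the same, there is no signal to learn from.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

The `eps` term keeps the division defined when all rewards in a group are identical, in which case every advantage is zero.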

Title: Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11457
Pdf URL: https://arxiv.org/pdf/2504.11457
Copy Paste: [[2504.11457]] Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception(https://arxiv.org/abs/2504.11457)
Keywords: generation, generative
Abstract: With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at this https URL.
摘要：随着图像产生的成功，由于像素生成提供了统一的感知接口，因此越来越多地采用了生成扩散模型。但是，直接重新利用生成的剥离过程的判别目标揭示了以前很少解决的关键差距。如果最终分布仍然合理，则生成模型可以忍受中间抽样误差，但是歧视性任务需要严格的精度，这在挑战多模式任务（例如引用图像分割）中证明了这一点。在这一差距的推动下，我们分析并增强了生成扩散过程和感知任务之间的一致性，重点是在denosing过程中的感知质量如何发展。我们发现：（1）较早的DeNoising步骤对感知质量的贡献不成比例，促使我们提出了反映不同时间步长的量身定制的学习目标；（2）后来的DeNoising步骤显示出意外的感知降解，突出了我们对扩散量扩散数据的数据的敏感性；（3）生成过程唯一地启用了交互性，用作可控制的用户界面，适用于多轮交互中的校正提示。我们的见解显着改善了基于扩散的感知模型，而没有建筑变化，在深度估计，参考图像细分和通才感知任务上实现最新性能。可在此HTTPS URL上找到代码。