2025-05-23

Title: Generative AI for Autonomous Driving: A Review

Authors: Katharina Winter, Abhishek Vivekanandan, Rupert Polley, Yinzhe Shen, Christian Schlauch, Mohamed-Khalil Bouzidi, Bojan Derajic, Natalie Grabowsky, Annajoyce Mariani, Dennis Rochau, Giovanni Lucente, Harsh Yadav, Firas Mualla, Adam Molin, Sebastian Bernhard, Christian Wirth, Ömer Şahin Taş, Nadja Klein, Fabian B. Flohr, Hanno Gottschalk
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.15863
Pdf URL: https://arxiv.org/pdf/2505.15863
Copy Paste: [[2505.15863]] Generative AI for Autonomous Driving: A Review(https://arxiv.org/abs/2505.15863)
Keywords: generation, generative
Abstract: Generative AI (GenAI) is rapidly advancing the field of Autonomous Driving (AD), extending beyond traditional applications in text, image, and video generation. We explore how generative models can enhance automotive tasks, such as static map creation, dynamic scenario generation, trajectory forecasting, and vehicle motion planning. By examining multiple generative approaches, ranging from Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Invertible Neural Networks (INNs) to Generative Transformers (GTs) and Diffusion Models (DMs), we highlight and compare their capabilities and limitations for AD-specific applications. Additionally, we discuss hybrid methods that integrate conventional techniques with generative approaches, and emphasize their improved adaptability and robustness. We also identify relevant datasets and outline open research questions to guide future developments in GenAI. Finally, we discuss three core challenges (safety, interpretability, and real-time capability) and present recommendations for image generation, dynamic scenario generation, and planning.

Title: SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

Authors: Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15867
Pdf URL: https://arxiv.org/pdf/2505.15867
Copy Paste: [[2505.15867]] SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval(https://arxiv.org/abs/2505.15867)
Keywords: generation
Abstract: Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially advancing the state of the art in counterfactual image retrieval.

Title: Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

Authors: Can Rong, Xin Zhang, Yanxin Xi, Hongjie Sui, Jingtao Ding, Yong Li
Subjects: cs.CV, cs.CY, eess.IV
Abstract URL: https://arxiv.org/abs/2505.15870
Pdf URL: https://arxiv.org/pdf/2505.15870
Copy Paste: [[2505.15870]] Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities(https://arxiv.org/abs/2505.15870)
Keywords: generation
Abstract: Commuting origin-destination (OD) flows, which capture the daily mobility of a city's population, are vital for sustainable development across cities around the world. However, the data are challenging to obtain due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals that support high-quality OD flow generation, with over 98% of the expressiveness of traditional, hard-to-collect multisource urban data (sociodemographic, economic, land use, and point-of-interest data). This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any city of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen generalizes well across diverse urban environments on different continents and can generate OD flow data for global cities that are highly consistent with real-world mobility data. We implement GlODGen as an automated tool that seamlessly integrates data acquisition and curation, urban semantic feature extraction, and OD flow generation. It has been released at this https URL.

Title: Challenger: Affordable Adversarial Driving Video Generation

Authors: Zhiyuan Xu, Bohan Li, Huan-ang Gao, Mingju Gao, Yong Chen, Ming Liu, Chenxu Yan, Hang Zhao, Shuo Feng, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15880
Pdf URL: https://arxiv.org/pdf/2505.15880
Copy Paste: [[2505.15880]] Challenger: Affordable Adversarial Driving Video Generation(https://arxiv.org/abs/2505.15880)
Keywords: generation
Abstract: Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios, including cut-ins, sudden lane changes, tailgating, and blind spot intrusions, and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.

Title: Is (Selective) Round-To-Nearest Quantization All You Need?

Authors: Alex Kogan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15909
Pdf URL: https://arxiv.org/pdf/2505.15909
Copy Paste: [[2505.15909]] Is (Selective) Round-To-Nearest Quantization All You Need?(https://arxiv.org/abs/2505.15909)
Keywords: generation
Abstract: Quantization has become a necessary tool for serving ever-growing Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique, and it was in use well before LLMs surged to the forefront of machine learning (ML) research. Yet, it has been largely dismissed by recent and more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but that its token generation throughput can also be better than, and its accuracy similar to, that of more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.
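Example: the round-to-nearest idea named in the abstract can be illustrated with a minimal sketch. This is not the paper's Marlin-based implementation; the function names, the symmetric per-group scaling, and the sample weights are all assumptions chosen for illustration.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # largest positive code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale shared by the whole group
    # Snap each scaled weight to the nearest integer, then clamp to the signed range.
    # Python's round() breaks exact .5 ties toward the even integer.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def rtn_dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Worst-case reconstruction error for a given bit width."""
    codes, scale = rtn_quantize(weights, bits)
    restored = rtn_dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]   # hypothetical weight group
codes4, scale4 = rtn_quantize(weights, bits=4)
err4 = max_abs_error(weights, bits=4)
err8 = max_abs_error(weights, bits=8)               # a higher-precision format
```

Comparing `err4` with `err8` mirrors, in miniature, the abstract's point that accuracy can be recovered by raising the precision of selected layers; which layers to promote is a decision this sketch does not model.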

Title: MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Subjects: cs.LG, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.15946
Pdf URL: https://arxiv.org/pdf/2505.15946
Copy Paste: [[2505.15946]] MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding(https://arxiv.org/abs/2505.15946)
Keywords: generative
Abstract: Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: this https URL.
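Example: the abstract describes shared experts combined through subject-specific routers. The sketch below shows only the generic mixture-of-experts step (softmax router weights blending expert outputs). It is not the paper's dual-stage routing mechanism; the function names, vector sizes, and router scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_outputs, router_scores):
    """Blend expert output vectors using softmax router weights."""
    weights = softmax(router_scores)
    dim = len(expert_outputs[0])
    blended = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return blended, weights


# Outputs of three shared experts for one input (hypothetical 2-D embeddings).
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Only the router scores differ per subject; the experts are reused unchanged.
subject_a_scores = [2.0, 0.1, 0.1]
subject_b_scores = [0.1, 2.0, 0.1]

blend_a, weights_a = route(expert_outputs, subject_a_scores)
blend_b, weights_b = route(expert_outputs, subject_b_scores)
```

Because the weights are explicit, they can be inspected to see which expert dominated a given output, which is the kind of interpretability the abstract attributes to explicit routing.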

Title: VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Authors: Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15952
Pdf URL: https://arxiv.org/pdf/2505.15952
Copy Paste: [[2505.15952]] VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance(https://arxiv.org/abs/2505.15952)
Keywords: generation
Abstract: With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: this https URL

Title: Super-Resolution with Structured Motion

Authors: Gabby Litterio, Juan-David Lizarazo-Ferro, Pedro Felzenszwalb, Rashid Zia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15961
Pdf URL: https://arxiv.org/pdf/2505.15961
Copy Paste: [[2505.15961]] Super-Resolution with Structured Motion(https://arxiv.org/abs/2505.15961)
Keywords: super-resolution
Abstract: We consider the limits of super-resolution using imaging constraints. Due to various theoretical and practical limitations, reconstruction-based methods have been largely restricted to small increases in resolution. In addition, motion blur is usually seen as a nuisance that impedes super-resolution. We show that by using high-precision motion information, sparse image priors, and convex optimization, it is possible to increase resolution by large factors. A key operation in super-resolution is deconvolution with a box. In general, convolution with a box is not invertible. However, we obtain perfect reconstructions of sparse signals using convex optimization. We also show that motion blur can be helpful for super-resolution. We demonstrate that, using pseudo-random motion, it is possible to reconstruct a high-resolution target from a single low-resolution image. We present numerical experiments with simulated data, as well as results with real data captured by a camera mounted on a computer-controlled stage.
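Example: the claim that convolution with a box is not invertible in general can be checked on a toy signal. The sketch assumes a circular (wrap-around) boundary and a width-2 box, neither of which is taken from the paper, and it does not reproduce the paper's convex-optimization reconstruction.

```python
def box_blur_circular(signal, width):
    """Average each sample with its next (width - 1) neighbours, wrapping around."""
    n = len(signal)
    return [
        sum(signal[(i + k) % n] for k in range(width)) / width
        for i in range(n)
    ]


# An alternating signal is averaged away entirely by a width-2 box,
# so it lies in the null space of the blur.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
blurred_alternating = box_blur_circular(alternating, 2)

# Two different signals that therefore produce the same observation.
sparse = [0, 0, 3, 0, 0, 0, 0, 0]
dense = [s + a for s, a in zip(sparse, alternating)]
blurred_sparse = box_blur_circular(sparse, 2)
blurred_dense = box_blur_circular(dense, 2)

# The sparse candidate has the smaller L1 norm of the two.
l1_sparse = sum(abs(v) for v in sparse)
l1_dense = sum(abs(v) for v in dense)
```

Both candidates explain the blurred data equally well, so the blur alone cannot tell them apart. A sparsity prior would prefer the candidate with the smaller L1 norm, which gives some intuition for why sparse signals can still be recovered.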

Title: Position: Agentic Systems Constitute a Key Component of Next-Generation Intelligent Image Processing

Authors: Jinjin Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16007
Pdf URL: https://arxiv.org/pdf/2505.16007
Copy Paste: [[2505.16007]] Position: Agentic Systems Constitute a Key Component of Next-Generation Intelligent Image Processing(https://arxiv.org/abs/2505.16007)
Keywords: generation
Abstract: This position paper argues that the image processing community should broaden its focus from purely model-centric development to include agentic system design as an essential complementary paradigm. While deep learning has significantly advanced capabilities for specific image processing tasks, current approaches face critical limitations in generalization, adaptability, and real-world problem-solving flexibility. We propose that developing intelligent agentic systems, capable of dynamically selecting, combining, and optimizing existing image processing tools, represents the next evolutionary step for the field. Such systems would emulate human experts' ability to strategically orchestrate different tools to solve complex problems, overcoming the brittleness of monolithic models. The paper analyzes key limitations of model-centric paradigms, establishes design principles for agentic image processing systems, and outlines different capability levels for such agents.

Title: Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging

Authors: Weiguo Gao, Ming Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16024
Pdf URL: https://arxiv.org/pdf/2505.16024
Copy Paste: [[2505.16024]] Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging(https://arxiv.org/abs/2505.16024)
Keywords: generation, generative
Abstract: Diffusion trajectory distillation methods aim to accelerate sampling in diffusion models, which produce high-quality outputs but suffer from slow sampling speeds. These methods train a student model to approximate the multi-step denoising process of a pretrained teacher model in a single step, enabling one-shot generation. However, theoretical insights into the trade-off between different distillation strategies and generative quality remain limited, complicating their optimization and selection. In this work, we take a first step toward addressing this gap. Specifically, we reinterpret trajectory distillation as an operator merging problem in the linear regime, where each step of the teacher model is represented as a linear operator acting on noisy data. These operators admit a clear geometric interpretation as projections and rescalings corresponding to the noise schedule. During merging, signal shrinkage occurs as a convex combination of operators, arising from both discretization and limited optimization time of the student model. We propose a dynamic programming algorithm to compute the optimal merging strategy that maximally preserves signal fidelity. Additionally, we demonstrate the existence of a sharp phase transition in the optimal strategy, governed by data covariance structures. Our findings enhance the theoretical understanding of diffusion trajectory distillation and offer practical insights for improving distillation strategies.

Title: CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment

Authors: Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16025
Pdf URL: https://arxiv.org/pdf/2505.16025
Copy Paste: [[2505.16025]] CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment(https://arxiv.org/abs/2505.16025)
Keywords: generation, quality assessment
Abstract: Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.

Title: Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection

Authors: Shichao Li, Peiliang Li, Qing Lian, Peng Yun, Xiaozhi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16029
Pdf URL: https://arxiv.org/pdf/2505.16029
Copy Paste: [[2505.16029]] Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection(https://arxiv.org/abs/2505.16029)
Keywords: generation
Abstract: Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such challenging scenes is performance-critical yet very challenging. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point cloud and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this HTTP URL.

Title: An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection

Authors: Shuvashis Sarker, Shamim Rahim Refat, Faika Fairuj Preotee, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16039
Pdf URL: https://arxiv.org/pdf/2505.16039
Copy Paste: [[2505.16039]] An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection(https://arxiv.org/abs/2505.16039)
Keywords: generative
Abstract: The brain is a highly complex organ that manages many important tasks, including movement, memory, and thinking. Brain-related conditions, like tumors and degenerative disorders, can be hard to diagnose and treat. Magnetic Resonance Imaging (MRI) serves as a key tool for identifying these conditions, offering high-resolution images of brain structures. Despite this, interpreting MRI scans can be complicated. This study tackles this challenge by conducting a comparative analysis of Vision Transformer (ViT) and Transfer Learning (TL) models, such as VGG16, VGG19, ResNet50V2, and MobileNetV2, for classifying brain diseases using MRI data from a Bangladesh-based dataset. ViTs, known for their ability to capture global relationships in images, are particularly effective for medical imaging tasks. Transfer learning helps to mitigate data constraints by fine-tuning pre-trained models. Furthermore, Explainable AI (XAI) methods such as GradCAM, GradCAM++, LayerCAM, ScoreCAM, and Faster-ScoreCAM are employed to interpret model predictions. The results demonstrate that ViT surpasses the transfer learning models, achieving a classification accuracy of 94.39%. The integration of XAI methods enhances model transparency, offering crucial insights to aid medical professionals in diagnosing brain diseases with greater precision.

Title: Few-Shot Test-Time Optimization Without Retraining for Semiconductor Recipe Generation and Beyond

Authors: Shangding Gu, Donghao Ying, Ming Jin, Yu Joe Lu, Jun Wang, Javad Lavaei, Costas Spanos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16060
Pdf URL: https://arxiv.org/pdf/2505.16060
Copy Paste: [[2505.16060]] Few-Shot Test-Time Optimization Without Retraining for Semiconductor Recipe Generation and Beyond(https://arxiv.org/abs/2505.16060)
Keywords: generation
Abstract: We introduce Model Feedback Learning (MFL), a novel test-time optimization framework for optimizing inputs to pre-trained AI models or deployed hardware systems without requiring any retraining of the models or modifications to the hardware. In contrast to existing methods that rely on adjusting model parameters, MFL leverages a lightweight reverse model to iteratively search for optimal inputs, enabling efficient adaptation to new objectives under deployment constraints. This framework is particularly advantageous in real-world settings, such as semiconductor manufacturing recipe generation, where modifying deployed systems is often infeasible or cost-prohibitive. We validate MFL on semiconductor plasma etching tasks, where it achieves target recipe generation in just five iterations, significantly outperforming both Bayesian optimization and human experts. Beyond semiconductor applications, MFL also demonstrates strong performance in chemical processes (e.g., chemical vapor deposition) and electronic systems (e.g., wire bonding), highlighting its broad applicability. Additionally, MFL incorporates stability-aware optimization, enhancing robustness to process variations and surpassing conventional supervised learning and random search methods in high-dimensional control settings. By enabling few-shot adaptation, MFL provides a scalable and efficient paradigm for deploying intelligent control in real-world environments.

Title: Bidirectional Variational Autoencoders

Authors: Bart Kosko, Olaoluwa Adigun
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16074
Pdf URL: https://arxiv.org/pdf/2505.16074
Copy Paste: [[2505.16074]] Bidirectional Variational Autoencoders(https://arxiv.org/abs/2505.16074)
Keywords: generation
Abstract: We present the new bidirectional variational autoencoder (BVAE) network architecture. The BVAE uses a single neural network both to encode and decode instead of an encoder-decoder network pair. The network encodes in the forward direction and decodes in the backward direction through the same synaptic web. Simulations compared BVAEs and ordinary VAEs on the four image tasks of image reconstruction, classification, interpolation, and generation. The image datasets included MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and CelebA-64 face images. The bidirectional structure of BVAEs cut the parameter count by almost 50% and still slightly outperformed the unidirectional VAEs.
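Example: the idea of encoding forward and decoding backward through the same weights can be sketched with a single linear layer. This is a deliberately reduced illustration, assuming that the backward pass uses the transpose of the forward weight matrix. It omits nonlinearities, biases, and the variational sampling step, and the matrix values and function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*matrix)]


# One shared weight matrix: 2 latent units, 3 input units.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.4]]


def encode(x):
    """Forward direction: input space to latent space."""
    return matvec(W, x)


def decode(z):
    """Backward direction: latent space to input space through the same weights."""
    return matvec(transpose(W), z)


x = [1.0, 2.0, 3.0]
z = encode(x)
x_hat = decode(z)

# A separate encoder and decoder of the same shape would each need their own matrix.
shared_params = len(W) * len(W[0])
separate_params = 2 * shared_params
```

Under this assumption the weight count is halved exactly. The abstract reports a reduction of "almost 50%", which suggests that some parameters in the actual architecture are not shared.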

Title: A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization

Authors: Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, Kaize Ding
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16094
Pdf URL: https://arxiv.org/pdf/2505.16094
Copy Paste: [[2505.16094]] A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization(https://arxiv.org/abs/2505.16094)
Keywords: generation
Abstract: Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language and symbolic notations, with emerging extensions to incorporate multi-modal inputs. To advance the new field of LLM for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at this https URL.

Title: Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools

Authors: Panagiotis Lymperopoulos, Vasanth Sarathy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16113
Pdf URL: https://arxiv.org/pdf/2505.16113
Copy Paste: [[2505.16113]] Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools(https://arxiv.org/abs/2505.16113)
Keywords: generation
Abstract: Modern Large Language Models (LLMs) often require external tools, such as machine learning classifiers or knowledge retrieval systems, to provide accurate answers in domains where their pre-trained knowledge is insufficient. This integration of LLMs with external tools expands their utility but also introduces a critical challenge: determining the trustworthiness of responses generated by the combined system. In high-stakes applications, such as medical decision-making, it is essential to assess the uncertainty of both the LLM's generated text and the tool's output to ensure the reliability of the final response. However, existing uncertainty quantification methods do not account for the tool-calling scenario, where both the LLM and external tool contribute to the overall system's uncertainty. In this work, we present a novel framework for modeling tool-calling LLMs that quantifies uncertainty by jointly considering the predictive uncertainty of the LLM and the external tool. We extend previous methods for uncertainty quantification over token sequences to this setting and propose efficient approximations that make uncertainty computation practical for real-world applications. We evaluate our framework on two new synthetic QA datasets, derived from well-known machine learning datasets, which require tool-calling for accurate answers. Additionally, we apply our method to retrieval-augmented generation (RAG) systems and conduct a proof-of-concept experiment demonstrating the effectiveness of our uncertainty metrics in scenarios where external information retrieval is needed. Our results show that the framework is effective in enhancing trust in LLM-based systems, especially in cases where the LLM's internal knowledge is insufficient and external tools are required.

Title: Scalable Graph Generative Modeling via Substructure Sequences

Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.16130
Pdf URL: https://arxiv.org/pdf/2505.16130
Copy Paste: [[2505.16130]] Scalable Graph Generative Modeling via Substructure Sequences(https://arxiv.org/abs/2505.16130)
Keywords: generative
Abstract: Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite its success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at this https URL.

Title: Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention

Authors: Yuang Ai, Huaibo Huang, Tao Wu, Qihang Fan, Ran He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16157
Pdf URL: https://arxiv.org/pdf/2505.16157
Copy Paste: [[2505.16157]] Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention(https://arxiv.org/abs/2505.16157)
Keywords: restoration
Abstract: Transformer-based models have made remarkable progress in image restoration (IR) tasks. However, the quadratic complexity of self-attention in Transformer hinders its applicability to high-resolution images. Existing methods mitigate this issue with sparse or window-based attention, yet inherently limit global context modeling. Linear attention, a variant of softmax attention, demonstrates promise in global context modeling while maintaining linear complexity, offering a potential solution to the above challenge. Despite its efficiency benefits, vanilla linear attention suffers from a significant performance drop in IR, largely due to the low-rank nature of its attention map. To counter this, we propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer. LAformer achieves effective global perception by integrating linear attention and channel attention, while also enhancing local fitting capabilities through a convolutional gated feed-forward network. Notably, LAformer eliminates hardware-inefficient operations such as softmax and window shifting, enabling efficient processing of high-resolution images. Extensive experiments across 7 IR tasks and 21 benchmarks demonstrate that LAformer outperforms SOTA methods and offers significant computational advantages.
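Illustrative note: the complexity argument in this abstract can be sketched in a few lines of plain Python. This is not the authors' RELA or LAformer code; the function names and the `elu(x) + 1` feature map are assumptions chosen for illustration. The sketch shows only generic kernelized linear attention: reassociating the product as phi(Q)(phi(K)^T V) avoids building the N x N attention map, so the cost grows linearly in the number of tokens N. The implicit (unnormalized) map phi(Q)phi(K)^T has rank at most d, which is the low-rank property the abstract refers to.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def feature_map(m):
    """elu(x) + 1, a positive feature map (an assumed, commonly used choice)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def quadratic_attention(q, k, v):
    """Reference form: materializes the N x N map, O(N^2 * d) time."""
    scores = matmul(feature_map(q), transpose(feature_map(k)))  # N x N
    out = matmul(scores, v)
    return [[o / sum(s_row) for o in o_row] for o_row, s_row in zip(out, scores)]


def linear_attention(q, k, v):
    """Same output, reassociated as phi(Q) (phi(K)^T V), O(N * d^2) time."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]   # length-d normalizer term
    out = matmul(fq, kv)
    norm = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[o / z for o in o_row] for o_row, z in zip(out, norm)]


# Small usage example with random inputs (N tokens, d channels).
random.seed(0)
N, D = 6, 3
q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
k = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
v = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of multiplication, and therefore the cost, differs.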

Title: Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey

Authors: Liyan Wang, Weixiang Zhou, Cong Wang, Kin-Man Lam, Zhixun Su, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16161
Pdf URL: https://arxiv.org/pdf/2505.16161
Copy Paste: [[2505.16161]] Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey(https://arxiv.org/abs/2505.16161)
Keywords: restoration, super-resolution
Abstract: Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at this https URL.

Title: Erased or Dormant? Rethinking Concept Erasure Through Reversibility

Authors: Ping Liu, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16174
Pdf URL: https://arxiv.org/pdf/2505.16174
Copy Paste: [[2505.16174]] Erased or Dormant? Rethinking Concept Erasure Through Reversibility(https://arxiv.org/abs/2505.16174)
Keywords: generative
Abstract: To what extent does concept erasure eliminate generative capacity in diffusion models? While prior evaluations have primarily focused on measuring concept suppression under specific textual prompts, we explore a complementary and fundamental question: do current concept erasure techniques genuinely remove the ability to generate targeted concepts, or do they merely achieve superficial, prompt-specific suppression? We systematically evaluate the robustness and reversibility of two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion, by probing their ability to eliminate targeted generative behaviors in text-to-image models. These methods attempt to suppress undesired semantic concepts by modifying internal model parameters, either through targeted attention edits or model-level fine-tuning strategies. To rigorously assess whether these techniques truly erase generative capacity, we propose an instance-level evaluation strategy that employs lightweight fine-tuning to explicitly test the reactivation potential of erased concepts. Through quantitative metrics and qualitative analyses, we show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation, indicating that current methods suppress latent generative representations without fully eliminating them. Our findings reveal critical limitations in existing concept erasure approaches and highlight the need for deeper, representation-level interventions and more rigorous evaluation standards to ensure genuine, irreversible removal of concepts from generative models.

Title: Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Authors: Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, Anh Totti Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16181
Pdf URL: https://arxiv.org/pdf/2505.16181
Copy Paste: [[2505.16181]] Understanding Generative AI Capabilities in Everyday Image Editing Tasks(https://arxiv.org/abs/2505.16181)
Keywords: generative
Abstract: Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, only about 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the other side of the table, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits. Code and qualitative examples are available at: this https URL

Title: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Authors: Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16239
Pdf URL: https://arxiv.org/pdf/2505.16239
Copy Paste: [[2505.16239]] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution(https://arxiv.org/abs/2505.16239)
Keywords: restoration, super-resolution
Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step ones, provide a potential solution. Nonetheless, achieving one-step VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28$\times$ speed-up over existing methods such as MGLD-VSR. Code is available at: this https URL.

Title: Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

Authors: Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16265
Pdf URL: https://arxiv.org/pdf/2505.16265
Copy Paste: [[2505.16265]] Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models(https://arxiv.org/abs/2505.16265)
Keywords: generative
Abstract: Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model's long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches.
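Illustrative note: the Bradley-Terry reward model that this abstract uses as its baseline can be sketched as follows. This is a generic textbook formulation, not code from the paper, and the function names are hypothetical. Each response receives one scalar (pointwise) reward, and the probability that response A is preferred over response B is the logistic sigmoid of the reward difference.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b)."""
    diff = reward_a - reward_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference (training loss)."""
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))
```

Equal rewards give a preference probability of 0.5, and the loss shrinks as the chosen response's reward rises above the rejected one's. A generative reward model that outputs only a pairwise verdict does not directly supply the per-response scalar that this formulation relies on, which is the incompatibility the abstract mentions.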

Title: Paired and Unpaired Image to Image Translation using Generative Adversarial Networks

Authors: Gaurav Kumar, Soham Satyadharma, Harpreet Singh
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16310
Pdf URL: https://arxiv.org/pdf/2505.16310
Copy Paste: [[2505.16310]] Paired and Unpaired Image to Image Translation using Generative Adversarial Networks(https://arxiv.org/abs/2505.16310)
Keywords: generation, generative
Abstract: Image to image translation is an active area of research in the field of computer vision, enabling the generation of new images with different styles, textures, or resolutions while preserving their characteristic properties. Recent architectures leverage Generative Adversarial Networks (GANs) to transform input images from one domain to another. In this work, we focus on the study of both paired and unpaired image translation across multiple image domains. For the paired task, we used a conditional GAN model, and for the unpaired task, we trained it using cycle consistency loss. We experimented with different types of loss functions, multiple Patch-GAN sizes, and model architectures. New quantitative metrics - precision, recall, and FID score - were used for analysis. In addition, a qualitative study of the results of different experiments was conducted.

Title: NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

Authors: Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Jianhui Sun, Xinli Yue, Tianyi Wang, Huan Hou, Junda Lu, Xinyang Huang, Zitang Zhou, Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao, Trong-Hieu Nguyen-Mau, Minh-Hoang Le, Minh-Khoa Le-Phan, Duy-Nam Ly, Hai-Dang Nguyen, Minh-Triet Tran, Yukang Lin, Yan Hong, Chuanbiao Song, Siyuan Li, Jun Lan, Zhichao Zhang, Xinyue Li, Wei Sun, Zicheng Zhang, Yunhao Li, Xiaohong Liu, Guangtao Zhai, Zitong Xu, Huiyu Duan, Jiarui Wang, Guangji Ma, Liu Yang, Lu Liu, Qiang Hu, Xiongkuo Min, Zichuan Wang, Zhenchen Tang, Bo Peng, Jing Dong, Fengbin Guan, Zihao Yu, Yiting Lu, Wei Luo, Xin Li, Minhao Lin, Haofeng Chen, Xuanxuan He, Kele Xu, Qisheng Xu, Zijian Gao, Tianjiao Wan, Bo-Cheng Qiu, Chih-Chung Hsu, Chia-ming Lee, Yu-Fan Lin, Bo Yu, Zehao Wang, Da Mu, Mingxiu Chen, Junkang Fang, Huamei Sun, Wending Zhao, Zhiyu Wang, Wang Liu, Weikang Yu, Puhong Duan, Bin Sun, Xudong Kang, Shutao Li, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Jiarong He, Zhishan Qiao, Yongqing Huang, Zewen Chen, Zhe Pang, Juan Wang, Jian Guo, Zhizhuo Shao, Ziyu Feng, Bing Li, Weiming Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16314
Pdf URL: https://arxiv.org/pdf/2505.16314
Copy Paste: [[2505.16314]] NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment(https://arxiv.org/abs/2505.16314)
Keywords: restoration, generation, generative, quality assessment
Abstract: This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track had a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants registered in the structure track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on T2I model quality assessment.

Title: SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models

Authors: Hossein Khalili, Seongbin Park, Venkat Bollapragada, Nader Sehatbakhsh
Subjects: cs.CV, cs.CR, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16318
Pdf URL: https://arxiv.org/pdf/2505.16318
Copy Paste: [[2505.16318]] SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models(https://arxiv.org/abs/2505.16318)
Keywords: super-resolution
Abstract: As vision-based machine learning models are increasingly integrated into autonomous and cyber-physical systems, concerns about (physical) adversarial patch attacks are growing. While state-of-the-art defenses can achieve certified robustness with minimal impact on utility against highly-concentrated localized patch attacks, they fall short in two important areas: (i) State-of-the-art methods are vulnerable to low-noise distributed patches where perturbations are subtly dispersed to evade detection or masking, as shown recently by the DorPatch attack; (ii) Achieving high robustness with state-of-the-art methods is extremely time- and resource-consuming, rendering them impractical for latency-sensitive applications in many cyber-physical systems. To address both robustness and latency issues, this paper proposes a new defense strategy for adversarial patch attacks called SuperPure. The key novelty is developing a pixel-wise masking scheme that is robust against both distributed and localized patches. The masking involves leveraging a GAN-based super-resolution scheme to gradually purify the image from adversarial patches. Our extensive evaluations using ImageNet and two standard classifiers, ResNet and EfficientNet, show that SuperPure advances the state of the art in three major directions: (i) it improves the robustness against conventional localized patches by more than 20%, on average, while also improving top-1 clean accuracy by almost 10%; (ii) it achieves 58% robustness against distributed patch attacks (as opposed to 0% for the state-of-the-art method, PatchCleanser); (iii) it decreases the defense end-to-end latency by over 98% compared to PatchCleanser. Our further analysis shows that SuperPure is robust against white-box attacks and different patch sizes. Our code is open-source.

Title: TensorAR: Refinement is All You Need in Autoregressive Image Generation

Authors: Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16324
Pdf URL: https://arxiv.org/pdf/2505.16324
Copy Paste: [[2505.16324]] TensorAR: Refinement is All You Need in Autoregressive Image Generation(https://arxiv.org/abs/2505.16324)
Keywords: generation
Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.

Title: ChemMLLM: Chemical Multimodal Large Language Model

Authors: Qian Tan, Dongzhan Zhou, Peng Xia, Wanhao Liu, Wanli Ouyang, Lei Bai, Yuqiang Li, Tianfan Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16326
Pdf URL: https://arxiv.org/pdf/2505.16326
Copy Paste: [[2505.16326]] ChemMLLM: Chemical Multimodal Large Language Model(https://arxiv.org/abs/2505.16326)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks across text, molecular SMILES strings, and images, and curate the corresponding datasets. We benchmark ChemMLLM against a range of leading general-purpose MLLMs and chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in the molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 118.9% (4.27 vs. 1.95 property improvement). The code is publicly available at this https URL.
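Summary: Multimodal large language models (MLLMs) have made impressive progress across many applications in recent years, yet chemical MLLMs capable of cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks spanning text, molecular SMILES strings, and images, and curate the datasets for them. We benchmark ChemMLLM against a range of leading general-purpose MLLMs and chemical LLMs on these tasks. The results show that ChemMLLM achieves superior performance on every evaluated task. In the molecule image optimization task, for example, ChemMLLM outperforms the best baseline (GPT-4o) by 118.9% (4.27 vs. 1.95 property improvement). The code is publicly available at this https URL.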

Title: FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16335
Pdf URL: https://arxiv.org/pdf/2505.16335
Copy Paste: [[2505.16335]] FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design(https://arxiv.org/abs/2505.16335)
Keywords: generation
Abstract: Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that, compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fréchet Inception Distance (FID) from 10.83 to 3.58 and Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on the AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 images/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and the GPU baseline, respectively.
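Summary: Visual autoregressive (VAR) modeling marks a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, which yields better image quality and faster inference than existing diffusion models. Its large parameter count and computation cost, however, hinder deployment on edge devices. To reduce memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR built on algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR, then propose Dual Format Quantization for the highly imbalanced input activations, and Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation for the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier based on lookup tables on FPGA, and propose the first FPGA-based VAR accelerator, featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that, compared with the state-of-the-art quantization method, FPQVAR improves Fréchet Inception Distance (FID) from 10.83 to 3.58 and Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on the AMD-Xilinx VCK190 FPGA reaches a throughput of 1.1 images/s, 3.1x higher than the integer-based accelerator, and shows 3.6x and 2.8x higher energy efficiency than the integer-based accelerator and the GPU baseline, respectively.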
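
Why a group-wise Hadamard transformation helps with outlier channels can be seen in a short sketch. It is a generic illustration, not the paper's GHT or its learnable variant: the activation values and the group size of 4 are made up, and the function names are hypothetical. An orthonormal Walsh-Hadamard rotation spreads one large channel across its group, so the peak magnitude a quantizer must cover shrinks while the total energy stays the same.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform.

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def groupwise_fwht(values, group_size):
    """Apply the transform independently to consecutive channel groups.

    len(values) is expected to be a multiple of group_size.
    """
    out = []
    for g in range(0, len(values), group_size):
        out.extend(fwht(values[g:g + group_size]))
    return out


# Hypothetical activations with one outlier channel (64.0).
activations = [64.0, 1.0, -1.0, 0.5, 0.25, -0.5, 1.0, -1.0]
rotated = groupwise_fwht(activations, group_size=4)

peak_before = max(abs(v) for v in activations)
peak_after = max(abs(v) for v in rotated)
energy_before = sum(v * v for v in activations)
energy_after = sum(v * v for v in rotated)
print(peak_before, peak_after)
print(energy_before, energy_after)
```

Because the rotation is orthogonal, applying the matching rotation to the weights leaves the layer output unchanged in exact arithmetic; only the distribution of values seen by the quantizer changes.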

Title: A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules

Authors: Manuel Ruiz-Botella, Marta Sales-Pardo, Roger Guimerà
Subjects: cs.LG, cs.AI, physics.comp-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.16365
Pdf URL: https://arxiv.org/pdf/2505.16365
Copy Paste: [[2505.16365]] A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules(https://arxiv.org/abs/2505.16365)
Keywords: generation
Abstract: Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, and potential biases and limitations of CoCoGraph.
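Summary: Developing new molecular compounds is crucial for addressing pressing challenges, from health to environmental sustainability. Exploring molecular space to discover new molecules is difficult, however, because the space is vast. We introduce CoCoGraph, a collaborative and constrained graph diffusion model that generates molecules guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. An analysis of 36 chemical properties also shows that the molecules CoCoGraph generates match the distributions of real molecules more closely than those of current models. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and ran a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules and the potential biases and limitations of CoCoGraph.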

Title: AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems

Authors: Yuanhao Huang, Yilong Ren, Jinlei Wang, Lujia Huo, Xuesong Bai, Jinchuan Zhang, Haiyan Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16402
Pdf URL: https://arxiv.org/pdf/2505.16402
Copy Paste: [[2505.16402]] AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems(https://arxiv.org/abs/2505.16402)
Keywords: generation
Abstract: Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in safety accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D samples to address the challenges of intra-class diversity and environmental variations in real-world scenarios. Building upon this framework, we introduce an adversarial sample reality enhancement approach that incorporates non-rigid surface modeling and a realistic 3D matching mechanism. We compare with 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experimental results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Moreover, the proposed method demonstrates excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at this https URL.
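Summary: Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. Perception methods based on deep learning, however, are extremely vulnerable to adversarial samples, which can lead to safety accidents. Generating effective adversarial examples in the physical world and using them to evaluate object detection systems is a major challenge. We propose a unified joint adversarial training framework for both 2D and 3D samples to address intra-class diversity and environmental variation in real-world scenarios. Building on this framework, we introduce an adversarial sample reality enhancement approach that combines non-rigid surface modeling with a realistic 3D matching mechanism. We compare against 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experiments in digital and physical environments show that the adversarial textures generated by our method effectively mislead the target detection model. The proposed method also shows excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at this https URL.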

Title: Pose-invariant face recognition via feature-space pose frontalization

Authors: Nikolay Stanishev, Yuhang Lu, Touradj Ebrahimi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16412
Pdf URL: https://arxiv.org/pdf/2505.16412
Copy Paste: [[2505.16412]] Pose-invariant face recognition via feature-space pose frontalization(https://arxiv.org/abs/2505.16412)
Keywords: generative
Abstract: Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods either perform face frontalization via generative models or learn a pose-robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that our method not only outperforms the state of the art in the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.
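Summary: Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. The task is to match a profile face captured in the wild with a frontal face registered in a database. Existing methods either perform face frontalization with generative models or learn a pose-robust feature representation. This paper presents a new method that performs face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images at arbitrary angles into frontal counterparts. Second, a new training paradigm, consisting of a pre-training stage and an attention-guided fine-tuning stage, is proposed to maximize the potential of FSPFM and boost its performance. Extensive experiments were conducted on five popular face recognition benchmarks. The results show that the method not only outperforms the state of the art on pose-invariant face recognition but also maintains superior performance in other standard scenarios.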

Title: Joint Flow And Feature Refinement Using Attention For Video Restoration

Authors: Ranjith Merugu, Mohammad Sameer Suhail, Akshay P Sarashetti, Venkata Bharath Reddy Reddem, Pankaj Kumar Bajpai, Amit Satish Unde
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16434
Pdf URL: https://arxiv.org/pdf/2505.16434
Copy Paste: [[2505.16434]] Joint Flow And Feature Refinement Using Attention For Video Restoration(https://arxiv.org/abs/2505.16434)
Keywords: restoration, super-resolution
Abstract: Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. Compared with static images, the performance of video restoration significantly depends on efficient exploitation of temporal correlations among successive video frames. Numerous techniques make use of temporal information via flow-based strategies or recurrent architectures. However, these methods often encounter difficulties in preserving temporal consistency as they utilize degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). The proposed JFFRA is based on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By leveraging previously enhanced features to refine flow and vice versa, JFFRA enables efficient feature enhancement using temporal information. This interplay between flow and restoration is executed at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to enhance the network's capability in eliminating flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across various restoration tasks such as denoising, deblurring, and super-resolution. Our method demonstrates a remarkable performance improvement of up to 1.62 dB compared to state-of-the-art approaches.
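Summary: Recent advances in video restoration focus on recovering high-quality video frames from low-quality inputs. Compared with static images, video restoration performance depends heavily on efficiently exploiting the temporal correlations among successive frames. Many techniques use temporal information through flow-based strategies or recurrent architectures, but they often struggle to preserve temporal consistency because they operate on degraded input frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). JFFRA rests on the key idea of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By using previously enhanced features to refine the flow and vice versa, JFFRA enables efficient feature enhancement from temporal information. This interplay between flow and restoration is carried out at multiple scales, which reduces the dependence on precise flow estimation. We also incorporate an occlusion-aware temporal loss function to strengthen the network's ability to eliminate flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across restoration tasks such as denoising, deblurring, and super-resolution, with a performance improvement of up to 1.62 dB over state-of-the-art approaches.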

Title: MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM

Authors: Siwei Meng, Yawei Luo, Ping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16456
Pdf URL: https://arxiv.org/pdf/2505.16456
Copy Paste: [[2505.16456]] MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM(https://arxiv.org/abs/2505.16456)
Keywords: generation, generative
Abstract: Recent advances in static 3D generation have intensified the demand for physically consistent dynamic 3D content. However, existing video generation models, including diffusion-based methods, often prioritize visual realism while neglecting physical plausibility, resulting in implausible object dynamics. Prior approaches for physics-aware dynamic generation typically rely on large-scale annotated datasets or extensive model fine-tuning, which imposes significant computational and data collection burdens and limits scalability across scenarios. To address these challenges, we present MAGIC, a training-free framework for single-image physical property inference and dynamic generation, integrating pretrained image-to-video diffusion models with iterative LLM-based reasoning. Our framework generates motion-rich videos from a static image and closes the visual-to-physical gap through a confidence-driven LLM feedback loop that adaptively steers the diffusion model toward physics-relevant motion. To translate visual dynamics into controllable physical behavior, we further introduce a differentiable MPM simulator operating directly on 3D Gaussians reconstructed from the single image, enabling physically grounded, simulation-ready outputs without any supervision or model tuning. Experiments show that MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than state-of-the-art video diffusion models.
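Summary: Recent advances in static 3D generation have intensified the demand for physically consistent dynamic 3D content. Existing video generation models, including diffusion-based methods, often prioritize visual realism while neglecting physical plausibility, which results in implausible object dynamics. Prior approaches to physics-aware dynamic generation typically rely on large-scale annotated datasets or extensive model fine-tuning, which imposes significant computational and data collection burdens and limits scalability across scenarios. To address these challenges, we present MAGIC, a training-free framework for single-image physical property inference and dynamic generation that integrates pretrained image-to-video diffusion models with iterative LLM-based reasoning. The framework generates motion-rich videos from a static image and closes the visual-to-physical gap through a confidence-driven LLM feedback loop that adaptively steers the diffusion model toward physics-relevant motion. To translate visual dynamics into controllable physical behavior, we further introduce a differentiable MPM simulator that operates directly on 3D Gaussians reconstructed from the single image, enabling physically grounded, simulation-ready outputs without any supervision or model tuning. Experiments show that MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than state-of-the-art video diffusion models.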

Title: Consistent World Models via Foresight Diffusion

Authors: Yu Zhang, Xingzhuo Guo, Haoran Xu, Mingsheng Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16474
Pdf URL: https://arxiv.org/pdf/2505.16474
Copy Paste: [[2505.16474]] Consistent World Models via Foresight Diffusion(https://arxiv.org/abs/2505.16474)
Keywords: generation
Abstract: Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
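Summary: Diffusion and flow-based models have enabled significant progress in generation tasks across modalities and have recently found applications in world modeling. Unlike typical generation tasks, which encourage sample diversity, world models involve different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, a requirement that we empirically observe diffusion models struggle to meet. We argue that a key bottleneck in learning consistent diffusion-based world models is suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that improves consistency by decoupling condition understanding from target denoising. ForeDiff adds a separate deterministic predictive stream that processes the conditioning inputs independently of the denoising stream, and further uses a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.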

Title: Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration

Authors: Yuetong Liu, Yunqiu Xu, Yang Wei, Xiuli Bi, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16479
Pdf URL: https://arxiv.org/pdf/2505.16479
Copy Paste: [[2505.16479]] Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration(https://arxiv.org/abs/2505.16479)
Keywords: restoration, generation
Abstract: Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, as multiple weather conditions often coexist in the real world alongside various lighting effects at night. This paper first explores the challenging multi-weather nighttime image restoration task, where various types of weather degradations are intertwined with flare effects. To support the research, we contribute the AllWeatherNight dataset, featuring large-scale high-quality nighttime images with diverse compositional degradations, synthesized using our introduced illumination-aware degradation generation. Moreover, we present ClearNight, a unified nighttime image restoration framework, which effectively removes complex degradations in one go. Specifically, ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, thereby enhancing restoration effectiveness in nighttime scenarios. In order to better represent the common and unique characteristics of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects optimal candidate units associated with specific weather types. Our ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of the AllWeatherNight dataset as well as the effectiveness of ClearNight. Project page: this https URL
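Summary: Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, because several weather conditions often coexist in the real world alongside various lighting effects at night. This paper is the first to explore the challenging multi-weather nighttime image restoration task, in which different types of weather degradation are intertwined with flare effects. To support this research, we contribute the AllWeatherNight dataset, which features large-scale, high-quality nighttime images with diverse compositional degradations, synthesized with our illumination-aware degradation generation. We also present ClearNight, a unified nighttime image restoration framework that removes complex degradations in one pass. ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, which improves restoration in nighttime scenarios. To better represent the common and unique characteristics of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects the optimal candidate units associated with specific weather types. ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of the AllWeatherNight dataset and the effectiveness of ClearNight. Project page: this https URL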

Title: Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling

Authors: Xinxing Shi, Xiaoyu Jiang, Mauricio A. Álvarez
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16481
Pdf URL: https://arxiv.org/pdf/2505.16481
Copy Paste: [[2505.16481]] Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling(https://arxiv.org/abs/2505.16481)
Keywords: generation
Abstract: Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. However, performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. In this work, we propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, our method preserves essential latent dependencies, allowing more flexible kernel choices and mitigating the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we demonstrate that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.
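Summary: Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. Exact GP inference in large-scale GPVAEs is computationally prohibitive, however, which often forces existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. We propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, the method preserves essential latent dependencies, allows more flexible kernel choices, and reduces the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we show that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.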

Title: InspectionV3: Enhancing Tobacco Quality Assessment with Deep Convolutional Neural Networks for Automated Workshop Management

Authors: Yao Wei, Muhammad Usman, Hazrat Bilal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16485
Pdf URL: https://arxiv.org/pdf/2505.16485
Copy Paste: [[2505.16485]] InspectionV3: Enhancing Tobacco Quality Assessment with Deep Convolutional Neural Networks for Automated Workshop Management(https://arxiv.org/abs/2505.16485)
Keywords: quality assessment
Abstract: The problems that tobacco workshops encounter include poor curing, inconsistencies in supplies, irregular scheduling, and a lack of oversight, all of which drive up expenses and worsen quality. Large quantities make manual examination costly, sluggish, and unreliable. Deep convolutional neural networks have recently made strides in capabilities that transcend those of conventional methods. To effectively enhance them, nevertheless, extensive customization is needed to account for subtle variations in tobacco grade. This study introduces InspectionV3, an integrated solution for automated flue-cured tobacco grading that makes use of a customized deep convolutional neural network architecture. A scope that covers color, maturity, and curing subtleties is established via a labelled dataset consisting of 21,113 images spanning 20 quality classes. Expert annotators performed preprocessing on the tobacco leaf images, including cleaning, labelling, and augmentation. Multi-layer CNN factors use batch normalization to describe domain properties such as permeability and moisture spots, and so account for the subtleties of the workshop. Its expertise lies in converting visual patterns into useful information for enhancing workflow. Fast notifications are made possible by real-time, on-the-spot grading that matches human expertise. Image-powered analytics dashboards facilitate the tracking of yield projections, inventories, bottlenecks, and the optimization of data-driven choices. More labelled images are assimilated after further retraining, improving representational capacities and enabling adaptations for seasonal variability. Metrics demonstrate 97% accuracy, 95% precision and recall, 96% F1-score and AUC, and 95% specificity, validating real-world viability.
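Summary: Tobacco workshops face poor curing, inconsistent supplies, irregular scheduling, and a lack of oversight, all of which drive up expenses and worsen quality. Large quantities make manual examination costly, slow, and unreliable. Deep convolutional neural networks have recently made strides in capabilities beyond those of conventional methods, but extensive customization is needed to account for subtle variations in tobacco grade. This study introduces InspectionV3, an integrated solution for automated flue-cured tobacco grading that uses a customized deep convolutional neural network architecture. A scope covering color, maturity, and curing subtleties is established via a labelled dataset of 21,113 images spanning 20 quality classes. Expert annotators preprocessed the tobacco leaf images, including cleaning, labelling, and augmentation. Multi-layer CNN factors use batch normalization to describe domain properties such as permeability and moisture spots, and so account for the subtleties of the workshop. Its expertise lies in converting visual patterns into useful information for enhancing workflow. Real-time, on-the-spot grading that matches human expertise makes fast notifications possible. Image-powered analytics dashboards facilitate the tracking of yield projections, inventories, and bottlenecks, and the optimization of data-driven choices. More labelled images are assimilated through further retraining, improving representational capacity and enabling adaptation to seasonal variability. Metrics show 97% accuracy, 95% precision and recall, 96% F1-score and AUC, and 95% specificity, validating real-world viability.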

Title: ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Authors: Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16495
Pdf URL: https://arxiv.org/pdf/2505.16495
Copy Paste: [[2505.16495]] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation(https://arxiv.org/abs/2505.16495)
Keywords: generation
Abstract: While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-offs between mask quality and efficiency are implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at this https URL.
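Summary: Humans effortlessly draw visual objects and shapes by adaptively allocating attention according to their complexity, whereas existing multimodal large language models (MLLMs) remain constrained by rigid token representations. To bridge this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. It relies on a novel token length predictor, together with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-off between mask quality and efficiency are implemented with group relative policy optimization (GRPO). Experiments show that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at this https URL.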

Title: Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Authors: Jiaxin Liu, Jia Wang, Saihui Hou, Min Ren, Huijia Wu, Zhaofeng He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16512
Pdf URL: https://arxiv.org/pdf/2505.16512
Copy Paste: [[2505.16512]] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection(https://arxiv.org/abs/2505.16512)
Keywords: generation
Abstract: In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.
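Summary: In recent years, the rapid development of deepfake technology has produced an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic, consistent videos from multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Using five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset of 60,000 videos (8.4 million frames) covering multiple nationalities, skin tones, genders, and real-world scenarios, which significantly improves data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models show large drops in AUC on DigiFakeAV, which underlines how challenging the dataset is. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of the video and the semantic-acoustic features of the audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that the method identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.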

Title: Joint Relational Database Generation via Graph-Conditional Diffusion Models

Authors: Mohamed Amine Ketata, David Lüdke, Leo Schwinn, Stephan Günnemann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16527
Pdf URL: https://arxiv.org/pdf/2505.16527
Copy Paste: [[2505.16527]] Joint Relational Database Generation via Graph-Conditional Diffusion Models(https://arxiv.org/abs/2505.16527)
Keywords: generation, generative
Abstract: Building generative models for relational databases (RDBs) is important for applications like privacy-preserving data release and augmenting real datasets. However, most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. This approach limits parallelism, restricts flexibility in downstream applications like missing value imputation, and compounds errors due to commonly made conditional independence assumptions. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order. By using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM). GRDM leverages a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs demonstrate that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.
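Summary: Building generative models for relational databases (RDBs) is important for applications such as privacy-preserving data release and augmenting real datasets. Most prior work, however, either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. This approach limits parallelism, restricts flexibility in downstream applications such as missing value imputation, and compounds errors because of commonly made conditional independence assumptions. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order. Using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM), which uses a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs show that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.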
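
One plausible reading of the "natural graph representation" mentioned in the abstract is that each row becomes a node and each foreign-key reference becomes an edge. The sketch below builds such a graph for a made-up two-table database; the table names, columns, and the `build_graph` helper are hypothetical, and the abstract does not confirm that GRDM uses exactly this construction.

```python
from collections import defaultdict

# Hypothetical toy database: two tables linked by one foreign key.
tables = {
    "customer": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}],
    "order": [
        {"id": 10, "customer_id": 1, "total": 30.0},
        {"id": 11, "customer_id": 1, "total": 12.5},
        {"id": 12, "customer_id": 2, "total": 99.0},
    ],
}
# Each entry: (child table, foreign-key column, parent table).
foreign_keys = [("order", "customer_id", "customer")]


def build_graph(tables, foreign_keys, key="id"):
    """Map rows to nodes and foreign-key references to undirected edges."""
    nodes = {(name, row[key]): row
             for name, rows in tables.items() for row in rows}
    neighbours = defaultdict(set)
    for child, column, parent in foreign_keys:
        for row in tables[child]:
            src, dst = (child, row[key]), (parent, row[column])
            if dst not in nodes:
                raise KeyError(f"dangling foreign key: {src} -> {dst}")
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    return nodes, neighbours


nodes, neighbours = build_graph(tables, foreign_keys)
print(len(nodes))
print(sorted(neighbours[("customer", 1)]))
```

In this representation no table comes first: two orders of the same customer are two hops apart, which is the kind of multi-hop inter-table correlation the abstract refers to.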

Title: HOFT: Householder Orthogonal Fine-tuning

Authors: Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16531
Pdf URL: https://arxiv.org/pdf/2505.16531
Copy Paste: [[2505.16531]] HOFT: Householder Orthogonal Fine-tuning(https://arxiv.org/abs/2505.16531)
Keywords: generation
Abstract: Adaptation of foundation models using low-rank methods is a widespread approach. Another way to adapt these models is to employ orthogonal fine-tuning methods, which are less time and memory efficient despite their good generalization properties. In this work, we propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to alleviate time and space complexity. Moreover, some theoretical properties of the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are evaluated in downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.
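Summary: Adapting foundation models with low-rank methods is a widespread approach. Another way to adapt these models is orthogonal fine-tuning, which generalizes well but is less time- and memory-efficient. We propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to reduce time and space complexity. We also explore some theoretical properties of the orthogonal fine-tuning paradigm, and from this exploration propose Scaled Householder Orthogonal Fine-tuning (SHOFT). Both HOFT and SHOFT are evaluated on downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation, and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.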
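
The building block named in the title can be shown in a few lines. A Householder reflection H = I - 2vvᵀ/(vᵀv) is orthogonal, and so is any product of such reflections, which gives an orthogonal matrix parameterised by a handful of vectors. The sketch below is a generic, dependency-free illustration and not the HOFT or SHOFT algorithm: the vectors, the frozen weight, and applying the rotation on the left are all assumptions made for the example.

```python
def householder(v):
    """Return the reflection H = I - 2 v v^T / (v^T v) as nested lists."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0:
        raise ValueError("v must be non-zero")
    n = len(v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def orthogonal_from_vectors(vectors):
    """Multiply one Householder reflection per vector into a matrix Q."""
    n = len(vectors[0])
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v in vectors:
        q = matmul(q, householder(v))
    return q


# Hypothetical trainable vectors and a frozen 3x2 weight matrix.
vectors = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
frozen_weight = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]

q = orthogonal_from_vectors(vectors)
adapted_weight = matmul(q, frozen_weight)  # rotate the frozen weights
gram = matmul(transpose(q), q)             # should be close to identity
print(gram)
```

Since Q is orthogonal, the adapted weight keeps the Frobenius norm of the frozen weight; only its orientation changes.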

Title: SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion

Authors: Asrar Alruwayqi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16535
Pdf URL: https://arxiv.org/pdf/2505.16535
Copy Paste: [[2505.16535]] SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion(https://arxiv.org/abs/2505.16535)
Keywords: generative
Abstract: We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.

Title: Incremental Sequence Classification with Temporal Consistency

Authors: Lucas Maystre, Gabriel Barello, Tudor Berariu, Aleix Cambray, Rares Dolga, Alvaro Ortega Gonzalez, Andrei Nica, David Barber
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16548
Pdf URL: https://arxiv.org/pdf/2505.16548
Copy Paste: [[2505.16548]] Incremental Sequence Classification with Temporal Consistency(https://arxiv.org/abs/2505.16548)
Keywords: generation
Abstract: We address the problem of incremental sequence classification, where predictions are updated as new elements in the sequence are revealed. Drawing on temporal-difference learning from reinforcement learning, we identify a temporal-consistency condition that successive predictions should satisfy. We leverage this condition to develop a novel loss function for training incremental sequence classifiers. Through a concrete example, we demonstrate that optimizing this loss can offer substantial gains in data efficiency. We apply our method to text classification tasks and show that it improves predictive accuracy over competing approaches on several benchmark datasets. We further evaluate our approach on the task of verifying large language model generations for correctness in grade-school math problems. Our results show that models trained with our method are better able to distinguish promising generations from unpromising ones after observing only a few tokens.
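The abstract states that successive predictions should satisfy a temporal-consistency condition borrowed from temporal-difference learning, but it does not give the loss itself. The sketch below is one plausible TD(0)-style reading, offered as an assumption: the prediction after t elements is trained toward the prediction after t+1 elements, and the final prediction is trained toward the true label. The function names and the example probabilities are hypothetical.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a (possibly soft) target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def td_consistency_loss(preds, label):
    """Average TD(0)-style loss over one sequence.

    preds[t] is the class distribution predicted after seeing t+1 elements.
    Each prediction is pulled toward its successor; the last one toward the label.
    In real training the successor target would be held fixed (no gradient).
    """
    num_classes = len(preds[0])
    one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
    total = 0.0
    for t, pred in enumerate(preds):
        target = preds[t + 1] if t + 1 < len(preds) else one_hot
        total += cross_entropy(target, pred)
    return total / len(preds)


# Predictions that move steadily toward class 1 ...
smooth = [[0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]
# ... versus predictions that flip back and forth.
jumpy = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]

loss_smooth = td_consistency_loss(smooth, label=1)
loss_jumpy = td_consistency_loss(jumpy, label=1)
```

Under this reading, the sequence whose predictions oscillate is penalized more heavily than the one that converges gradually, which matches the intuition that early predictions should already anticipate later ones.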

Title: M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Authors: Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16565
Pdf URL: https://arxiv.org/pdf/2505.16565
Copy Paste: [[2505.16565]] M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion(https://arxiv.org/abs/2505.16565)
Keywords: generation
Abstract: We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.
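To make the "warped right view" and "disocclusion mask" inputs concrete, here is a minimal single-scanline sketch of depth-based reprojection for a rectified stereo pair. It assumes integer disparities and the usual convention that a left-view pixel at column x appears in the right view at column x - d; both are simplifications of mine, not details taken from the paper.

```python
def warp_scanline(left_row, disparity_row):
    """Forward-warp one row of the left view into the right view.

    Returns the warped row (None where nothing landed) and a boolean
    disocclusion mask marking those empty positions.
    """
    width = len(left_row)
    warped = [None] * width
    best_disp = [-1] * width  # largest disparity seen so far at each target column
    for x, (value, d) in enumerate(zip(left_row, disparity_row)):
        tx = x - d  # target column in the right view
        # Nearer surfaces have larger disparity, so they win collisions.
        if 0 <= tx < width and d > best_disp[tx]:
            warped[tx] = value
            best_disp[tx] = d
    mask = [v is None for v in warped]
    return warped, mask


# A foreground object ("c", "d") with disparity 2 in front of a flat background.
left_row = ["a", "b", "c", "d", "e", "f"]
disparity_row = [0, 0, 2, 2, 0, 0]
warped_row, disocclusion_mask = warp_scanline(left_row, disparity_row)
```

The foreground shifts left, covers part of the background, and leaves a hole where it used to be. Those hole pixels are the ones an inpainting model has to fill.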

Title: Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

Authors: Vignesh Gopinathan, Urs Zimmermann, Michael Arnold, Matthias Rottmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16594
Pdf URL: https://arxiv.org/pdf/2505.16594
Copy Paste: [[2505.16594]] Temporal Object Captioning for Street Scene Videos from LiDAR Tracks(https://arxiv.org/abs/2505.16594)
Keywords: generation
Abstract: Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by a template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.
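The rule-plus-template procedure described above can be illustrated with a small sketch. Everything concrete here is a hypothetical assumption: the track format (per-timestep lateral offset and longitudinal distance in the ego frame, in metres, with positive lateral meaning left), the 3.5 m lane width, the 1 m motion tolerance, and the caption template.

```python
def lane_position(lateral_m, lane_width_m=3.5):
    """Map a lateral offset to a coarse lane label (positive = left of ego)."""
    if lateral_m > lane_width_m / 2:
        return "left lane"
    if lateral_m < -lane_width_m / 2:
        return "right lane"
    return "ego lane"


def relative_motion(distances_m, tol_m=1.0):
    """Describe how the distance to the ego vehicle changes over the track."""
    delta = distances_m[-1] - distances_m[0]
    if delta > tol_m:
        return "moving away from"
    if delta < -tol_m:
        return "approaching"
    return "keeping a constant distance to"


def caption(object_class, track):
    """Fill a fixed template from rule-based lane and motion labels."""
    laterals = [p[0] for p in track]
    distances = [p[1] for p in track]
    start, end = lane_position(laterals[0]), lane_position(laterals[-1])
    if start == end:
        lane_part = f"stays in the {start}"
    else:
        lane_part = f"changes from the {start} to the {end}"
    return f"A {object_class} {lane_part} while {relative_motion(distances)} the ego vehicle."


# Hypothetical LiDAR track: (lateral offset, longitudinal distance) per timestep.
example_track = [(3.6, 30.0), (2.0, 25.0), (0.2, 20.0)]
example_caption = caption("car", example_track)
```

Because the captions come from tracked geometry rather than from single frames, they encode behavior over time, which is the supervision signal the abstract argues is missing from appearance-driven captions.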

Title: MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Authors: Bohan Zhou, Yi Zhan, Zhongbin Zhang, Zongqing Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16602
Pdf URL: https://arxiv.org/pdf/2505.16602
Copy Paste: [[2505.16602]] MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation(https://arxiv.org/abs/2505.16602)
Keywords: generation
Abstract: Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.

Title: CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models

Authors: Benjamin Herdeanu, Juan Nathaniel, Carla Roesch, Jatan Buch, Gregor Ramien, Johannes Haux, Pierre Gentine
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16620
Pdf URL: https://arxiv.org/pdf/2505.16620
Copy Paste: [[2505.16620]] CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models(https://arxiv.org/abs/2505.16620)
Keywords: generation
Abstract: Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on this https URL.

Title: Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports

Authors: Francesco Dalla Serra, Patrick Schrempf, Chaoyang Wang, Zaiqiao Meng, Fani Deligianni, Alison Q. O'Neil
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.16624
Pdf URL: https://arxiv.org/pdf/2505.16624
Copy Paste: [[2505.16624]] Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports(https://arxiv.org/abs/2505.16624)
Keywords: generation
Abstract: We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence to the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.

Title: On the Out-of-Distribution Generalization of Self-Supervised Learning

Authors: Wenwen Qiang, Jingyao Wang, Zeen Song, Jiangmeng Li, Changwen Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16675
Pdf URL: https://arxiv.org/pdf/2505.16675
Copy Paste: [[2505.16675]] On the Out-of-Distribution Generalization of Self-Supervised Learning(https://arxiv.org/abs/2505.16675)
Keywords: generation
Abstract: In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during the SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the spurious variable and the label variable are mutually independent. Besides, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we demonstrate the identifiability of the latent variable model and validate the effectiveness of the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate the effectiveness of the proposed sampling strategy.

Title: Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds

Authors: Jordan Dotzel, Tony Montes, Mohamed S. Abdelfattah, Zhiru Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16679
Pdf URL: https://arxiv.org/pdf/2505.16679
Copy Paste: [[2505.16679]] Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds(https://arxiv.org/abs/2505.16679)
Keywords: generative
Abstract: Traditional methods for 3D object compression operate only on structural information within the object vertices, polygons, and textures. These methods are effective at compression rates up to 10x for standard object sizes but quickly deteriorate at higher compression rates with texture artifacts, low-polygon counts, and mesh gaps. In contrast, semantic compression ignores structural information and operates directly on the core concepts to push to extreme levels of compression. In addition, it uses natural language as its storage format, which makes it natively human-readable and a natural fit for emerging applications built around large-scale, collaborative projects within augmented and virtual reality. It deprioritizes structural information like location, size, and orientation and predicts the missing information with state-of-the-art deep generative models. In this work, we construct a pipeline for 3D semantic compression from public generative models and explore the quality-compression frontier for 3D object compression. We apply this pipeline to achieve rates as high as 105x for 3D objects taken from the Objaverse dataset and show that semantic compression can outperform traditional methods in the important quality-preserving region around 100x compression.

Title: One-Step Diffusion-Based Image Compression with Semantic Distillation

Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16687
Pdf URL: https://arxiv.org/pdf/2505.16687
Copy Paste: [[2505.16687]] One-Step Diffusion-Based Image Compression with Semantic Distillation(https://arxiv.org/abs/2505.16687)
Keywords: generation, generative
Abstract: While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.

Title: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Authors: Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16707
Pdf URL: https://arxiv.org/pdf/2505.16707
Copy Paste: [[2505.16707]] KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models(https://arxiv.org/abs/2505.16707)
Keywords: generative
Abstract: Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

Title: Masked Conditioning for Deep Generative Models

Authors: Phillip Mueller, Jannik Wiese, Sebastian Mueller, Lars Mikelsons
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16725
Pdf URL: https://arxiv.org/pdf/2505.16725
Copy Paste: [[2505.16725]] Masked Conditioning for Deep Generative Models(https://arxiv.org/abs/2505.16725)
Keywords: generation, generative
Abstract: Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally, computational resources are typically limited in practical applications, which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.
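A minimal sketch of the idea of masking conditions during training under a sparsity schedule follows. The linear schedule, its start and end probabilities, the use of `None` as the "missing" marker, and the example condition names are all illustrative assumptions; the abstract mentions several schedules without describing them.

```python
import random


def linear_sparsity(step, total_steps, p_start=0.0, p_end=0.8):
    """Hypothetical schedule: masking probability grows linearly over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def mask_conditions(conditions, mask_prob, rng):
    """Independently drop each condition with probability mask_prob.

    Dropped conditions become None; a real model would map None to a
    learned "missing" embedding rather than to a numeric value.
    """
    return {name: (None if rng.random() < mask_prob else value)
            for name, value in conditions.items()}


rng = random.Random(0)
# Mixed-type conditions: two numerical, one categorical (made-up example).
conditions = {"length_mm": 120.0, "mass_kg": 3.2, "material": "steel"}

for step in (0, 500, 1000):
    p = linear_sparsity(step, total_steps=1000)
    masked = mask_conditions(conditions, p, rng)
    print(f"step={step} mask_prob={p:.2f} masked={masked}")
```

Training on randomly thinned condition sets is what lets the same model be queried at inference time with whatever subset of conditions a user happens to know.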

Title: Forward-only Diffusion Probabilistic Models

Authors: Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16733
Pdf URL: https://arxiv.org/pdf/2505.16733
Copy Paste: [[2505.16733]] Forward-only Diffusion Probabilistic Models(https://arxiv.org/abs/2505.16733)
Keywords: restoration, generation, generative
Abstract: This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent linear stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves competitive performance on various image-conditioned (e.g., image restoration) and unconditional generation tasks, demonstrating its effectiveness in generative modelling. Our code is available at this https URL.
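The abstract describes a linear SDE with a mean-reverting term in both drift and diffusion. One form consistent with that description is dx = θ(μ - x)dt + σ(x - μ)dW, where μ plays the role of the clean data. The scalar Euler-Maruyama simulation below is a sketch under that assumed form, with constant θ and σ chosen by me for illustration; it is not the paper's training or sampling procedure.

```python
import math
import random


def simulate_fod(x0, mu, theta=2.0, sigma=0.5, dt=0.01, steps=1000, seed=0):
    """Euler-Maruyama simulation of dx = theta*(mu - x) dt + sigma*(x - mu) dW.

    Both the drift and the noise scale with the distance to mu, so the
    noise vanishes as the state approaches the target.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x = x + theta * (mu - x) * dt + sigma * (x - mu) * dw
        path.append(x)
    return path


# Start far from the target and let the process run forward.
path = simulate_fod(x0=-3.0, mu=1.0)
final_gap = abs(path[-1] - 1.0)
```

Because the diffusion coefficient is proportional to x - μ, the process is stochastic along the way yet settles on μ, which is the behavior the abstract refers to as guaranteed convergence to clean data.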

Title: Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning

Authors: Jian Liu, Jing Xu, Song Guo, Jing Li, Jingfeng Guo, Jiaao Yu, Haohan Weng, Biwen Lei, Xianghui Yang, Zhuo Chen, Fangqi Zhu, Tao Han, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16761
Pdf URL: https://arxiv.org/pdf/2505.16761
Copy Paste: [[2505.16761]] Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning(https://arxiv.org/abs/2505.16761)
Keywords: generation
Abstract: Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present Mesh-RFT, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6% and improves Topology Score (TS) by 3.8% over pre-trained models, while outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: this https URL.
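The abstract names Boundary Edge Ratio (BER) as a metric without defining it. A natural reading, assumed here, is the fraction of mesh edges that belong to exactly one face, so that a watertight mesh scores 0. The sketch below computes that quantity from a face list; the paper's exact definition may differ.

```python
from collections import Counter


def boundary_edge_ratio(faces):
    """Fraction of unique edges that are used by exactly one face.

    faces is a list of vertex-index tuples (triangles or general polygons).
    This definition is an assumption for illustration.
    """
    edge_counts = Counter()
    for face in faces:
        for i in range(len(face)):
            a, b = face[i], face[(i + 1) % len(face)]
            edge_counts[(min(a, b), max(a, b))] += 1  # undirected edge key
    boundary = sum(1 for count in edge_counts.values() if count == 1)
    return boundary / len(edge_counts)


# Closed tetrahedron: every edge is shared by two faces.
tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
# Two triangles sharing one edge: four of five edges lie on the boundary.
open_patch = [(0, 1, 2), (1, 3, 2)]

ber_closed = boundary_edge_ratio(tetrahedron)
ber_open = boundary_edge_ratio(open_patch)
```

A score like this can be computed per object or restricted to the edges of a single face, which is the kind of locality a face-level reward would need.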

Title: Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16763
Pdf URL: https://arxiv.org/pdf/2505.16763
Copy Paste: [[2505.16763]] Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation(https://arxiv.org/abs/2505.16763)
Keywords: generation
Abstract: Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.

Title: Learning Flexible Forward Trajectories for Masked Molecular Diffusion

Authors: Hyunjin Seo, Taewon Kim, Sihyun Yu, SungSoo Ahn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16790
Pdf URL: https://arxiv.org/pdf/2505.16790
Copy Paste: [[2505.16790]] Learning Flexible Forward Trajectories for Masked Molecular Diffusion(https://arxiv.org/abs/2505.16790)
Keywords: generation
Abstract: Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standard MDMs severely degrades performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapses into a common state, resulting in a mixture of reconstruction targets that cannot be learned using a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD) that orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
摘要：蒙版扩散模型（MDMS）在建模离散数据方面取得了显着的进展，而它们在分子生成中的潜力仍然没有得到充实。在这项工作中，我们探索了它们的潜力，并引入了令人惊讶的结果，即天真地使用标准MDM会严重降低表现。我们将这个问题的关键原因确定为一个局限性的问题 - 在这种问题上，不同分子的正向扩散崩溃了，导致了重建靶标的混合物，这些靶标混合了无法使用典型的反向扩散过程和单峰预测来学习的。为了减轻这种情况，我们提出了掩盖元素的可学习扩散（MELD），该扩散（MELD）协调每个元素的损坏轨迹，以避免不同的分子图之间的碰撞。这是通过参数化的噪声调度网络来实现的，该网络将不同的损坏率分配给单个图元素，即原子和债券。对各种分子基准测试的广泛实验表明，与元素 - 不合稳定噪声调度相比，MELD显着提高了整体发电质量，从而将锌250K的香草MDMS的化学有效性从15％提高到93％，从15％提高到93％，此外，它可以在条件生成任务中实现最新的现有财产一致性。
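
To make the idea of per-element corruption rates in the MELD entry above concrete, here is a minimal Python sketch. It is not the authors' method: the abstract says a learned network assigns the rates, whereas this sketch assumes a hypothetical fixed exponent `gamma` per element and a hypothetical schedule `t ** gamma`. It only illustrates how distinct elements can follow distinct masking trajectories instead of one shared rate.

```python
import random


def mask_probability(t: float, gamma: float) -> float:
    """Probability that an element is masked at diffusion time t in [0, 1].

    Assumption: a power-law schedule t ** gamma. gamma = 1 is the shared,
    element-agnostic linear schedule; other values shift the trajectory.
    """
    return t ** gamma


def corrupt(elements, gammas, t, rng, mask_token="[MASK]"):
    """Mask each graph element independently using its own schedule."""
    return [
        mask_token if rng.random() < mask_probability(t, g) else e
        for e, g in zip(elements, gammas)
    ]


# Hypothetical atoms and bonds of a small molecular graph.
elements = ["C", "N", "O", "single", "double"]
# Hypothetical per-element exponents (a network would predict these in MELD).
gammas = [1.0, 3.0, 0.5, 2.0, 1.0]

rng = random.Random(0)
print([round(mask_probability(0.5, g), 3) for g in gammas])
print(corrupt(elements, gammas, t=0.5, rng=rng))
```

With a shared exponent, two molecules that differ in a single atom are equally likely to lose that atom at every step; with per-element exponents, the schedule can keep distinguishing elements unmasked for longer.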

Title: Cohort-Based Active Modality Acquisition

Authors: Tillmann Rheude, Roland Eils, Benjamin Wild
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16791
Pdf URL: https://arxiv.org/pdf/2505.16791
Copy Paste: [[2505.16791]] Cohort-Based Active Modality Acquisition(https://arxiv.org/abs/2505.16791)
Keywords: generative
Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored despite their importance in many real-world settings. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on common multimodal datasets demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of new samples in comparison to those relying solely on unimodal information, entropy guidance, and random selections. Our work provides an effective solution for optimizing modality acquisition at the cohort level, enabling better utilization of resources in constrained settings.
摘要：现实世界的机器学习应用程序通常涉及来自多种模式的数据，这些数据必须有效地集成以做出强大的预测。但是，在许多实际设置中，并非每个样本都可以使用所有模式，而获取其他方式可能会很昂贵。这提出了一个问题：在资源有限时，应优先考虑哪些样本以获取其他方式？虽然先前的工作探讨了个人级别的获取策略和培训时间积极学习范例，但尽管在许多现实世界中，尽管它们在许多现实世界中的重要性，但基于测试时间和基于队列的收购仍然没有受到重视。我们介绍了基于队列的主动模式采集（CAMA），这是一种新型的测试时间设置，以正式化选择哪些样品应获得其他方式的挑战。我们得出了采集策略，这些策略利用了生成归档和判别建模的结合，以估计基于共同评估指标获得缺失模式的预期益处。我们还介绍了上限的启发式方法，为基准获取策略提供性能上限。对共同多模式数据集进行的实验表明，与仅依靠单峰信息，熵指导和随机选择的人相比，我们提出的基于插补的策略可以更有效地指导新样本的获取。我们的工作提供了一个有效的解决方案，可以在队列级别优化模态获取，从而更好地利用受限的设置。

Title: REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

Authors: Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16792
Pdf URL: https://arxiv.org/pdf/2505.16792
Copy Paste: [[2505.16792]] REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training(https://arxiv.org/abs/2505.16792)
Keywords: generative
Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs a one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256x256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28x reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, proving to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL.
摘要：扩散变压器（DITS）提供了最先进的图像质量，但他们的训练仍然很慢。最近的一种补救措施 - 代表对齐（REPA），将DIT隐藏的特征与非生成教师（例如Dino）的特征相匹配 - 急剧加速了早期的时代，但高原甚至在以后的表现降低了表现。我们将这种失败追踪到容量不匹配：一旦生成学生开始建模联合数据分布，教师的较低维度嵌入和注意力模式就会成为纹身夹克，而不是指南。然后，我们介绍Haste（与阶段终止的整体统一，以进行有效的培训），这是一项两阶段的时间表，可以保持帮助并放弃障碍。第一阶段应用了整体对齐损失，同时将注意图（关系启示）和特征投影（语义锚）从教师蒸发到DIT的中层层，从而产生了快速的收敛。然后，第二阶段执行单发终止，以停用对齐损失，一旦命中了一个简单的触发器（例如固定迭代），将DIT释放，以专注于去索尼和利用其生成能力。急速加快了对不同dit的训练，而没有建筑变化。在ImageNet 256x256上，它在50个时期达到了香草Sit-XL/2基线FID，并匹配Repa在500个时期的最佳FID，相当于优化步骤的28倍降低。 HARTE还改善了MS-Coco上的文本对图像的位置，这证明是一种简单而有原则的食谱，用于跨各种任务进行有效的扩散训练。我们的代码可在此HTTPS URL上找到。
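
The two-phase schedule described in the HASTE entry above can be sketched in a few lines of Python. This is a minimal illustration, not the released code: the function names, the equal weighting of the two alignment terms, and the numeric loss values are assumptions. Only the structure (alignment active in Phase I, switched off in one shot at a fixed iteration) comes from the abstract.

```python
def alignment_weight(step: int, stop_step: int, base_weight: float = 1.0) -> float:
    """Phase I (step < stop_step): alignment on. Phase II: alignment off."""
    return base_weight if step < stop_step else 0.0


def total_loss(denoise_loss: float, attn_align_loss: float,
               feat_align_loss: float, step: int, stop_step: int) -> float:
    """Denoising loss plus the holistic alignment terms while they are active.

    Assumption: the attention-map and feature-projection terms are summed
    with equal weight; the abstract does not state the weighting.
    """
    w = alignment_weight(step, stop_step)
    return denoise_loss + w * (attn_align_loss + feat_align_loss)


# Hypothetical loss values at two training steps around the trigger.
print(total_loss(0.5, 0.2, 0.3, step=100, stop_step=1000))   # Phase I
print(total_loss(0.5, 0.2, 0.3, step=1000, stop_step=1000))  # Phase II
```

A hard switch keeps the trigger trivial to implement; a gradual decay would be a different design and is not what the abstract describes.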

Title: V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Authors: Hanyue Lou, Jinxiu Liang, Minggui Teng, Yi Wang, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16797
Pdf URL: https://arxiv.org/pdf/2505.16797
Copy Paste: [[2505.16797]] V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation(https://arxiv.org/abs/2505.16797)
Keywords: generation
Abstract: Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150-fold reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours, an order of magnitude larger than existing event datasets, yielding substantial improvements.
摘要：基于事件的相机具有独特的优势，例如高时间分辨率，高动态范围和低功耗。但是，现有的合成数据生成管道的巨大存储要求和I/O负担以及实际数据的稀缺性阻止了基于事件的培训数据集扩展，从而限制了事件视觉模型的开发和概括能力。为了应对这一挑战，我们介绍了视频对素（V2V），这种方法将传统的视频框架直接转换为基于事件的素voxel网格表示形式，从而完全绕开了存储密集型事件流的生成。 V2V可以减少存储要求的150倍，同时支持即时参数随机化以增强模型鲁棒性。利用这一效率，我们在10,000个不同的视频上训练了几次视频重建和光流估计模型架构，总计52小时 - 比现有事件数据集大的数量级，从而获得了实质性改进。

Title: A modular framework for automated evaluation of procedural content generation in serious games with deep reinforcement learning agents

Authors: Eleftherios Kalafatis, Konstantinos Mitsis, Konstantia Zarkogianni, Maria Athanasiou, Konstantina Nikita
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16801
Pdf URL: https://arxiv.org/pdf/2505.16801
Copy Paste: [[2505.16801]] A modular framework for automated evaluation of procedural content generation in serious games with deep reinforcement learning agents(https://arxiv.org/abs/2505.16801)
Keywords: generation
Abstract: Serious Games (SGs) are nowadays shifting focus to include procedural content generation (PCG) in the development process as a means of offering a personalized and enhanced player experience. However, developing a framework to assess the impact of PCG techniques when integrated into SGs remains particularly challenging. This study proposes a methodology for automated evaluation of PCG integration in SGs, incorporating deep reinforcement learning (DRL) game testing agents. To validate the proposed framework, a previously introduced SG featuring card game mechanics and incorporating three different versions of PCG for nonplayer character (NPC) creation has been deployed. Version 1 features random NPC creation, while Versions 2 and 3 utilize a genetic algorithm approach. These versions are used to test the impact of different dynamic SG environments on the proposed framework's agents. The obtained results highlight the superiority of the DRL game testing agents trained on Versions 2 and 3 over those trained on Version 1 in terms of win rate (i.e. number of wins per game played) and training time. More specifically, in a test emulating regular gameplay, both Versions 2 and 3 peaked at a 97% win rate and achieved statistically significantly higher (p=0009) win rates than Version 1, which peaked at 94%. Overall, the results support the proposed framework's capability to produce meaningful data for the evaluation of procedurally generated content in SGs.
摘要：如今，认真的游戏（SG）已成为转移的重点，以在开发过程中包括程序内容（PCG），以作为提供个性化和增强的玩家体验的手段。但是，在整合到SGS中时，开发了评估PCG技术影响的框架仍然特别具有挑战性。这项研究提出了一种用于自动评估SG中PCG集成的方法，并结合了深钢筋学习（DRL）游戏测试剂。为了验证所提出的框架，已经部署了一种先前引入的具有纸牌游戏机制的SG，并结合了三种不同版本的非玩家角色（NPC）创建版本。版本1具有随机的NPC创建，而版本2和3使用遗传算法方法。这些版本用于测试不同动态SG环境对拟议框架代理的影响。获得的结果突出了DRL游戏测试代理在版本2和3上的优越性，而在版本1上，就获胜率（即每场比赛的胜利数）和训练时间而言。更具体地说，在模拟常规游戏玩法的测试的执行中，两个版本2和3以97％的胜利率达到顶峰，并且与在94％峰值达到94％的版本1中获得的统计学意义率更高（P = 0009）。总体而言，结果倡导拟议框架的能力，可以生成有意义的数据，以评估SGS中的程序生成的内容。

Title: Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining

Authors: Shangquan Sun, Wenqi Ren, Juxiang Zhou, Shu Wang, Jianhou Gan, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16811
Pdf URL: https://arxiv.org/pdf/2505.16811
Copy Paste: [[2505.16811]] Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining(https://arxiv.org/abs/2505.16811)
Keywords: restoration
Abstract: Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.
摘要：在过去的十年中，视频恢复在视频恢复中取得了重大进展，这在很大程度上是由于深度学习的进步所推动的。然而，依赖配对数据的现有方法努力将有效地推广到现实情况下，这主要是由于合成和真实的降雨效应之间的差异。为了解决这些局限性，我们提出了一个双分支时空的状态空间模型，以增强视频序列中的降雨脱落。具体而言，我们设计了空间和时间状态空间模型层，以提取空间特征并分别在框架上纳入时间依赖性。为了改善多帧特征融合，我们得出了动态堆叠过滤器，该滤波器可适应地近似于统计过滤器，以进行出色的像素特征细化。此外，我们通过基于降雨的稀疏性产生伪清洁贴片来产生伪清洁贴片，从而产生中位数堆叠损失，以实现半监督学习。为了进一步探索模型在支持多雨环境中其他基于视觉的任务方面的能力，我们引入了一种新型的现实世界基准，该基准专注于在雨天条件下对象检测和跟踪。我们的方法在包含许多合成和现实世界的多雨视频的多个基准上进行了广泛的评估，始终证明其在定量指标，视觉质量，效率，效率以及对下游任务的实用性方面的优势。
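
The sparsity prior behind the median stacking idea in the deraining entry above can be illustrated with a plain temporal median. This is a minimal sketch under stated assumptions: frames are already aligned, pixels are grayscale integers, and `median_stack` is a hypothetical helper. The paper's actual median stacking loss and its dynamic stacking filter are not reproduced here.

```python
from statistics import median


def median_stack(frames):
    """Per-pixel temporal median over a list of equally sized 2D frames.

    Because a rain streak covers any given pixel in only a few frames,
    the median tends to recover the background value at that pixel.
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical 1x3 frames; bright values (250, 255) mimic rain streaks.
frames = [
    [[10, 20, 30]],
    [[10, 255, 30]],
    [[250, 20, 30]],
]
print(median_stack(frames))  # → [[10, 20, 30]]
```

A pseudo-clean patch produced this way could serve as a training target on unlabeled real video, which is the role the abstract assigns to the median stacking loss.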

Title: Perceptual Quality Assessment for Embodied AI

Authors: Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.16815
Pdf URL: https://arxiv.org/pdf/2505.16815
Copy Paste: [[2505.16815]] Perceptual Quality Assessment for Embodied AI(https://arxiv.org/abs/2505.16815)
Keywords: quality assessment
Abstract: Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) construct a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and define a comprehensive subjective score collection process; (2) establish the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5m fine-grained annotations provided by Vision Language Models/Vision Language Action-models/real-world robots; (3) train and validate mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the real world. Project page: this https URL
摘要：近年来，体现的AI已经迅速发展，但仍主要部署在实验室中，现实世界中的各种扭曲都限制了其应用。传统上，图像质量评估（IQA）方法用于预测人类对扭曲图像的偏好；但是，没有IQA方法可以评估体现任务中图像的可用性，即机器人的感知质量。为了为未来的体现场景提供准确可靠的质量指标，我们首先提出了该主题：IQA用于体现的AI。具体而言，我们（1）基于梅尔顿系统和元认知理论，构建了一个感知认知决策 - 执行管道，并定义了全面的主观得分收集过程；（2）建立了包含36K参考/扭曲的图像对的体现的IQA数据库，并具有超过5m的细颗粒注释，由视觉语言模型/视觉语言动作模型/现实世界机器人提供；（3）训练和验证了体现IQA主流IQA方法的性能，表明有必要为体现的AI开发更准确的质量指标。我们衷心希望，通过评估，我们可以在现实世界中促进体现AI的应用。项目页面：此HTTPS URL

Title: Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts

Authors: Taewon Kang, Ming C. Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16819
Pdf URL: https://arxiv.org/pdf/2505.16819
Copy Paste: [[2505.16819]] Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts(https://arxiv.org/abs/2505.16819)
Keywords: generation
Abstract: Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
摘要：基于场景的视频生成的最新进展使系统能够从结构化提示中综合相干的视觉叙事。然而，讲故事的关键维度 - 角色驱动的对话和言语 - 仍然没有被忽视。在本文中，我们提出了一条模块化管道，该管道将动作级的提示转换为视觉和听觉上扎根的叙事对话，以自然的声音和性格表达丰富了视觉讲故事。我们的方法将输入作为每个场景的一对提示，其中第一个定义设置，而第二个则指定角色的行为。诸如Text2Story之类的故事生成模型会生成相应的视觉场景，但我们专注于从这些提示和场景图像中产生表达性角色话语。我们应用了预处理的视觉语言编码器，从代表性框架中提取高级语义特征，从而捕获显着的视觉上下文。然后将此功能与结构化提示结合使用，并用于指导大型语言模型，以综合自然，符合性格的对话。为了确保跨场景的上下文一致性，我们介绍了一个递归叙事银行，该银行将每次对话的生成都在以前的场景中累积的对话历史记录。这种方法使角色能够以反映其整个故事中不断发展的目标和互动的方式讲话。最后，我们将每种话语作为表达，角色一致的演讲，从而产生了完全发声的视频叙述。我们的框架不需要额外的培训，并且在各种故事设置中展示了从幻想冒险到寿命片段的适用性。

Title: LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Authors: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16839
Pdf URL: https://arxiv.org/pdf/2505.16839
Copy Paste: [[2505.16839]] LaViDa: A Large Diffusion Language Model for Multimodal Understanding(https://arxiv.org/abs/2505.16839)
Keywords: generation
Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves performance competitive with or superior to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
摘要：现代视觉模型（VLM）可以解决需要视觉推理的各种任务。在实际情况下，VLM的理想属性包括快速推理和可控的生成（例如，将输出限制为遵守所需格式）。但是，像Llava这样的现有自回家（AR）VLM在这些方面挣扎。离散扩散模型（DMS）提供了一种有希望的替代方案，可以实现并行解码，以更快的推理和双向上下文，以通过文本注入来控制可控的生成。虽然在仅语言设置方面有效，但DMS的多模式任务潜力却没有充满反感。我们介绍了建立在DMS上的VLM家族Lavida。我们通过为DMS配备视觉编码器并共同微调组合零件以进行多模式指令来建立Lavida。为了应对遇到的挑战，Lavida结合了新的技术，例如用于有效训练的互补掩蔽，前缀KV缓存以进行有效的推理，以及用于高质量抽样的时间段转移。实验表明，Lavida在MMMU等多模式基准上实现了与AR VLM的竞争性或卓越性能，同时提供了DMS的独特优势，包括灵活的速度质量折衷，可控性和双向推理。在可可字幕上，拉维达（Lavida）超过了+4.1苹果酒的开放式闭合 - 隔壁8B，速度为1.92倍。在双向任务上，它在诗歌完成时取得了 +59％的提高。这些结果证明了Lavida是AR VLM的强大替代方法。代码和型号将在相机就绪版本中发布。
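
The LaViDa entry above names complementary masking without describing it. One plausible reading, offered here only as an assumption, is that each training sequence is masked twice with disjoint masks whose union covers every token, so that no token is left without a training signal. The helper name `complementary_masks` and the mask ratio are hypothetical.

```python
import random


def complementary_masks(num_tokens: int, mask_ratio: float, rng: random.Random):
    """Return two boolean masks (True = masked) that partition the positions.

    Assumed reading of "complementary masking": a token masked in the first
    copy is visible in the second copy, and vice versa.
    """
    positions = list(range(num_tokens))
    rng.shuffle(positions)
    cut = int(round(num_tokens * mask_ratio))
    masked_first = set(positions[:cut])
    mask_a = [i in masked_first for i in range(num_tokens)]
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b


rng = random.Random(0)
mask_a, mask_b = complementary_masks(num_tokens=8, mask_ratio=0.25, rng=rng)
print(mask_a)
print(mask_b)
```

Under this reading, every answer token contributes to the loss in exactly one of the two copies, regardless of how the random mask falls.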

Title: Redefining Clustered Federated Learning for System Identification: The Path of ClusterCraft

Authors: Ertuğrul Keçeci, Müjde Güzelkaya, Tufan Kumbasar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16857
Pdf URL: https://arxiv.org/pdf/2505.16857
Copy Paste: [[2505.16857]] Redefining Clustered Federated Learning for System Identification: The Path of ClusterCraft(https://arxiv.org/abs/2505.16857)
Keywords: generation
Abstract: This paper addresses the System Identification (SYSID) problem within the framework of federated learning. We introduce a novel algorithm, the Incremental Clustering-based federated learning method for SYSID (IC-SYSID), designed to tackle SYSID challenges across multiple data sources without prior knowledge. IC-SYSID utilizes an incremental clustering method, ClusterCraft (CC), to eliminate the dependency on prior knowledge of the dataset. CC starts with a single cluster model and assigns similar local workers to the same clusters by dynamically increasing the number of clusters. To reduce the number of clusters generated by CC, we introduce ClusterMerge, where similar cluster models are merged. We also introduce enhanced ClusterCraft to reduce the generation of similar cluster models during training. Moreover, IC-SYSID addresses cluster model instability by integrating a regularization term into the loss function and initializing cluster models with scaled Glorot initialization. It also utilizes a mini-batch deep learning approach to manage large SYSID datasets during local training. Through experiments conducted on a representative real-world SYSID problem, where a fleet of vehicles collaboratively learns vehicle dynamics, we show that IC-SYSID achieves high SYSID performance while preventing the learning of unstable clusters.
摘要：本文解决了联合学习框架内的系统识别（SYSID）问题。我们介绍了一种新型算法，基于增量聚类的联合学习方法（IC-SYSID），旨在解决跨多个数据源的SYSID挑战，而没有事先知识。 IC-SysID利用一种增量聚类方法clustercraft（CC）消除了对数据集知识的依赖性。 CC从单个集群模型开始，并通过动态增加簇数来为同一集群分配相似的本地工人。为了减少CC生成的群集数量，我们引入了ClusterMerge，其中合并了类似的群集模型。我们还引入了增强的群集，以减少培训期间类似集群模型的产生。此外，IC-SYSID通过将正则化项集成到损耗函数中并以缩放Glorot初始化初始化群集模型来解决群集模型的不稳定性。它还利用一种迷你批次深度学习方法来管理本地培训期间大型SYSID数据集。通过对代表Sysid问题的现实世界进行的实验，其中一组车队协作学习了车辆动力学，我们表明IC-Sysid在阻止学习不稳定的群集的同时取得了高的Sysid性能。
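
The incremental clustering behaviour described in the ClusterCraft entry above (start small, add clusters only when a worker fits none of the existing ones) can be sketched as follows. This is not the published algorithm: the Euclidean distance on parameter vectors, the fixed `threshold`, and the choice to keep each center at its first member are all assumptions made for illustration.

```python
def incremental_assign(worker_params, threshold):
    """Assign each worker to the nearest cluster, or open a new cluster.

    worker_params: list of parameter vectors (one per local worker).
    threshold: assumed maximum distance for joining an existing cluster.
    Returns (centers, assignments).
    """
    centers, assignments = [], []
    for params in worker_params:
        best_idx, best_dist = None, None
        for idx, center in enumerate(centers):
            dist = sum((a - b) ** 2 for a, b in zip(params, center)) ** 0.5
            if best_dist is None or dist < best_dist:
                best_idx, best_dist = idx, dist
        if best_idx is None or best_dist > threshold:
            # No sufficiently similar cluster: grow the number of clusters.
            centers.append(list(params))
            assignments.append(len(centers) - 1)
        else:
            assignments.append(best_idx)
    return centers, assignments


# Hypothetical 2D model parameters from four vehicles.
workers = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
centers, assignments = incremental_assign(workers, threshold=1.0)
print(assignments)  # → [0, 0, 1, 1]
```

A merging pass over `centers`, in the spirit of the ClusterMerge step the abstract mentions, would then combine clusters that ended up close to one another.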

Title: GCAL: Adapting Graph Models to Evolving Domain Shifts

Authors: Ziyue Qiao, Qianyi Cai, Hao Dong, Jiawei Gu, Pengyang Wang, Meng Xiao, Xiao Luo, Hui Xiong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16860
Pdf URL: https://arxiv.org/pdf/2505.16860
Copy Paste: [[2505.16860]] GCAL: Adapting Graph Models to Evolving Domain Shifts(https://arxiv.org/abs/2505.16860)
Keywords: generation
Abstract: This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The "adapt" phase uses an information maximization approach to fine-tune the model with new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, involves a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
摘要：本文解决了图形域对不断发展的多个分布（OOD）图的挑战。常规的图形域适应方法仅限于单步适应，使它们无效地处理连续域移动和容易遭受灾难性遗忘。本文介绍了图形持续自适应学习（GCAL）方法，旨在增强各种图形域的模型可持续性和适应性。 GCAL采用了双重优化策略。 “ Adapt”阶段使用信息最大化方法，用新的图形域微调模型，同时重新调整过去的记忆以减轻遗忘。同时，“生成内存”阶段以从信息瓶颈理论得出的理论下限引导，涉及变异记忆图生成模块，以将原始图凝结到记忆中。广泛的实验评估表明，在适应性和知识保留方面，GCAL显然优于现有方法。

Title: Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Authors: Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16862
Pdf URL: https://arxiv.org/pdf/2505.16862
Copy Paste: [[2505.16862]] Conditional Panoramic Image Generation via Masked Autoregressive Modeling(https://arxiv.org/abs/2505.16862)
Keywords: generation, generative
Abstract: Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas because the spherical mapping violates the independent and identically distributed (i.i.d.) Gaussian noise assumption. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
摘要：全景图的最新进展突显了现有方法的两个关键局限性。首先，大多数方法都是基于扩散模型的，由于违反了相同和独立分布的（i.i.d.）高斯噪声假设，它们固有地适合于等应角投影（ERP）全景。其次，这些方法通常将文本条件的生成（文本对式）和图像条件的生成（全景支出）视为单独的任务，依靠不同的架构和特定于任务的数据。在这项工作中，我们提出了一个统一的框架，全景自回归模型（PAR），该模型利用掩盖的自回旋建模来应对这些挑战。 par避免了I.I.D.假设约束并将文本和图像调节整合到凝聚力的架构中，从而使跨任务无缝生成。为了解决现有生成模型中固有的不连续性，我们引入圆形填充以增强空间连贯性，并提出一种一致性一致性策略来提高发电质量。广泛的实验表明，在文本到图像生成和全景图上的竞争性能，同时展示了有希望的可扩展性和概括能力。
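
The circular padding mentioned in the PAR entry above relies on the fact that the left and right edges of an equirectangular panorama are adjacent on the sphere. The sketch below wraps columns of a 2D grid accordingly. It is an illustration only: the abstract does not say at which stage (pixels, latents, or tokens) PAR applies the padding, and `circular_pad_width` is a hypothetical helper.

```python
def circular_pad_width(image, pad):
    """Wrap-pad a 2D grid horizontally so the left and right edges meet.

    image: list of rows (each a list of values); pad: columns added per side.
    Assumes 0 <= pad <= width.
    """
    if pad == 0:
        return [list(row) for row in image]
    return [list(row[-pad:]) + list(row) + list(row[:pad]) for row in image]


panorama_row = [[1, 2, 3, 4]]
print(circular_pad_width(panorama_row, 1))  # → [[4, 1, 2, 3, 4, 1]]
```

With this padding, a convolution or attention window at the left border sees the content from the right border, which is what removes the seam.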

Title: Training-Free Efficient Video Generation via Dynamic Token Carving

Authors: Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16864
Pdf URL: https://arxiv.org/pdf/2505.16864
Copy Paste: [[2505.16864]] Training-Free Efficient Video Generation via Dynamic Token Carving(https://arxiv.org/abs/2505.16864)
Keywords: generation
Abstract: Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83x speedup with a 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL
摘要：尽管视频扩散变压器（DIT）模型具有显着的产生质量，但其实际部署受到广泛的计算要求的严重阻碍。这种效率低下源于两个关键挑战：自我注意的二次复杂性相对于令牌长度和扩散模型的多步骤性质。为了解决这些局限性，我们提出了Jenga，这是一种新型的推理管道，将动态注意力雕刻与进行性分辨率产生相结合。我们的方法利用了两个关键的见解：（1）早期的脱氧步骤不需要高分辨率的潜伏期，并且（2）后来的步骤不需要密切关注。 Jenga引入了一种宽阔的注意机制，该机制使用3D空间填充曲线动态选择相关的令牌相互作用，以及一种逐步分辨率策略，该策略逐渐增加了生成期间的潜在分辨率。实验结果表明，Jenga在多个最先进的视频扩散模型中实现了可比的发电质量（8.83 $ \ times $速度，VBENCH上的性能下降为0.01 \％）。作为即插即用的解决方案，Jenga通过将推理时间从几分钟减少到几秒钟，在现代硬件上实现了实用，高质量的视频生成 - 而无需模型重新训练。代码：此HTTPS URL

Title: DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Authors: Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16915
Pdf URL: https://arxiv.org/pdf/2505.16915
Copy Paste: [[2505.16915]] DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?(https://arxiv.org/abs/2505.16915)
Keywords: generation
Abstract: While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with the long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Explicit Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely ~50% accuracy in key dimensions like attribute binding and spatial reasoning, while all models show progressive performance degradation as prompt length increases. Our analysis highlights systemic failures in structural comprehension and detail overload handling, motivating future research into architectures with enhanced compositional reasoning. We open-source the dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable broad applications that would otherwise be infeasible due to the lack of a dedicated benchmark.
摘要：虽然最近的文本对图像（T2I）模型在简短描述中综合图像中表现出令人印象深刻的功能，但在面对专业应用所需的长而细节密集的提示时，它们的性能会大大降低。我们介绍了细节师，这是第一个专门设计用于评估T2I模型的系统能力以处理包含复杂组成要求的扩展文本输入的系统能力的综合基准。我们的基准测试引入了四个关键评估维度：字符属性，结构化字符位置，多维场景属性以及显式的空间/交互式关系。该基准包括漫长而细节的富裕提示，平均为284.89代币，高质量由专家注释验证。对7种通用和5个长期优化的T2I模型的评估揭示了关键的绩效限制：最新模型仅在关键维度（例如属性结合和空间推理）中实现了〜50％的精度，而所有模型都显示出逐步降级的逐步降级，这是迅速的长度增加。我们的分析强调了结构理解和细节超负荷处理中的系统性失败，以增强的构图推理激发未来对体系结构的研究。我们开源数据集，数据策划代码和评估工具，以推动详细的T2I生成详细信息，并启用广泛的应用程序，由于缺乏专用基准，否则这些应用程序是不可行的。

Title: Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype

Authors: Nikola Tankovic, Robert Sajina
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16918
Pdf URL: https://arxiv.org/pdf/2505.16918
Copy Paste: [[2505.16918]] Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype(https://arxiv.org/abs/2505.16918)
Keywords: generation
Abstract: This paper presents a concise review of Contextual Multi-Armed Bandit (CMAB) methods and introduces an experimental framework for scalable, interpretable offer selection, addressing the challenge of fast-changing offers. The approach models context at the product category level, allowing offers to span multiple categories and enabling knowledge transfer across similar offers. This improves learning efficiency and generalization in dynamic environments. The framework extends standard CMAB methodology to support multi-category contexts, and achieves scalability through efficient feature engineering and modular design. Advanced features such as MPG (Member Purchase Gap) and MF (Matrix Factorization) capture nuanced user-offer interactions, with implementation in Python for practical deployment. A key contribution is interpretability at scale: logistic regression models yield transparent weight vectors, accessible via a large language model (LLM) interface for real-time, user-level tracking and explanation of evolving preferences. This enables the generation of detailed member profiles and identification of behavioral patterns, supporting personalized offer optimization and enhancing trust in automated decisions. By situating our prototype alongside established paradigms like Generalized Linear Models and Thompson Sampling, we demonstrate its value for both research and real-world CMAB applications.
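Summary: This paper presents a concise review of Contextual Multi-Armed Bandit (CMAB) methods and introduces an experimental framework for scalable, interpretable offer selection that addresses the challenge of fast-changing offers. The approach models context at the product category level, which lets offers span multiple categories and transfers knowledge across similar offers. This improves learning efficiency and generalization in dynamic environments. The framework extends standard CMAB methodology to support multi-category contexts and achieves scalability through efficient feature engineering and modular design. Advanced features such as MPG (Member Purchase Gap) and MF (Matrix Factorization) capture nuanced user-offer interactions, and the framework is implemented in Python for practical deployment. A key contribution is interpretability at scale: logistic regression models yield transparent weight vectors, which are accessible through a large language model (LLM) interface for real-time, user-level tracking and explanation of evolving preferences. This enables detailed member profiles and the identification of behavioral patterns, supporting personalized offer optimization and strengthening trust in automated decisions. By situating the prototype alongside established paradigms such as Generalized Linear Models and Thompson Sampling, we demonstrate its value for both research and real-world CMAB applications.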
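
To make the combination of logistic regression, transparent weight vectors, and Thompson Sampling in the abstract above concrete, here is a minimal sketch. It is not the authors' framework: the class name `LogisticTS`, the multi-hot category context, the diagonal Gaussian posterior approximation, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

class LogisticTS:
    """Toy contextual bandit: one logistic model per offer, Thompson Sampling
    over a diagonal Gaussian approximation of the weight posterior.
    (Hypothetical sketch; not the paper's implementation.)"""

    def __init__(self, n_offers, n_features, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mean = np.zeros((n_offers, n_features))      # interpretable weights
        self.precision = np.ones((n_offers, n_features))  # inverse variances

    def choose(self, context):
        # Sample one weight vector per offer, then act greedily on the sample.
        sampled = self.rng.normal(self.mean, 1.0 / np.sqrt(self.precision))
        return int(np.argmax(sampled @ context))

    def update(self, offer, context, reward, lr=0.5):
        # One online gradient step on the log-likelihood, scaled by the
        # current variance, followed by a precision (curvature) update.
        p = 1.0 / (1.0 + np.exp(-self.mean[offer] @ context))
        self.mean[offer] += lr * (reward - p) * context / self.precision[offer]
        self.precision[offer] += p * (1.0 - p) * context ** 2

# Context is assumed to be a multi-hot vector over product categories.
bandit = LogisticTS(n_offers=2, n_features=2)
context = np.array([1.0, 0.0])
offer = bandit.choose(context)
bandit.update(0, context, reward=1)  # offer 0 was accepted in category 0
```

After updates, each row of `bandit.mean` can be read directly as "how much category j raises the acceptance odds of this offer", which is the kind of transparency the abstract attributes to logistic models.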

Title: MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Authors: Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16947
Pdf URL: https://arxiv.org/pdf/2505.16947
Copy Paste: [[2505.16947]] MixAT: Combining Continuous and Discrete Adversarial Training for LLMs(https://arxiv.org/abs/2505.16947)
Keywords: generation
Abstract: Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at this https URL.
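Summary: Despite recent efforts on the safety and alignment of Large Language Models (LLMs), current adversarial attacks on frontier LLMs can still force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less well understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, which leads to reliance on continuous relaxations. Because these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete attacks with faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks and propose the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show that MixAT achieves substantially better robustness (ALO-ASR < 20%) than prior defenses (ALO-ASR > 50%) while keeping a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, which reveals additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at this https URL.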
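
The ALO-ASR metric named above can be illustrated with a short sketch. The abstract does not give a formula, so this assumes the literal reading of the name: a prompt counts as broken if at least one attack in the suite succeeds on it. The function name `alo_asr` and the toy success matrix are hypothetical.

```python
import numpy as np

def alo_asr(success):
    """At-Least-One Attack Success Rate (assumed definition).

    success[a, p] is True when attack a elicits a harmful response on prompt p.
    Returns the fraction of prompts broken by at least one attack.
    """
    success = np.asarray(success, dtype=bool)
    return float(success.any(axis=0).mean())

# Toy matrix: 3 attacks (rows) evaluated on 4 prompts (columns).
outcomes = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
per_attack_asr = np.asarray(outcomes, dtype=bool).mean(axis=1)  # 0.25 for each attack
worst_case = alo_asr(outcomes)                                  # 0.5
```

In this toy case every individual attack succeeds on only 25% of prompts, yet half of the prompts are vulnerable to some attack, which is why a union-style metric gives a more conservative picture than any single ASR.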

Title: Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models

Authors: Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16959
Pdf URL: https://arxiv.org/pdf/2505.16959
Copy Paste: [[2505.16959]] Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models(https://arxiv.org/abs/2505.16959)
Keywords: generative
Abstract: Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models were perfectly minimizing their training loss, they would just generate data belonging to their training set, i.e., memorize, as empirically found in the overparameterized regime. We revisit this view by showing that, in highly overparameterized diffusion models, generalization in natural data domains is progressively achieved during training before the onset of memorization. Our results, ranging from image to language diffusion models, systematically support the empirical law that memorization time is proportional to the dataset size. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules, where generalization corresponds to the hierarchical acquisition of deeper grammar rules as training time grows, and the generalization cost of early stopping can be characterized. We summarize these results in a phase diagram. Overall, our results support that a principled early-stopping criterion - scaling with dataset size - can effectively optimize generalization while avoiding memorization, with direct implications for hyperparameter transfer and privacy-sensitive applications.
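Summary: Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models minimized their training loss perfectly, they would only generate data from their training set, i.e., memorize, as has been found empirically in the overparameterized regime. We revisit this view by showing that, in highly overparameterized diffusion models, generalization in natural data domains is achieved progressively during training, before the onset of memorization. Our results, ranging from image to language diffusion models, systematically support the empirical law that memorization time is proportional to the dataset size. Generalization versus memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models that learn a simple probabilistic context-free grammar with random rules, where generalization corresponds to the hierarchical acquisition of deeper grammar rules as training time grows, and where the generalization cost of early stopping can be characterized. We summarize these results in a phase diagram. Overall, our results support the claim that a principled early-stopping criterion, one that scales with dataset size, can effectively optimize generalization while avoiding memorization, with direct implications for hyperparameter transfer and privacy-sensitive applications.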
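
The proportionality between memorization time and dataset size suggests a simple way to transfer a stopping point from small pilot runs to a larger dataset. The sketch below is only an illustration of that idea: the pilot numbers are made up, and the `safety` fraction and both function names are assumptions rather than anything specified in the abstract.

```python
def fit_memorization_constant(sizes, onset_steps):
    """Least-squares fit of onset_steps ≈ c * size (line through the origin)."""
    num = sum(n * t for n, t in zip(sizes, onset_steps))
    den = sum(n * n for n in sizes)
    return num / den

def early_stop_step(c, dataset_size, safety=0.5):
    """Stop at a fraction of the predicted memorization onset (assumed rule)."""
    return int(safety * c * dataset_size)

# Hypothetical pilot runs: dataset sizes and the step at which memorization began.
pilot_sizes = [1_000, 2_000, 4_000]
pilot_onsets = [5_000, 10_000, 20_000]

c = fit_memorization_constant(pilot_sizes, pilot_onsets)  # steps per training example
stop = early_stop_step(c, dataset_size=16_000)
```

How far below the predicted onset one should stop depends on the generalization cost of early stopping, which the abstract says can be characterized but does not quantify here.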

Title: Creatively Upscaling Images with Global-Regional Priors

Authors: Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16976
Pdf URL: https://arxiv.org/pdf/2505.16976
Copy Paste: [[2505.16976]] Creatively Upscaling Images with Global-Regional Priors(https://arxiv.org/abs/2505.16976)
Keywords: generation
Abstract: Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 X 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 X 4,096 and 8,192 X 8,192) with higher visual fidelity and more creative regional details.
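Summary: Contemporary diffusion models show remarkable capability in text-to-image generation but remain limited to restricted resolutions (e.g., 1,024 x 1,024). Recent advances enable tuning-free higher-resolution image generation by reusing pre-trained diffusion models and extending them through regional denoising or dilated sampling/convolutions. However, these models struggle to preserve global semantic structure and produce creative regional details at the same time in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from a given global prompt and from regional prompts estimated by a multimodal LLM. Technically, the low-frequency component of the low-resolution image is taken as a global structure prior that encourages global semantic consistency in high-resolution generation. Next, we apply regional attention control to screen the cross-attention between the global prompt and each region during regional denoising, which yields a regional attention prior that alleviates object repetition. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior that fuels the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., 4,096 x 4,096 and 8,192 x 8,192) with higher visual fidelity and more creative regional details.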

Title: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

Authors: Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16977
Pdf URL: https://arxiv.org/pdf/2505.16977
Copy Paste: [[2505.16977]] Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On(https://arxiv.org/abs/2505.16977)
Keywords: generation
Abstract: Diffusion models have shown preliminary success in virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion model. To alleviate this issue, we novelly propose to explicitly capitalize on visual correspondence as the prior to tame diffusion process instead of simply feeding the whole garment into UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in garment to the ones over target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal map of target person. The correspondence mimics the way of putting clothing on human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take the advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: this https URL.
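Summary: Diffusion models have shown preliminary success on the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets, one for implicit garment deformation and one for synthesized image generation, and has become the standard recipe for VTON. Nevertheless, preserving the shape and every detail of the given garment remains challenging because of the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly exploit visual correspondence as a prior that tames the diffusion process, instead of simply feeding the whole garment into the UNet as an appearance reference. Specifically, we interpret fine-grained appearance and texture details as a set of structured semantic points and match the semantic points rooted in the garment to those on the target person through local flow warping. These 2D points are then augmented into 3D-aware cues using the depth/normal map of the target person. The correspondence mimics the way clothing is put on a human body, and the 3D-aware cues act as semantic point matching that supervises diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate that our approach preserves garment detail strongly, as evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: this https URL.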

Title: Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16990
Pdf URL: https://arxiv.org/pdf/2505.16990
Copy Paste: [[2505.16990]] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding(https://arxiv.org/abs/2505.16990)
Keywords: generation
Abstract: In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple is even only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.
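Summary: In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset as LLaVA-NEXT with a similar training pipeline. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that a DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step and significantly reduces the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations Dimple needs is only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique from autoregressive models and demonstrate that it does not significantly affect performance on most benchmark evaluations while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to control its response precisely using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at this https URL.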
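
The abstract says confident decoding "dynamically adjusts the number of tokens generated at each step" without giving the rule. One plausible reading, sketched below, is to commit every masked position whose top probability clears a threshold and to fall back to the single most confident position when none does. The threshold rule, the `MASK` sentinel, and the toy predictor are assumptions for illustration, not Dimple's actual code.

```python
import numpy as np

MASK = -1  # sentinel for positions that are not decoded yet

def confident_decode(predict, length, threshold=0.9, max_steps=100):
    """Iteratively unmask tokens; the number committed per step varies."""
    tokens = np.full(length, MASK)
    steps = 0
    while (tokens == MASK).any() and steps < max_steps:
        probs = predict(tokens)              # shape (length, vocab)
        conf = probs.max(axis=1)
        conf[tokens != MASK] = -np.inf       # ignore already decoded positions
        accept = conf >= threshold
        if not accept.any():                 # always make progress
            accept[np.argmax(conf)] = True
        tokens[accept] = probs.argmax(axis=1)[accept]
        steps += 1
    return tokens, steps

def toy_predict(tokens):
    """Stand-in for a model: even positions are always confident, odd
    positions become confident once their left neighbour is decoded."""
    n = len(tokens)
    probs = np.empty((n, 2))
    for i in range(n):
        sure = (i % 2 == 0) or tokens[i - 1] != MASK
        p = 0.95 if sure else 0.6
        probs[i, i % 2] = p
        probs[i, 1 - i % 2] = 1.0 - p
    return probs

decoded, n_steps = confident_decode(toy_predict, length=6)
```

With this toy predictor the six tokens are decoded in two forward passes instead of six, which mirrors the kind of iteration saving the abstract reports.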

Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

Authors: Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.17011
Pdf URL: https://arxiv.org/pdf/2505.17011
Copy Paste: [[2505.17011]] Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space(https://arxiv.org/abs/2505.17011)
Keywords: generation, generative
Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
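Summary: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens to different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops the tail tokens of each block during training, and with a block causal scorer that predicts the reconstruction quality of video frames for different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming adjusts token usage given the predicted scores. This design allows sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments on video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing more scalable and token-efficient generative video modeling.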
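
The budgeted allocation step described above can be written as a small integer program: pick exactly one token count per block so that total predicted quality is maximal and total tokens stay within the budget. The sketch below solves that same program with dynamic programming rather than an ILP solver, to stay dependency-free. The score table, the candidate token counts, and the function name are hypothetical.

```python
def allocate_tokens(scores, options, budget):
    """Choose one entry of `options` per block to maximise summed score.

    scores[b][j] is the predicted quality of block b when given options[j]
    tokens. Solved exactly by dynamic programming over tokens used so far.
    """
    neg = float("-inf")
    best = [0.0] + [neg] * budget   # best[u]: max score using exactly u tokens
    choices = []
    for block in scores:
        new = [neg] * (budget + 1)
        pick = [None] * (budget + 1)
        for used, value in enumerate(best):
            if value == neg:
                continue
            for j, k in enumerate(options):
                total = used + k
                if total <= budget and value + block[j] > new[total]:
                    new[total] = value + block[j]
                    pick[total] = j
        best = new
        choices.append(pick)
    used = max(range(budget + 1), key=lambda u: best[u])
    if best[used] == neg:
        raise ValueError("budget too small for one option per block")
    allocation = []
    for pick in reversed(choices):  # backtrack from the best end state
        k = options[pick[used]]
        allocation.append(k)
        used -= k
    return allocation[::-1]

# Hypothetical scorer output: block 0 is easy (saturates early), block 1 is hard.
options = [1, 2, 4]
scores = [[0.90, 0.92, 0.93],
          [0.40, 0.70, 0.90]]
allocation = allocate_tokens(scores, options, budget=5)  # → [1, 4]
```

The easy block receives a single token while the hard block takes four, which is the content-aware behaviour the abstract describes.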

Title: When Are Concepts Erased From Diffusion Models?

Authors: Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17013
Pdf URL: https://arxiv.org/pdf/2505.17013
Copy Paste: [[2505.17013]] When Are Concepts Erased From Diffusion Models?(https://arxiv.org/abs/2505.17013)
Keywords: generation
Abstract: Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
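Summary: Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, and a variety of approaches have emerged to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To assess thoroughly whether a concept has truly been erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and an analysis of what the model generates in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.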

Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Authors: Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17017
Pdf URL: https://arxiv.org/pdf/2505.17017
Copy Paste: [[2505.17017]] Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO(https://arxiv.org/abs/2505.17017)
Keywords: generation
Abstract: Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL
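Summary: Recent advances underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments and have different strengths and weaknesses. Autoregressive image generation, which can also be interpreted as a sequential CoT reasoning process, presents challenges distinct from LLM-based CoT reasoning. These include ensuring text-image consistency, improving aesthetic image quality, and designing sophisticated reward models rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, they typically lack an in-depth analysis of the domain-specific challenges and of the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization and scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages and, crucially, that reward models with stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both in-domain and out-of-domain proficiency, deriving insights into scaling performance efficiently for each paradigm. We hope our study paves a new path for future work on more effective RL algorithms that achieve robust CoT reasoning in autoregressive image generation. Code is released at this https URL.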
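
For readers unfamiliar with the two algorithms compared above, the sketch below shows the core quantity each one optimizes, in its standard textbook form: the pairwise DPO loss and the group-normalized advantage used by GRPO. It says nothing about the specific reward models or training setup of the study, and the example rewards are made up.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy (logp_*) and
    under the frozen reference model (ref_*).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return float(np.logaddexp(0.0, -margin))  # equals -log(sigmoid(margin))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# DPO consumes offline (chosen, rejected) pairs; with no preference margin
# the loss is log(2).
loss_at_init = dpo_loss(0.0, 0.0, 0.0, 0.0)

# GRPO samples a group of images per prompt and scores each with a reward model.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # approximately [1, -1, -1, 1]
```

The contrast visible here is the one the abstract builds on: DPO needs only preference pairs and a reference model, whereas GRPO needs on-policy samples and a reward signal for every sample in the group.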

Title: GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Authors: Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.17022
Pdf URL: https://arxiv.org/pdf/2505.17022
Copy Paste: [[2505.17022]] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning(https://arxiv.org/abs/2505.17022)
Keywords: generation
Abstract: Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.
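Summary: Visual generation models have made remarkable progress in creating realistic images from text prompts, yet they struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Handling such prompts effectively requires explicit reasoning about semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building on the Generation Chain-of-Thought approach, GoT-R1 enables models to discover effective reasoning strategies autonomously, beyond predefined templates, through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and the final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state of the art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.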