2025-05-30

Title: HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Authors: Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, Yimeng Wang, Kai Yu, Wenxuan Chen, Ziwei Feng, Zijian Gong, Jianzhuang Pan, Yi Peng, Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.22705
Pdf URL: https://arxiv.org/pdf/2505.22705
Copy Paste: [[2505.22705]] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer(https://arxiv.org/abs/2505.22705)
Keywords: generation, generative
Abstract: Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-modal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model, namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the code and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, and HiDream-E1 through our project websites: this https URL and this https URL. All features can be directly experienced via this https URL.
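Abstract (translated from Chinese): Recent progress in image generative foundation models has prioritized quality, usually at the cost of higher computational complexity and inference latency. To address this trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that reaches state-of-the-art image generation quality within seconds. HiDream-I1 is built on a new sparse Diffusion Transformer (DiT) structure. It starts with a dual-stream decoupled sparse DiT design with a dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders first process image and text tokens independently. A single-stream sparse DiT structure with dynamic MoE then drives multi-modal interaction for cost-efficient image generation. To offer flexible access at different capability levels, HiDream-I1 is provided in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Going beyond typical text-to-image generation, we also adapt HiDream-I1 with additional image conditions to perform precise instruction-based editing of given images, which yields a new instruction-based image editing model, HiDream-E1. Finally, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves into a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all code and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, and HiDream-E1 through our project websites: this https URL and this https URL. All features can be tried directly via this https URL.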
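
Sketch (not from the paper): the abstract mentions a dynamic Mixture-of-Experts architecture but does not describe the routing rule. The Python below assumes plain top-k softmax gating, a common MoE scheme, to show why a sparse model only pays for a few experts per token. The names `route_token` and `moe_forward` and the toy experts are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(token)
    for idx, w in route_token(router_logits, k):
        y = experts[idx](token)  # unselected experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Four toy "experts" (assumption): each scales the token by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
selected = route_token([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward([1.0, 1.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

With k=2 of 4 experts, each token triggers half of the expert computation while the parameter count stays that of all four experts.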

Title: Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

Authors: Yuxi Zhang, Yueting Li, Xinyu Du, Sibo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22792
Pdf URL: https://arxiv.org/pdf/2505.22792
Copy Paste: [[2505.22792]] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization(https://arxiv.org/abs/2505.22792)
Keywords: generation
Abstract: Generating images from rhetorical languages remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLM) fail to generate images based on the hidden meaning inherent in rhetorical language--despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation towards their literal visuals and overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o, Grok-3 and leading academic baselines across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.
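Abstract (translated from Chinese): Generating images from rhetorical language remains a key challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLMs) fail to generate images that reflect the hidden meaning of rhetorical language, although humans readily map such content to visual representations. A key limitation is that current models emphasize object-level word-embedding alignment, so metaphorical expressions steer generation toward their literal visuals and the intended semantics are overlooked. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem and incorporates a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes the corresponding image-generation actions, building semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix on rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, in both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.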

Title: Preference Learning with Response Time

Authors: Ayush Sawarni, Sahasrajit Sarmasarkar, Vasilis Syrgkanis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22820
Pdf URL: https://arxiv.org/pdf/2505.22820
Copy Paste: [[2505.22820]] Preference Learning with Response Time(https://arxiv.org/abs/2505.22820)
Keywords: generative
Abstract: This paper investigates the integration of response time data into human preference learning frameworks for more effective reward model elicitation. While binary preference data has become fundamental in fine-tuning foundation models, generative AI systems, and other large-scale models, the valuable temporal information inherent in user decision-making remains largely unexploited. We propose novel methodologies to incorporate response time information alongside binary choice data, leveraging the Evidence Accumulation Drift Diffusion (EZ) model, under which response time is informative of the preference strength. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the theoretical optimal rates that would be attained if the expected response times for each query were known a priori. Our theoretical analysis demonstrates that for linear reward functions, conventional preference learning suffers from error rates that scale exponentially with reward magnitude. In contrast, our response time-augmented approach reduces this to polynomial scaling, representing a significant improvement in sample efficiency. We extend these guarantees to non-parametric reward function spaces, establishing convergence properties for more complex, realistic reward models. Our extensive experiments validate our theoretical findings in the context of preference learning over images.
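Abstract (translated from Chinese): This paper studies how to integrate response time data into human preference learning frameworks for more effective reward model elicitation. Although binary preference data has become fundamental to fine-tuning foundation models, generative AI systems, and other large-scale models, the valuable temporal information inherent in user decision-making remains largely unexploited. We propose new methods that incorporate response time information alongside binary choice data, using the Evidence Accumulation Drift Diffusion (EZ) model, under which response time is informative about preference strength. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the theoretically optimal rates that would be attained if the expected response time of each query were known a priori. Our theoretical analysis shows that, for linear reward functions, conventional preference learning suffers from error rates that scale exponentially with reward magnitude. In contrast, our response-time-augmented approach reduces this to polynomial scaling, a significant improvement in sample efficiency. We extend these guarantees to non-parametric reward function spaces, establishing convergence properties for more complex, realistic reward models. Extensive experiments validate our theoretical findings in the context of preference learning over images.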
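
Sketch (not from the paper): the closed-form moments of a symmetric drift-diffusion model illustrate why response time carries information about preference strength. The setup is an assumption made for illustration: unit-variance noise, start at 0, absorbing barriers at +a and -a, and a drift v standing in for the reward difference. The paper's Neyman-orthogonal losses are not reproduced here.

```python
import math

def choice_prob(v, a):
    """P(process hits +a before -a) for drift v, unit noise, start at 0."""
    return 1.0 / (1.0 + math.exp(-2.0 * a * v))

def expected_choice(v, a):
    """E[Y] with Y = +1 for the upper barrier and -1 for the lower."""
    return math.tanh(a * v)

def expected_response_time(v, a):
    """E[T] until either barrier is hit; the limit as v -> 0 is a**2."""
    if abs(v) < 1e-12:
        return a * a
    return (a / v) * math.tanh(a * v)

def drift_from_moments(mean_choice, mean_time, a):
    """Recover the drift from the identity E[Y] / E[T] = v / a."""
    return a * mean_choice / mean_time

v, a = 0.7, 1.5
recovered = drift_from_moments(expected_choice(v, a), expected_response_time(v, a), a)
```

Choices alone saturate for strong preferences (tanh approaches 1), while the ratio of mean choice to mean response time stays linear in the drift, which is the intuition behind using both signals.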

Title: PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow

Authors: Michael Klamkin, Mathieu Tanneau, Pascal Van Hentenryck
Subjects: cs.LG, cs.AI, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2505.22825
Pdf URL: https://arxiv.org/pdf/2505.22825
Copy Paste: [[2505.22825]] PGLearn -- An Open-Source Learning Toolkit for Optimal Power Flow(https://arxiv.org/abs/2505.22825)
Keywords: generation
Abstract: Machine Learning (ML) techniques for Optimal Power Flow (OPF) problems have recently garnered significant attention, reflecting a broader trend of leveraging ML to approximate and/or accelerate the resolution of complex optimization problems. These developments are necessitated by the increased volatility and scale in energy production for modern and future grids. However, progress in ML for OPF is hindered by the lack of standardized datasets and evaluation metrics, from generating and solving OPF instances, to training and benchmarking machine learning models. To address this challenge, this paper introduces PGLearn, a comprehensive suite of standardized datasets and evaluation tools for ML and OPF. PGLearn provides datasets that are representative of real-life operating conditions, by explicitly capturing both global and local variability in the data generation, and by, for the first time, including time series data for several large-scale systems. In addition, it supports multiple OPF formulations, including AC, DC, and second-order cone formulations. Standardized datasets are made publicly available to democratize access to this field, reduce the burden of data generation, and enable the fair comparison of various methodologies. PGLearn also includes a robust toolkit for training, evaluating, and benchmarking machine learning models for OPF, with the goal of standardizing performance evaluation across the field. By promoting open, standardized datasets and evaluation metrics, PGLearn aims at democratizing and accelerating research and innovation in machine learning applications for optimal power flow problems. Datasets are available for download at this https URL.
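Abstract (translated from Chinese): Machine learning (ML) techniques for Optimal Power Flow (OPF) problems have recently attracted significant attention, reflecting a broader trend of using ML to approximate and/or accelerate the solution of complex optimization problems. These developments are driven by the growing volatility and scale of energy production in modern and future grids. However, progress in ML for OPF is hindered by the lack of standardized datasets and evaluation metrics, from generating and solving OPF instances to training and benchmarking machine learning models. To address this challenge, this paper introduces PGLearn, a comprehensive suite of standardized datasets and evaluation tools for ML and OPF. PGLearn provides datasets representative of real-life operating conditions by explicitly capturing both global and local variability in data generation and, for the first time, by including time series data for several large-scale systems. It also supports multiple OPF formulations, including AC, DC, and second-order cone formulations. The standardized datasets are publicly available to democratize access to the field, reduce the burden of data generation, and enable fair comparison of methodologies. PGLearn further includes a robust toolkit for training, evaluating, and benchmarking machine learning models for OPF, with the goal of standardizing performance evaluation across the field. By promoting open, standardized datasets and evaluation metrics, PGLearn aims to democratize and accelerate research and innovation in machine learning for optimal power flow. Datasets are available for download at this https URL.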

Title: Kernel-Smoothed Scores for Denoising Diffusion: A Bias-Variance Study

Authors: Franck Gabriel, François Ged, Maria Han Veiga, Emmanuel Schertzer
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2505.22841
Pdf URL: https://arxiv.org/pdf/2505.22841
Copy Paste: [[2505.22841]] Kernel-Smoothed Scores for Denoising Diffusion: A Bias-Variance Study(https://arxiv.org/abs/2505.22841)
Keywords: generative
Abstract: Diffusion models now set the benchmark in high-fidelity generative sampling, yet they can, in principle, be prone to memorization. In this case, their learned score overfits the finite dataset so that the reverse-time SDE samples are mostly training points. In this paper, we interpret the empirical score as a noisy version of the true score and show that its covariance matrix is asymptotically a re-weighted data PCA. In large dimension, the small time limit makes the noise variance blow up while simultaneously reducing spatial correlation. To reduce this variance, we introduce a kernel-smoothed empirical score and analyze its bias-variance trade-off. We derive asymptotic bounds on the Kullback-Leibler divergence between the true distribution and the one generated by the modified reverse SDE. Regularization on the score has the same effect as increasing the size of the training dataset, and thus helps prevent memorization. A spectral decomposition of the forward diffusion suggests better variance control under some regularity conditions of the true data distribution. Reverse diffusion with kernel-smoothed empirical score can be reformulated as a gradient descent drifted toward a Log-Exponential Double-Kernel Density Estimator (LED-KDE). This perspective highlights two regularization mechanisms taking place in denoising diffusions: an initial Gaussian kernel first diffuses mass isotropically in the ambient space, while a second kernel applied in score space concentrates and spreads that mass along the data manifold. Hence, even a straightforward regularization, without any learning, already mitigates memorization and enhances generalization. Numerically, we illustrate our results with several experiments on synthetic and MNIST datasets.
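Abstract (translated from Chinese): Diffusion models now set the benchmark in high-fidelity generative sampling, yet in principle they can be prone to memorization. In that case the learned score overfits the finite dataset, so that samples from the reverse-time SDE are mostly training points. In this paper, we interpret the empirical score as a noisy version of the true score and show that its covariance matrix is asymptotically a re-weighted data PCA. In high dimension, the small-time limit makes the noise variance blow up while reducing spatial correlation. To reduce this variance, we introduce a kernel-smoothed empirical score and analyze its bias-variance trade-off. We derive asymptotic bounds on the Kullback-Leibler divergence between the true distribution and the one generated by the modified reverse SDE. Regularizing the score has the same effect as increasing the size of the training dataset and therefore helps prevent memorization. A spectral decomposition of the forward diffusion suggests better variance control under some regularity conditions on the true data distribution. Reverse diffusion with the kernel-smoothed empirical score can be reformulated as gradient descent drifting toward a Log-Exponential Double-Kernel Density Estimator (LED-KDE). This perspective highlights two regularization mechanisms at work in denoising diffusions: an initial Gaussian kernel first diffuses mass isotropically in the ambient space, while a second kernel applied in score space concentrates that mass and spreads it along the data manifold. Hence even a straightforward regularization, without any learning, already mitigates memorization and improves generalization. We illustrate the results numerically with several experiments on synthetic and MNIST datasets.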
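
Sketch (not from the paper): for 1-D data and a forward process that only adds Gaussian noise of scale sigma, the empirical score is the score of an equal-weight Gaussian mixture centered on the training points. The smoothed variant below simply averages that score under a Gaussian kernel using a fixed quadrature grid. The kernel, the bandwidth, and the function names are assumptions, since the abstract does not specify the paper's construction.

```python
import math

def empirical_score(x, data, sigma):
    """Score of (1/n) * sum_i N(x_i, sigma^2) at x, for 1-D data."""
    logits = [-(x - xi) ** 2 / (2.0 * sigma ** 2) for xi in data]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * (xi - x) for w, xi in zip(weights, data)) / (total * sigma ** 2)

def smoothed_score(x, data, sigma, bandwidth, n_points=21):
    """Gaussian-kernel average of the empirical score around x (simple quadrature)."""
    offsets = [bandwidth * (-3.0 + 6.0 * j / (n_points - 1)) for j in range(n_points)]
    kernel = [math.exp(-0.5 * (o / bandwidth) ** 2) for o in offsets]
    total = sum(kernel)
    return sum(k * empirical_score(x + o, data, sigma)
               for k, o in zip(kernel, offsets)) / total

single = empirical_score(0.3, [1.0], 0.5)          # one point: (1.0 - 0.3) / 0.5**2
midpoint = empirical_score(0.0, [-1.0, 1.0], 0.2)  # symmetric data: pulls cancel
midpoint_smooth = smoothed_score(0.0, [-1.0, 1.0], 0.5, 0.1)
```

As sigma shrinks, the softmax weights collapse onto the nearest training point, which is the memorization regime the abstract describes; averaging over a neighborhood blunts that collapse at the price of some bias.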

Title: RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

Authors: Nikita Khramov, Andrei Kozyrev, Gleb Solovev, Anton Podkopaev
Subjects: cs.LG, cs.AI, cs.LO, cs.SE
Abstract URL: https://arxiv.org/abs/2505.22846
Pdf URL: https://arxiv.org/pdf/2505.22846
Copy Paste: [[2505.22846]] RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation(https://arxiv.org/abs/2505.22846)
Keywords: generation, generative
Abstract: Interactive Theorem Proving has repeatedly been shown to be fruitful when combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach, leveraging retrieval via a self-attentive embedder model. The evaluation of the designed approach shows up to a 28% relative increase in the generator's performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and show the use of multi-agent debate at the planning stage of proof synthesis.
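Abstract (translated from Chinese): Interactive theorem proving has repeatedly proved fruitful in combination with generative artificial intelligence. This paper assesses several approaches to Rocq generation and identifies potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach that performs retrieval with a self-attentive embedder model. Evaluation of the approach shows a relative increase of up to 28% in the generator's performance. We tackle the problem of writing Rocq proofs with a multi-stage agentic system tailored to formal verification and demonstrate its high effectiveness. We conduct an ablation study and show the use of multi-agent debate at the planning stage of proof synthesis.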
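
Sketch (not from the paper): similarity-driven premise selection can be pictured as ranking candidate premises by cosine similarity between embedding vectors and keeping the top k. The vectors and lemma names below are hand-written placeholders; the paper's self-attentive embedder is not reproduced.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_premises(goal_vec, premise_vecs, k):
    """Return the names of the k premises most similar to the goal embedding."""
    ranked = sorted(premise_vecs,
                    key=lambda name: cosine(goal_vec, premise_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical premise embeddings (assumption: 3-D toy vectors).
premises = {
    "add_comm": [0.9, 0.1, 0.0],
    "mul_comm": [0.1, 0.9, 0.0],
    "add_assoc": [0.8, 0.2, 0.1],
}
picked = select_premises([1.0, 0.0, 0.0], premises, k=2)
```

The selected premises would then be placed in the generator's context before it attempts the proof.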

Title: CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

Authors: Kornel Howil, Joanna Waczyńska, Piotr Borycki, Tadeusz Dziarmaga, Marcin Mazur, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22854
Pdf URL: https://arxiv.org/pdf/2505.22854
Copy Paste: [[2505.22854]] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting(https://arxiv.org/abs/2505.22854)
Keywords: generative
Abstract: Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. The CLIPGaussians approach enables joint optimization of color and geometry in 3D and 4D settings, and achieves temporal coherence in videos, while preserving model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.
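Abstract (translated from Chinese): Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. The CLIPGaussians approach jointly optimizes color and geometry in 3D and 4D settings and achieves temporal coherence in videos while preserving model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.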

Title: Scaling Offline RL via Efficient and Expressive Shortcut Models

Authors: Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, Wen Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22866
Pdf URL: https://arxiv.org/pdf/2505.22866
Copy Paste: [[2505.22866]] Scaling Offline RL via Efficient and Expressive Shortcut Models(https://arxiv.org/abs/2505.22866)
Keywords: generative
Abstract: Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at this http URL.
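Abstract (translated from Chinese): Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse, multimodal behavior. However, applying them to offline reinforcement learning (RL) remains challenging because of the iterative nature of their noise sampling processes, which makes policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that uses shortcut models, a novel class of generative models, to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We show that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior as test-time compute increases. We release the code at this http URL.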
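
Sketch (not from the paper): one simple reading of "parallel inference scaling with the learned Q-function as a verifier" is best-of-N selection, in which several candidate actions are drawn from the policy and the one with the highest Q-value is kept. Whether SORL does exactly this is an assumption; the policy and Q-function below are toy stand-ins.

```python
def best_of_n(state, sample_action, q_function, n):
    """Draw n candidate actions and keep the one the Q-function scores highest."""
    candidates = [sample_action(state) for _ in range(n)]
    return max(candidates, key=lambda action: q_function(state, action))

# Toy stand-ins (assumptions): a deterministic "policy" that cycles through
# fixed actions, and a Q-function that prefers actions close to the state value.
_actions = iter([-1.0, 0.2, 0.9, 2.0])

def toy_policy(state):
    return next(_actions)

def toy_q(state, action):
    return -(action - state) ** 2

chosen = best_of_n(1.0, toy_policy, toy_q, n=4)
```

Raising n spends more test-time compute for a better chance that at least one sampled action scores well under the verifier.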

Title: CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

Authors: Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
Subjects: cs.CV, cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.22869
Pdf URL: https://arxiv.org/pdf/2505.22869
Copy Paste: [[2505.22869]] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models(https://arxiv.org/abs/2505.22869)
Keywords: generation
Abstract: Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at this https URL.
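Abstract (translated from Chinese): Existing PLMs generate protein sequences under a single-condition constraint from one specific modality and struggle to satisfy multiple constraints across different modalities at once. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains, and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interactions to ensure more precise control. In addition, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We show that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data are available at this https URL.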

Title: Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Authors: Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22918
Pdf URL: https://arxiv.org/pdf/2505.22918
Copy Paste: [[2505.22918]] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape(https://arxiv.org/abs/2505.22918)
Keywords: generation
Abstract: Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism, where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: this https URL
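Abstract (translated from Chinese): Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, in which only a subset of tokens or patches is included in the computation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overhead. To address this, we propose Re-ttention, which achieves very sparse attention for visual generation models by exploiting the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experiments on T2V/T2I models such as CogVideoX and the PixArt DiTs show that Re-ttention needs as few as 3.1% of the tokens during inference, outperforming contemporary methods such as FastDiTAttn, Sparse VideoGen, and MInference. We also measure latency and show that our method attains over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead. Code is available online here: this https URL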
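
Sketch (not from the paper): the "normalization shift" can be seen in a single attention row. A softmax taken over only the kept tokens has a smaller denominator, so every kept weight is inflated relative to full attention. Multiplying by the ratio of the two denominators undoes the shift exactly. In practice the full denominator is not available at a sparse step; the abstract says Re-ttention relies on the prior softmax distribution history, and how that history is used is not reproduced here. The variable `rho` and the toy scores are assumptions.

```python
import math

def softmax_over(scores, indices):
    """Softmax restricted to `indices`; returns (weights dict, denominator).
    No max-subtraction is used for clarity, so scores must be moderate."""
    exps = {i: math.exp(scores[i]) for i in indices}
    denom = sum(exps.values())
    return {i: e / denom for i, e in exps.items()}, denom

scores = [2.0, 1.0, 0.5, -1.0, 0.0, 1.5]  # one query's attention logits
all_idx = range(len(scores))
kept = [0, 1, 5]                           # tokens retained by the sparse mask

full, full_denom = softmax_over(scores, all_idx)
sparse, sparse_denom = softmax_over(scores, kept)

# Ratio of denominators; here computed exactly, in practice it would have to
# be estimated, for example from an earlier denoising step (assumption).
rho = sparse_denom / full_denom
corrected = {i: w * rho for i, w in sparse.items()}
```

After rescaling, the kept tokens carry the same weights they had under full attention, and the mass of the dropped tokens is simply missing rather than redistributed.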

Title: Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification

Authors: Sylvey Lin, Zhi-Yi Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22926
Pdf URL: https://arxiv.org/pdf/2505.22926
Copy Paste: [[2505.22926]] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification(https://arxiv.org/abs/2505.22926)
Keywords: generation, generative
Abstract: We investigate whether synthetic images generated by diffusion models can enhance multi-label classification of protein subcellular localization. Specifically, we implement a simplified class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples and explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. While these approaches yield promising validation performance, our proposed MixModel exhibits poor generalization to unseen test data, underscoring the challenges of leveraging synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions demonstrate greater stability and test-time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.
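Abstract (translated from Chinese): We investigate whether synthetic images generated by diffusion models can improve multi-label classification of protein subcellular localization. Specifically, we implement a simplified class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples and explore integrating them with real data through two hybrid training strategies: Mix Loss and Mix Representation. Although these approaches yield promising validation performance, our proposed MixModel generalizes poorly to unseen test data, underscoring the challenge of using synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions show greater stability and better test-time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.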
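
Sketch (not from the paper): the forward noising step that any DDPM trains against, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear beta schedule from 1e-4 to 0.02 over 1000 steps is the commonly used default and is an assumption here; the abstract does not state the authors' schedule, and class conditioning is not shown.

```python
import math

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noisy_sample(x0, eps, alpha_bar):
    """x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, element-wise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

schedule = alpha_bars(1000)
# At the last step almost no signal from x0 remains; x_t is close to eps.
x_t = noisy_sample([1.0, -1.0], [0.5, 0.5], schedule[-1])
```

A class-conditional model would additionally feed the label to the denoiser so that sampling can be steered toward a chosen localization class.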

Title: ATI: Any Trajectory Instruction for Controllable Video Generation

Authors: Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, Chongyang Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22944
Pdf URL: https://arxiv.org/pdf/2505.22944
Copy Paste: [[2505.22944]] ATI: Any Trajectory Instruction for Controllable Video Generation(https://arxiv.org/abs/2505.22944)
Keywords: generation, generative
Abstract: We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: this https URL.
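Abstract (translated from Chinese): We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that handle these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, whole-object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework shows superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality than prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: this https URL.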

Title: Directed Graph Grammars for Sequence-based Learning

Authors: Michael Sun, Orion Foo, Gang Liu, Wojciech Matusik, Jie Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.22949
Pdf URL: https://arxiv.org/pdf/2505.22949
Copy Paste: [[2505.22949]] Directed Graph Grammars for Sequence-based Learning(https://arxiv.org/abs/2505.22949)
Keywords: generation, generative
Abstract: Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data. Code is available at this https URL.
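Abstract (translated from Chinese): Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples including electronic circuits, Bayesian networks, and neural architectures. Although many effective encoders exist for DAGs, decoding them in a principled manner remains challenging because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact, and equivalent sequential representation of a DAG. Specifically, we view a graph as a derivation over an unambiguous grammar, in which the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure that constructs such a description can be viewed as lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and exploiting the continuity of the sequence representation for Bayesian optimization over structured data. Code is available at this https URL.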
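
Sketch (not from the paper): the decoding difficulty named in the abstract, that one DAG admits many topological orders, is easy to see by enumerating them. The enumerator below is a plain backtracking search and is unrelated to the paper's grammar; it only illustrates why a naive node sequence is not a unique description.

```python
def topological_orders(nodes, edges):
    """Yield every topological order of a DAG given as (src, dst) edge pairs."""
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1

    def extend(prefix, indegree):
        if len(prefix) == len(nodes):
            yield list(prefix)
            return
        for n in nodes:
            # A node may be placed next once all of its predecessors are placed.
            if n not in prefix and indegree[n] == 0:
                updated = dict(indegree)
                for src, dst in edges:
                    if src == n:
                        updated[dst] -= 1
                yield from extend(prefix + [n], updated)

    yield from extend([], indegree)

diamond = list(topological_orders(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
))
independent = list(topological_orders(["x", "y", "z"], []))
```

A four-node diamond already has two valid orders and three unconnected nodes have six, so the count grows quickly with graph width.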

Title: MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming

Authors: Chengqi Zheng, Jianda Chen, Yueming Lyu, Wen Zheng Terence Ng, Haopeng Zhang, Yew-Soon Ong, Ivor Tsang, Haiyan Yin
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2505.22967
Pdf URL: https://arxiv.org/pdf/2505.22967
Copy Paste: [[2505.22967]] MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming(https://arxiv.org/abs/2505.22967)
Keywords: generation
Abstract: Despite the promise of autonomous agentic reasoning, existing workflow generation methods frequently produce fragile, unexecutable plans due to unconstrained LLM-driven construction. We introduce MermaidFlow, a framework that redefines the agentic search space through safety-constrained graph evolution. At its core, MermaidFlow represents workflows as a verifiable intermediate representation using Mermaid, a structured and human-interpretable graph language. We formulate domain-aware evolutionary operators, i.e., crossover, mutation, insertion, and deletion, to preserve semantic correctness while promoting structural diversity, enabling efficient exploration of a high-quality, statically verifiable workflow space. Without modifying task settings or evaluation protocols, MermaidFlow achieves consistent improvements in success rates and faster convergence to executable plans on the agent reasoning benchmark. The experimental results demonstrate that safety-constrained graph evolution offers a scalable, modular foundation for robust and interpretable agentic reasoning systems.
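Abstract (translated from Chinese): Despite the promise of autonomous agentic reasoning, existing workflow generation methods frequently produce fragile, unexecutable plans because of unconstrained LLM-driven construction. We introduce MermaidFlow, a framework that redefines the agentic search space through safety-constrained graph evolution. At its core, MermaidFlow represents workflows as a verifiable intermediate representation written in Mermaid, a structured and human-interpretable graph language. We formulate domain-aware evolutionary operators, namely crossover, mutation, insertion, and deletion, that preserve semantic correctness while promoting structural diversity, enabling efficient exploration of a high-quality, statically verifiable workflow space. Without modifying task settings or evaluation protocols, MermaidFlow achieves consistent improvements in success rate and faster convergence to executable plans on the agent reasoning benchmark. The experimental results show that safety-constrained graph evolution offers a scalable, modular foundation for robust and interpretable agentic reasoning systems.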
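
Sketch (not from the paper): one kind of static check a Mermaid-encoded workflow makes possible is rejecting a candidate whose graph contains a cycle before anything is executed. The parser below handles only the simplest Mermaid edge form (`A --> B`), and the node names are hypothetical; the paper's actual verifier and constraints are not described in the abstract.

```python
import re

EDGE = re.compile(r"^\s*(\w+)\s*-->\s*(\w+)\s*$")

def parse_edges(mermaid_text):
    """Extract (src, dst) pairs from lines of the form `A --> B`."""
    edges = []
    for line in mermaid_text.splitlines():
        match = EDGE.match(line)
        if match:
            edges.append((match.group(1), match.group(2)))
    return edges

def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)

valid = """graph TD
    plan --> solve
    solve --> check
"""
# A mutation that adds a back edge turns the workflow into a loop.
broken = valid + "    check --> plan\n"
```

An evolutionary operator could run a check like `is_acyclic(parse_edges(candidate))` and discard offspring that fail it.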

Title: EquiReg: Equivariance Regularized Diffusion for Inverse Problems

Authors: Bahareh Tolooshams, Aditi Chandrashekar, Rayhan Zirvi, Abbas Mammadov, Jiachen Yao, Chuwei Wang, Anima Anandkumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22973
Pdf URL: https://arxiv.org/pdf/2505.22973
Copy Paste: [[2505.22973]] EquiReg: Equivariance Regularized Diffusion for Inverse Problems(https://arxiv.org/abs/2505.22973)
Keywords: restoration
Abstract: Diffusion models represent the state-of-the-art for solving inverse problems such as image restoration tasks. In the Bayesian framework, diffusion-based inverse solvers incorporate a likelihood term to guide the prior sampling process, generating data consistent with the posterior distribution. However, due to the intractability of the likelihood term, many current methods rely on isotropic Gaussian approximations, which lead to deviations from the data manifold and result in inconsistent, unstable reconstructions. We propose Equivariance Regularized (EquiReg) diffusion, a general framework for regularizing posterior sampling in diffusion-based inverse problem solvers. EquiReg enhances reconstructions by reweighting diffusion trajectories and penalizing those that deviate from the data manifold. We define a new distribution-dependent equivariance error, empirically identify functions that exhibit low error for on-manifold samples and higher error for off-manifold samples, and leverage these functions to regularize the diffusion sampling process. When applied to a variety of solvers, EquiReg outperforms state-of-the-art diffusion models in both linear and nonlinear image restoration tasks, as well as in reconstructing partial differential equations.
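Abstract (translated from Chinese): Diffusion models represent the state of the art for solving inverse problems such as image restoration tasks. In the Bayesian framework, diffusion-based inverse solvers incorporate a likelihood term to guide the prior sampling process, generating data consistent with the posterior distribution. However, because the likelihood term is intractable, many current methods rely on isotropic Gaussian approximations, which cause deviations from the data manifold and lead to inconsistent, unstable reconstructions. We propose Equivariance Regularized (EquiReg) diffusion, a general framework for regularizing posterior sampling in diffusion-based inverse problem solvers. EquiReg improves reconstructions by reweighting diffusion trajectories and penalizing those that deviate from the data manifold. We define a new distribution-dependent equivariance error, empirically identify functions that exhibit low error on on-manifold samples and higher error on off-manifold samples, and use these functions to regularize the diffusion sampling process. When applied to a variety of solvers, EquiReg outperforms state-of-the-art diffusion models on both linear and nonlinear image restoration tasks, as well as on reconstructing partial differential equations.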
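
Sketch (not from the paper): a plain pointwise equivariance error, the squared distance between f(g(x)) and g(f(x)) for a transformation g, shown on 1-D signals with g taken as a circular shift. The paper defines a distribution-dependent version whose exact form is not given in the abstract, so this only illustrates the underlying quantity. All function names are hypothetical.

```python
def shift(signal, k):
    """Circularly shift a list by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def equivariance_error(f, signal, k):
    """Squared distance between f(shift(x)) and shift(f(x))."""
    left = f(shift(signal, k))
    right = shift(f(signal), k)
    return sum((a - b) ** 2 for a, b in zip(left, right))

def circular_average(signal):
    """Three-tap circular moving average: commutes with circular shifts."""
    n = len(signal)
    return [(signal[i - 1] + signal[i] + signal[(i + 1) % n]) / 3.0 for i in range(n)]

def position_weighted(signal):
    """Scales each entry by its index: does not commute with shifts."""
    return [i * v for i, v in enumerate(signal)]

x = [1.0, 2.0, 4.0, 8.0]
err_equivariant = equivariance_error(circular_average, x, 1)
err_broken = equivariance_error(position_weighted, x, 1)
```

A regularizer built on such an error would add it as a penalty during sampling, pushing iterates toward regions where the chosen function behaves equivariantly.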

Title: Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Authors: Kewei Lian, Shaofei Cai, Yilun Du, Yitao Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22976
Pdf URL: https://arxiv.org/pdf/2505.22976
Copy Paste: [[2505.22976]] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency(https://arxiv.org/abs/2505.22976)
Keywords: generation
Abstract: The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. Designing a memory module is crucial for addressing spatial consistency: such a module must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, there is no dataset designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.
摘要：以空间一致的方式模拟世界的能力是有效世界模型的关键要求。这样的模型可实现高质量的视觉生成，还可以确保世界模型对诸如模拟和计划之类的下游任务的可靠性。设计内存模块是解决空间一致性的关键组成部分：这种模型不仅必须保留长胜下的观察信息，而且还可以构建显式或隐式内部空间表示。但是，没有旨在通过明确执行空间一致性约束来促进内存模块开发的数据集。此外，大多数现有的基准主要强调视觉连贯性或发电质量，从而忽略了长期空间一致性的要求。为了弥合这一差距，我们通过在Minecraft的开放世界环境中采样150个不同位置来构建数据集和相应的基准测试，从而收集了大约250个小时（2000万帧）的基于循环的导航视频。我们的数据集遵循序列长度的课程设计，使模型可以在日益复杂的导航轨迹上学习空间一致性。此外，新的Minecraft环境和模块很容易扩展我们的数据收集管道。在我们的基准测试中评估了四个代表性的世界模型基线。数据集，基准和代码是开源的，以支持未来的研究。

Title: HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

Authors: Shuolin Xu, Siming Zheng, Ziyi Wang, HC Yu, Jinwei Chen, Huaqi Zhang, Bo Li, Peng-Tao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22977
Pdf URL: https://arxiv.org/pdf/2505.22977
Copy Paste: [[2505.22977]] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions(https://arxiv.org/abs/2505.22977)
Keywords: generation
Abstract: Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods are capable of generating high-fidelity and time-consistent animation sequences in regular motions and static scenes, there are still obvious limitations when facing complex human body motions (Hypermotion) that contain highly dynamic, non-standard motions, and a high-quality benchmark for evaluating complex human motion animations is lacking. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.
摘要：扩散模型的最新进展显着改善了有条件的视频生成，尤其是在姿势引导的人类图像动画任务中。尽管现有方法能够在常规运动和静态场景中产生高保真性和时间一致的动画序列，但是当面对复杂的人体运动（超运动）时，仍然存在明显的局限性，这些动作（超运动）包含高度动态的，非标准的运动，并且缺乏用于评估复杂人类运动动画的高质量基准的高质量基准。为了应对这一挑战，我们介绍了\ textbf {open-hypermotionx dataset}和\ textbf {hypermotionx台式}，它提供了高质量的人姿势注释和策划的视频剪辑，以评估和改善姿势引导的人类图像在复杂的人类运动条件下的人类图像动画模型。此外，我们提出了一个简单但功能强大的基于DIT的视频基线和设计空间低频增强绳索，这是一个新型的模块，通过引入可学习的频率缩放来有选择地增强低频空间特征建模。我们的方法显着提高了高度动态人类运动序列中的结构稳定性和外观一致性。广泛的实验证明了我们的数据集和提出方法在提高复杂人类运动图像动画的产生质量方面的有效性。代码和数据集将公开可用。

Title: MOVi: Training-free Text-conditioned Multi-Object Video Generation

Authors: Aimon Rahman, Jiang Liu, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Yusheng Su, Vishal M. Patel, Zicheng Liu, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.22980
Pdf URL: https://arxiv.org/pdf/2505.22980
Copy Paste: [[2505.22980]] MOVi: Training-free Text-conditioned Multi-Object Video Generation(https://arxiv.org/abs/2505.22980)
Keywords: generation
Abstract: Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle with accurately capturing complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in a 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.
摘要：基于扩散的文本对视频（T2V）模型的最新进展表现出了显着的进步，但是这些模型在生成具有多个对象的视频方面仍然面临挑战。大多数模型都在准确捕获复杂的对象相互作用，通常将某些对象视为静态背景元素并限制其运动。此外，它们通常无法像提示中指定的多个不同的对象生成多个不同的对象，从而导致几代或对象之间的混合特征。在本文中，我们为多物体视频生成提供了一种新颖的无培训方法，该方法利用了扩散模型和大型语言模型（LLMS）的开放世界知识。我们将LLM用作对象轨迹的``导演''，并通过噪声重新定位应用轨迹来实现对现实运动的精确控制。我们通过操纵注意机制以更好地捕获特定于对象的特征和运动模式并防止跨对象特征干扰来进一步完善生成过程。广泛的实验验证了我们训练方法在显着增强现有视频扩散模型的多目标生成能力方面的有效性，从而导致运动动力学和对象产生准确性的绝对增强42％，同时还保持了高忠诚度和运动平稳性。

Title: EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge

Authors: Ruskin Raj Manku, Yuzhi Tang, Xingjian Shi, Mu Li, Alex Smola
Subjects: cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2505.23009
Pdf URL: https://arxiv.org/pdf/2505.23009
Copy Paste: [[2505.23009]] EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge(https://arxiv.org/abs/2505.23009)
Keywords: generation
Abstract: Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open-source the evaluation code (this https URL) and the dataset (this https URL).
摘要：文本到语音（TTS）基准通常无法捕获模型如何处理细微差别和语义复杂文本。在$ \ textit {equarkenttts} $上建立，我们介绍了$ \ textit {emperttts-eval} $，这是一个全面的基准，涵盖了六个具有挑战性的TTS方案：情感，副语言学，外语，句法复杂性，复杂的发音，复杂的发音（例如，urls，Formulas和Formulas）和问题。至关重要的是，我们的框架可以使测试案例生成和评估自动化，从而使基准很容易扩展。从一小撮人写的种子提示开始，我们使用LLMS迭代地扩展了它们，以针对特定的结构，语音和韵律挑战，从而导致1,645种不同的测试用例。此外，我们使用大型音频语言模型（LALM）采用了一种模型的法官方法，以评估跨多个维度（例如表达的情绪，韵律，语言和发音精度）的语音。我们在EmprentTTS-Eval上评估了最新的开源和专有TTS系统，例如11LAB，DeepGram和Openai的4o-Mini-TTS，表明其揭示了细粒度差异差异的能力。结果表明，AS-A-A-Gudge方法提供了强大的TTS评估和与人类偏好的高度相关性。我们开源评估$ \ href {此https url} {code} $和$ \ href {this https url} {dataset} $。

Title: SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

Authors: Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23010
Pdf URL: https://arxiv.org/pdf/2505.23010
Copy Paste: [[2505.23010]] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model(https://arxiv.org/abs/2505.23010)
Keywords: super-resolution
Abstract: High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at this https URL.
摘要：高分辨率（HR）遥感图像在包括城市规划和环境监测在内的广泛应用中起着至关重要的作用。但是，由于传感器和数据传输链接的局限性，实践中获得的图像通常会遭受分辨率降解的影响。遥感图像超分辨率（RSISR）旨在重建低分辨率（LR）输入的HR图像，从而提供了直接HR图像采集的经济高效且有效的替代方案。现有的RSISR方法主要集中于像素空间中的低级特征，同时忽略了对遥感场景的高级理解。这可能会导致在重建结果中的语义上不一致的伪影。在这一观察结果的推动下，我们的工作旨在探讨高级语义知识在改善RSISR性能中的作用。我们提出了一个语义引导的超分辨率框架SEG-SR，该框架利用视觉语言模型（VLM）从输入图像中提取语义知识，并使用它来指导超级分辨率（SR）过程。具体而言，我们首先设计了使用验证的VLM从遥感图像中提取语义知识的语义特征提取模块（SFEM）。接下来，我们提出一个语义定位模块（SLM），该模块从提取的语义知识中得出了一系列的语义指导。最后，我们开发了一个可学习的调制模块（LMM），该模块使用语义指导来调节SR网络提取的功能，从而有效地将高级场景理解纳入SR管道中。我们通过广泛的实验来验证SEG-SR的有效性和概括性：SEG-SR在两个数据集上实现最先进的性能，并始终在各种SR体系结构中提供性能提高。代码可以在此HTTPS URL上找到。

Title: $K^2$VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting

Authors: Xingjian Wu, Xiangfei Qiu, Hongfan Gao, Jilin Hu, Bin Yang, Chenjuan Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23017
Pdf URL: https://arxiv.org/pdf/2505.23017
Copy Paste: [[2505.23017]] $K^2$VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting(https://arxiv.org/abs/2505.23017)
Keywords: generative
Abstract: Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such a linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
摘要：概率时间序列预测（PTSF）在包括经济学，能源和运输在内的各个领域的决策中起着至关重要的作用。大多数现有的方法在短期预测中出色，同时忽略了长期概率时间序列预测的障碍（LPTSF）。随着预测范围的扩展，固有的非线性动力学对预测准确性产生了重大不利影响，并通过增加每次迭代的成本来使生成模型效率低下。为了克服这些限制，我们引入了$ k^2 $ vae，这是一种有效的基于VAE的生成模型，它利用Koopmannet将非线性时间序列转换为线性动力学系统，并设计了Kalmannet来完善这种线性系统中的预测和模型不确定性，从而减少了长期预测中误差的积累。广泛的实验表明，$ k^2 $ vae在短期和长期PTSF中都优于最先进的方法，提供了更有效，更准确的解决方案。

Title: Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

Authors: Jihai Zhang, Tianle Li, Linjie Li, Zhengyuan Yang, Yu Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23043
Pdf URL: https://arxiv.org/pdf/2505.23043
Copy Paste: [[2505.23043]] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation(https://arxiv.org/abs/2505.23043)
Keywords: generation
Abstract: Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces will lead to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.
摘要：整合视觉理解和发电能力的统一视觉模型（VLM）的最新进展引起了极大的关注。基本的假设是，对理解和生成任务进行混合培训的统一体系结构可以在理解和产生之间相互增强。但是，该假设在统一VLM的先前工作中仍未得到充实。为了解决这一差距，本文系统地研究了统一VLM中的理解和发电任务的概括。具体而言，我们设计了一个与现实世界情景紧密对齐的数据集，以促进广泛的实验和定量评估。我们评估多个统一的VLM架构以验证我们的发现。我们的主要发现如下。首先，经过混合数据培训的统一VLM在理解和各种体系结构的生成任务方面具有相互利益，并且这种相互利益可以随着数据的增加而扩展。其次，多模式输入和输出空间之间的更好对齐将导致更好的概括。第三，生成任务中获得的知识可以转移到理解任务，而这种交叉任务概括发生在基本语言模型中，超出了模态适配器。我们的发现强调了在VLM中统一理解和产生的关键必要性，为统一VLM的设计和优化提供了宝贵的见解。

Title: Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object

Authors: Yuxuan Lin, Ruihang Chu, Zhenyu Chen, Xiao Tang, Lei Ke, Haoling Li, Yingji Zhong, Zhihao Li, Shiyong Liu, Xiaofei Wu, Jianzhuang Liu, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23054
Pdf URL: https://arxiv.org/pdf/2505.23054
Copy Paste: [[2505.23054]] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object(https://arxiv.org/abs/2505.23054)
Keywords: generation, generative
Abstract: Generative 3D reconstruction shows strong potential in incomplete observations. While sparse-view and single-image reconstruction are well-researched, partial observation remains underexplored. In this context, dense views are accessible only from a specific angular range, with other perspectives remaining inaccessible. This task presents two main challenges: (i) Limited view range: observations confined to a narrow angular scope prevent the effective use of traditional interpolation techniques that require evenly distributed perspectives. (ii) Inconsistent generation: views created for invisible regions often lack coherence with both visible regions and each other, compromising reconstruction consistency. To address these challenges, we propose Zero-P-to-3, a novel training-free approach that integrates the local dense observations and multi-source priors for reconstruction. Our method introduces a fusion-based strategy to effectively align these priors in DDIM sampling, thereby generating multi-view consistent images to supervise invisible views. We further design an iterative refinement strategy, which uses the geometric structures of the object to enhance reconstruction quality. Extensive experiments on multiple datasets show the superiority of our method over SOTAs, especially in invisible regions.
摘要：生成的3D重建在不完整的观察结果中显示出强大的潜力。虽然稀疏视图和单图像重建是经过充分研究的，但部分观察仍然没有被忽略。在这种情况下，只有从特定的角度范围内才能访问稠密的观点，而其他观点仍然无法访问。这项任务提出了两个主要挑战：（i）有限的视图范围：限制狭窄的角度范围的观察结果可以防止有效的传统插值技术需要均匀分布的观点。（ii）不一致的一代：为无形区域创建的观点通常与可见区域和彼此之间缺乏连贯性，从而损害了重建一致性。为了应对这些挑战，我们提出了一种新型的无训练方法，该方法将局部密集的观察结果和多源先验集成了重建。我们的方法引入了一种基于融合的策略，以有效地将这些先验的ddim抽样统一，从而生成多视图一致的图像以监督隐形视图。我们进一步设计了一种迭代改进策略，该策略使用对象的几何结构来增强重建质量。在多个数据集上进行的广泛实验表明，我们方法比SOTA的优越性，尤其是在无形区域。

Title: DINGO: Constrained Inference for Diffusion LLMs

Authors: Tarun Suresh, Debangshu Banerjee, Shubham Ugare, Sasa Misailovic, Gagandeep Singh
Subjects: cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.23061
Pdf URL: https://arxiv.org/pdf/2505.23061
Copy Paste: [[2505.23061]] DINGO: Constrained Inference for Diffusion LLMs(https://arxiv.org/abs/2505.23061)
Keywords: generation
Abstract: Diffusion LLMs have emerged as a promising alternative to conventional autoregressive LLMs, offering significant potential for improved runtime efficiency. However, existing diffusion models lack the ability to provably enforce user-specified formal constraints, such as regular expressions, which makes them unreliable for tasks that require structured outputs, such as fixed-schema JSON generation. Unlike autoregressive models that generate tokens sequentially, diffusion LLMs predict a block of tokens in parallel. This parallelism makes traditional constrained decoding algorithms, which are designed for sequential token prediction, ineffective at preserving the true output distribution. To address this limitation, we propose DINGO, a dynamic programming-based constrained decoding strategy that is both efficient and provably distribution-preserving. DINGO enables sampling of output strings with the highest probability under the model's predicted distribution, while strictly satisfying any user-specified regular expression. On standard symbolic math and JSON generation benchmarks, DINGO achieves up to a 68 percentage point improvement over unconstrained inference.
摘要：扩散LLM已成为常规自回旋LLM的有希望的替代品，为提高运行时效率提供了巨大的潜力。但是，现有的扩散模型缺乏能够实现用户指定的正式约束（例如正则表达式）的能力，这使得它们对于需要结构化输出的任务（例如固定的Schema JSON生成）不可靠。与依次生成代币的自回旋模型不同，扩散LLMS并行预测了一个令牌的块。该并行性使传统的受约束解码算法设计用于顺序令牌预测，在保留真实输出分布方面无效。为了解决这一限制，我们提出了Dingo，这是一种基于动态的基于编程的约束解码策略，既有效又可以证明具有分配性能。在模型的预测分布下，DINGO启用具有最高概率的输出字符串的采样，同时严格满足任何用户指定的正则表达式。关于标准的符号数学和JSON生成基准，Dingo比不受限制的推断可提高68个百分点
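
The idea of dynamic-programming constrained decoding over a block of parallel token predictions can be sketched as follows. This is an illustration under stated assumptions, not the DINGO algorithm itself: it assumes each position has an independent token distribution, that the regular expression has already been compiled by hand into a DFA transition table, and that the function name `best_constrained_string` and its arguments are hypothetical. It keeps, for every DFA state, the most probable token prefix that reaches it, which is a Viterbi-style recursion.

```python
from typing import Dict, List, Optional, Set, Tuple

TokenDist = Dict[str, float]          # token -> probability at one position
Transitions = Dict[Tuple[int, str], int]  # (state, token) -> next state


def best_constrained_string(
    probs: List[TokenDist],
    transitions: Transitions,
    start: int,
    accepting: Set[int],
) -> Optional[Tuple[float, List[str]]]:
    """Return (probability, tokens) of the most probable block accepted by the DFA."""
    # best[state] holds the highest-probability prefix that ends in that state.
    best: Dict[int, Tuple[float, List[str]]] = {start: (1.0, [])}
    for dist in probs:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for state, (p, toks) in best.items():
            for tok, q in dist.items():
                if q <= 0.0 or (state, tok) not in transitions:
                    continue  # token is impossible or violates the constraint
                s2 = transitions[(state, tok)]
                cand = p * q
                if s2 not in nxt or cand > nxt[s2][0]:
                    nxt[s2] = (cand, toks + [tok])
        best = nxt
    finals = [v for s, v in best.items() if s in accepting]
    return max(finals, key=lambda v: v[0]) if finals else None


# Hand-written DFA for the regular expression a+b over the tokens {"a", "b"}.
A_PLUS_B: Transitions = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2}
```

With three positions whose per-position argmax tokens are `b`, `b`, `a`, unconstrained decoding yields a string that does not match `a+b`, whereas the recursion above returns the only valid length-3 string, `a a b`. The cost is proportional to block length times DFA states times vocabulary size.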

Title: URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

Authors: Rui Xu, Yuzhen Niu, Yuezhou Li, Huangbiao Xu, Wenxi Liu, Yuzhong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23068
Pdf URL: https://arxiv.org/pdf/2505.23068
Copy Paste: [[2505.23068]] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration(https://arxiv.org/abs/2505.23068)
Keywords: restoration
Abstract: Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with a multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through an exponential moving average approach, effectively capturing subtle variations while mitigating information loss inherent in the single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
摘要：现有的低光图像增强（LLIE），关节Llie和Deblurring（Llie-Deblur）模型在解决预定义的降解方面取得了长足的进步，但通常会受到动态耦合降解的限制。为了应对这些挑战，我们引入了具有多状态透视图的统一接收加权钥匙值（URWKV）模型，从而为低光图像提供了灵活有效的降解恢复。具体而言，我们通过利用多个阶段内和阶段的状态来自定义核心URWKV块来感知和分析复杂的降解。首先，受到人类视觉系统中的学生机制的启发，我们提出了亮度自适应归一化（LAN），该归一化（LAN）基于富阶段的阶段状态来调整归一化参数，从而允许自适应，场景感知的亮度调节。其次，我们通过指数移动平均方法来汇总多个阶段内状态，从而有效捕获微妙的变化，同时减轻单状态机制固有的信息损失。为了减少通常与常规跳过连接相关的降解效果，我们提出了州感知的选择性融合（SSF）模块，该模块动态地对齐和集成了跨编码器阶段的多状态功能，选择性地融合了上下文信息。与最先进的模型相比，我们的URWKV模型在各种基准测试中都能达到卓越的性能，同时需要更少的参数和计算资源。
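
Aggregating several states with an exponential moving average, as the abstract describes for intra-stage states, can be sketched in a few lines. The code below is only a generic EMA over equally sized state vectors; the function name `ema_aggregate`, the `decay` value, and the choice to initialize from the first state are assumptions, and the paper's actual aggregation may differ.

```python
from typing import List, Sequence


def ema_aggregate(states: Sequence[Sequence[float]], decay: float = 0.9) -> List[float]:
    """Fold a sequence of state vectors into one vector with an exponential moving average."""
    if not states:
        raise ValueError("at least one state is required")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    agg = list(states[0])  # assumed initialization: start from the first state
    for state in states[1:]:
        # Older information is kept with weight `decay`; the new state gets the rest.
        agg = [decay * a + (1.0 - decay) * s for a, s in zip(agg, state)]
    return agg
```

A larger `decay` keeps more of the earlier states, while a smaller one lets the most recent state dominate, so the single aggregated vector still reflects every state rather than only the last.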

Title: GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Authors: Gwanghyun Kim, Xueting Li, Ye Yuan, Koki Nagano, Tianye Li, Jan Kautz, Se Young Chun, Umar Iqbal
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2505.23085
Pdf URL: https://arxiv.org/pdf/2505.23085
Copy Paste: [[2505.23085]] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion(https://arxiv.org/abs/2505.23085)
Keywords: generation
Abstract: Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
摘要：从视频中估算准确和时间一致的3D人类几何形状是计算机视觉中的一个挑战性问题。现有的方法主要针对单个图像进行了优化，通常会遇到时间不一致，并且无法捕获细粒度的动态细节。为了解决这些局限性，我们提出了Geoman，这是一种新型的架构，旨在产生单眼人类视频的准确和时间一致的深度和正常估计。 Geoman解决了两个关键挑战：高质量4D训练数据的稀缺性以及对度量深度估算的需求，以准确模拟人类的规模。为了克服第一个挑战，Geoman采用基于图像的模型来估算视频的第一帧的深度和正态，然后对视频扩散模型进行了调节，将视频几何估计任务重新描述为图像到视频生成问题。该设计将几何估计的繁重提升到图像模型中，并简化了视频模型的角色，以专注于复杂的细节，同时使用从大型视频数据集中学到的先验。因此，Geoman可以提高时间一致性和可推广性，同时需要最少的4D培训数据。为了应对准确的人类大小估计的挑战，我们引入了一个根层的深度表示，该表示保留了关键的人类规模细节，并且更容易从单眼输入中估算，从而克服了传统仿射不变和度量深度表示的局限性。 Geoman在定性和定量评估中都取得了最先进的表现，证明了其在克服视频中3D人类几何估算中长期存在的挑战方面的有效性。
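
One plausible reading of a root-relative depth representation is depth expressed as an offset from the depth of a root joint, so that body-scale differences survive while the absolute camera distance is factored out. The sketch below assumes exactly that reading; the function names, the choice of root joint, and the plain subtraction are illustrative assumptions and may not match the representation used in GeoMan.

```python
from typing import List

DepthMap = List[List[float]]  # row-major depth values in metres


def to_root_relative(depth: DepthMap, root_depth: float) -> DepthMap:
    """Express each pixel's depth as an offset from the (assumed) root joint depth."""
    return [[d - root_depth for d in row] for row in depth]


def to_metric(relative: DepthMap, root_depth: float) -> DepthMap:
    """Recover metric depth once the root joint depth is known."""
    return [[r + root_depth for r in row] for row in relative]
```

Under this reading, offsets stay in metric units, unlike an affine-invariant depth map that discards scale, and the values stay in a narrow range around zero regardless of how far the person stands from the camera.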

Title: Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

Authors: Yunshen Wang, Yicheng Liu, Tianyuan Yuan, Yucheng Mao, Yingshi Liang, Xiuyu Yang, Honggang Zhang, Hang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23115
Pdf URL: https://arxiv.org/pdf/2505.23115
Copy Paste: [[2505.23115]] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving(https://arxiv.org/abs/2505.23115)
Keywords: generative
Abstract: Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency and noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.
摘要：准确地预测来自视觉输入的3D占用网格对于自动驾驶至关重要，但是当前的歧视方法在嘈杂的数据，不完整的观察结果以及3D场景中固有的复杂结构中遇到了困难。在这项工作中，我们使用扩散模型将3D占用预测作为生成建模任务进行了重新构建，该模型学习了基础数据分布并结合了3D场景先验。这种方法增强了预测一致性，稳健性，并更好地处理了3D空间结构的复杂性。我们的广泛实验表明，基于扩散的生成模型优于最先进的判别方法，提供了更现实和准确的占用预测，尤其是在闭塞或低可见度区域中。此外，改进的预测显着受益于下游计划任务，突出了我们在现实世界自动驾驶应用程序中方法的实际优势。

Title: TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

Authors: Keren Ye, Ignacio Garcia Dorado, Michalis Raptis, Mauricio Delbracio, Irene Zhu, Peyman Milanfar, Hossein Talebi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23119
Pdf URL: https://arxiv.org/pdf/2505.23119
Copy Paste: [[2505.23119]] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance(https://arxiv.org/abs/2505.23119)
Keywords: super-resolution, generation
Abstract: While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
摘要：尽管使用扩散模型在图像超分辨率（SR）方面取得的最新进展已显示出在改善整体图像质量方面的希望，但它们在场景文本图像中的应用却揭示了局限性。这些模型通常会在准确的文本区域定位中挣扎，并且无法有效地对图像和多语言角色与形状先验进行建模。这会导致不一致，幻觉的产生以及超级分辨文本的感知质量下降。为了解决这些问题，我们介绍了TextSR，这是一种专门为多语言场景文本图像超分辨率设计的多模式扩散模型。 TextSR利用文本检测器来查明图像中的文本区域，然后采用光学字符识别（OCR）从这些区域提取多语言文本。然后，使用基于UTF-8的文本编码器和交叉注意将提取的文本字符转换为视觉形状。认识到OCR有时可能会在现实世界中产生不准确的结果，因此我们开发了两种创新方法来增强我们的模型的鲁棒性。通过将文本字符先验与低分辨率文本图像相结合，我们的模型有效地指导了超分辨率的过程，增强了文本中的细节并提高了整体知名度。我们的模型在TextZoom和TextVQA数据集上的出色性能为Stisr设定了新的基准，强调了我们方法的功效。

Title: MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

Authors: Siyuan Wang, Jiawei Liu, Wei Wang, Yeying Jin, Jinsong Du, Zhi Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23120
Pdf URL: https://arxiv.org/pdf/2505.23120
Copy Paste: [[2505.23120]] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation(https://arxiv.org/abs/2505.23120)
Keywords: generation
Abstract: Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at this https URL.
摘要：共同言论的手势视频生成旨在从音频驱动的静止图像中生成生动的语音视频，这是由于身体不同部分的多样性而在运动振幅，音频相关性和详细功能方面都具有挑战性的。仅依靠音频，因为控制信号通常无法在视频中捕获大型手势运动，从而导致更明显的伪影和扭曲。现有方法通常通过引入其他先验信息来解决此问题，但这可能会限制任务的实际应用。具体而言，我们提出了使用音频的运动面具引导的两阶段网络（MMGT），以及从音频信号产生的运动掩模和运动功能，以共同推动同步的语音手势视频的产生。在第一阶段，空间掩盖引导的音频姿势产生（SMGA）网络从音频产生高质量的姿势视频和运动口罩，从而有效地捕获了诸如面部和手势等关键区域的大型运动。在第二阶段，我们将运动掩盖的层次音频关注（MM-HAA）整合到稳定的扩散视频生成模型中，克服了在传统方法中发现的细粒运动产生和特定于区域的细节控制中的局限性。这保证了具有准确的纹理和运动细节的高质量，详细的上身视频生成。评估表明视频质量改善，唇部同步和手势。该模型和代码可在此HTTPS URL上找到。

Title: HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring

Authors: Bin Wang, Pingjun Li, Jinkun Liu, Jun Cheng, Hailong Lei, Yinze Rong, Huan-ang Gao, Kangliang Chen, Xing Pan, Weihao Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23129
Pdf URL: https://arxiv.org/pdf/2505.23129
Copy Paste: [[2505.23129]] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring(https://arxiv.org/abs/2505.23129)
Keywords: generation
Abstract: End-to-end autonomous driving faces persistent challenges in both generating diverse, rule-compliant trajectories and robustly selecting the optimal path from these options via learned, multi-faceted evaluation. To address these challenges, we introduce HMAD, a framework integrating a distinctive Bird's-Eye-View (BEV) based trajectory proposal mechanism with learned multi-criteria scoring. HMAD leverages BEVFormer and employs learnable anchored queries, initialized from a trajectory dictionary and refined via iterative offset decoding (inspired by DiffusionDrive), to produce numerous diverse and stable candidate trajectories. A key innovation, our simulation-supervised scorer module, then evaluates these proposals against critical metrics including no at-fault collisions, drivable area compliance, comfortableness, and overall driving quality (i.e., extended PDM score). Demonstrating its efficacy, HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. This work highlights the benefits of effectively decoupling robust trajectory generation from comprehensive, safety-aware learned scoring for advanced autonomous driving.

Title: Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

Authors: Tongtong Su, Chengyu Wang, Jun Huang, Dongming Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23134
Pdf URL: https://arxiv.org/pdf/2505.23134
Copy Paste: [[2505.23134]] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing(https://arxiv.org/abs/2505.23134)
Keywords: restoration, generative
Abstract: Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at this https URL.
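To put the reported 2.6 dB PSNR gain in perspective, the sketch below uses the standard PSNR definition, 10·log10(MAX² / MSE), and converts a dB gain into the equivalent reduction in mean squared error. The function names and the toy arrays are illustrative assumptions; this is not the paper's evaluation code.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Standard PSNR in dB for signals whose peak value is max_value."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

# Toy check (hypothetical data): a constant error of 0.1 gives MSE = 0.01.
toy_psnr = psnr(np.zeros(4), np.full(4, 0.1))

# A 2.6 dB gain corresponds to roughly a 45% reduction in MSE.
ratio = mse_ratio_for_gain(2.6)
```

Under this definition, a 2.6 dB improvement means the mean squared error drops to about 0.55 of the baseline's value.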

Title: VERINA: Benchmarking Verifiable Code Generation

Authors: Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song
Subjects: cs.LG, cs.AI, cs.LO, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.23135
Pdf URL: https://arxiv.org/pdf/2505.23135
Copy Paste: [[2505.23135]] VERINA: Benchmarking Verifiable Code Generation(https://arxiv.org/abs/2505.23135)
Keywords: generation
Abstract: Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on this https URL and our evaluation code on this https URL.

Title: Implicit Inversion turns CLIP into a Decoder

Authors: Antonio D'Orazio, Maria Rosaria Briglia, Donato Crisostomi, Dario Loi, Emanuele Rodolà, Iacopo Masi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23161
Pdf URL: https://arxiv.org/pdf/2505.23161
Copy Paste: [[2505.23161]] Implicit Inversion turns CLIP into a Decoder(https://arxiv.org/abs/2505.23161)
Keywords: generation, generative
Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
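The abstract mentions an Orthogonal Procrustes projection for aligning text and image embeddings. As a minimal sketch of the classical Orthogonal Procrustes solution (not the authors' implementation), the orthogonal matrix R minimizing ||A R - B||_F is obtained from the SVD of AᵀB. The variable names and the random stand-in "embeddings" below are assumptions made for illustration only.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal matrix R minimizing ||A @ R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Hypothetical data: 50 "text" vectors and their rotated "image" counterparts.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal matrix
image_emb = text_emb @ Q

# Recover the rotation that maps the text vectors onto the image vectors.
R = orthogonal_procrustes(text_emb, image_emb)
```

Because R is constrained to be orthogonal, the alignment preserves norms and angles within the embedding space, which is one plausible reason to describe such a projection as lightweight.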

Title: RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

Authors: Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, Zhizhong Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23171
Pdf URL: https://arxiv.org/pdf/2505.23171
Copy Paste: [[2505.23171]] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer(https://arxiv.org/abs/2505.23171)
Keywords: generation
Abstract: Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: this https URL

Title: Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging

Authors: Ping Wang, Lishun Wang, Gang Qu, Xiaodong Wang, Yulun Zhang, Xin Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23180
Pdf URL: https://arxiv.org/pdf/2505.23180
Copy Paste: [[2505.23180]] Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging(https://arxiv.org/abs/2505.23180)
Keywords: restoration
Abstract: Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for the single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when the CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that the learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at this https URL.

Title: HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

Authors: Junyi Guo, Jingxuan Zhang, Fangyu Wu, Huanda Lu, Qiufeng Wang, Wenmian Yang, Eng Gee Lim, Dongming Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23186
Pdf URL: https://arxiv.org/pdf/2505.23186
Copy Paste: [[2505.23186]] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image(https://arxiv.org/abs/2505.23186)
Keywords: generation
Abstract: Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.

Title: Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks

Authors: Run Hao, Peng Ying
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.23192
Pdf URL: https://arxiv.org/pdf/2505.23192
Copy Paste: [[2505.23192]] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks(https://arxiv.org/abs/2505.23192)
Keywords: generation
Abstract: The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.

Title: HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

Authors: Aldino Rizaldy, Richard Gloaguen, Fabian Ewald Fassnacht, Pedram Ghamisi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23206
Pdf URL: https://arxiv.org/pdf/2505.23206
Copy Paste: [[2505.23206]] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers(https://arxiv.org/abs/2505.23206)
Keywords: generation
Abstract: Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model's ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: this https URL.

Title: Generalizability vs. Counterfactual Explainability Trade-Off

Authors: Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.23225
Pdf URL: https://arxiv.org/pdf/2505.23225
Copy Paste: [[2505.23225]] Generalizability vs. Counterfactual Explainability Trade-Off(https://arxiv.org/abs/2505.23225)
Keywords: generation
Abstract: In this work, we investigate the relationship between model generalization and counterfactual explainability in supervised learning. We introduce the notion of $\varepsilon$-valid counterfactual probability ($\varepsilon$-VCP) -- the probability of finding perturbations of a data point within its $\varepsilon$-neighborhood that result in a label change. We provide a theoretical analysis of $\varepsilon$-VCP in relation to the geometry of the model's decision boundary, showing that $\varepsilon$-VCP tends to increase with model overfitting. Our findings establish a rigorous connection between poor generalization and the ease of counterfactual generation, revealing an inherent trade-off between generalization and counterfactual explainability. Empirical results validate our theory, suggesting $\varepsilon$-VCP as a practical proxy for quantitatively characterizing overfitting.
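The $\varepsilon$-VCP described above can be approximated numerically. The sketch below is a Monte Carlo estimate under two assumptions that the abstract does not specify: the $\varepsilon$-neighborhood is taken to be an L2 ball, and perturbations are drawn uniformly from it. The helper names and the toy classifier are hypothetical.

```python
import numpy as np

def estimate_eps_vcp(predict, x, eps, n_samples=10_000, seed=0):
    """Fraction of points sampled uniformly from the L2 eps-ball around x
    whose predicted label differs from the label predicted at x."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Uniform sampling in a d-ball: random direction, radius eps * U^(1/d).
    directions = rng.normal(size=(n_samples, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = eps * rng.random(n_samples) ** (1.0 / d)
    samples = x + directions * radii[:, None]
    base_label = predict(x[None, :])[0]
    return float(np.mean(predict(samples) != base_label))

def toy_predict(points):
    """Hypothetical classifier: label 1 when the first coordinate is positive."""
    return (points[:, 0] > 0.0).astype(int)

# A point far from the decision boundary versus a point lying on it.
far_vcp = estimate_eps_vcp(toy_predict, np.array([1.0, 0.0]), eps=0.5)
near_vcp = estimate_eps_vcp(toy_predict, np.array([0.0, 0.0]), eps=0.5)
```

In this toy setting, the estimate is zero for the point whose entire neighborhood lies on one side of the boundary and close to one half for the point on the boundary, which illustrates how the quantity depends on decision-boundary geometry.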

Title: Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey

Authors: Yunliang Qi, Meng Lou, Yimin Liu, Lu Li, Zhen Yang, Wen Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23248
Pdf URL: https://arxiv.org/pdf/2505.23248
Copy Paste: [[2505.23248]] Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey(https://arxiv.org/abs/2505.23248)
Keywords: super-resolution
Abstract: Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.

Title: UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

Authors: Yixun Liang, Kunming Luo, Xiao Chen, Rui Chen, Hongyu Yan, Weiyu Li, Jiarui Liu, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23253
Pdf URL: https://arxiv.org/pdf/2505.23253
Copy Paste: [[2505.23253]] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes(https://arxiv.org/abs/2505.23253)
Keywords: generation, generative
Abstract: We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first propose to lift texture generation into 3D space via Texture Functions (TFs), a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: this https URL.

Title: Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

Authors: Zheng Sun, Yi Wei, Long Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23265
Pdf URL: https://arxiv.org/pdf/2505.23265
Copy Paste: [[2505.23265]] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs(https://arxiv.org/abs/2505.23265)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) are widely applied across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved. However, the study of image screening is rare, and its performance with MLLMs is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples; each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates the aesthetic reasoning ability under four aspects: (1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, (4) Extension Rationality. For methodology, we utilize long chains of thought (CoT) and Group Relative Policy Optimization with Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we are able to surpass the score of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration in image aesthetic reasoning in the future.

Title: Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs

Authors: Haokun Chen, Yueqi Zhang, Yuan Bi, Yao Zhang, Tong Liu, Jinhe Bi, Jian Lan, Jindong Gu, Claudia Grosser, Denis Krompass, Nassir Navab, Volker Tresp
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.23270
Pdf URL: https://arxiv.org/pdf/2505.23270
Copy Paste: [[2505.23270]] Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs(https://arxiv.org/abs/2505.23270)
Keywords: generative
Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

Title: LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

Authors: Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, Min-Ling Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23271
Pdf URL: https://arxiv.org/pdf/2505.23271
Copy Paste: [[2505.23271]] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning(https://arxiv.org/abs/2505.23271)
Keywords: generation
Abstract: Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at this https URL.

Title: RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

Authors: Zhihong Tan, Jiayi Wang, Huiying Shi, Binyuan Huang, Hongchen Wei, Zhenzhong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23283
Pdf URL: https://arxiv.org/pdf/2505.23283
Copy Paste: [[2505.23283]] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries(https://arxiv.org/abs/2505.23283)
Keywords: generation
Abstract: Detecting forged remote sensing images is becoming increasingly critical, as such imagery plays a vital role in environmental monitoring, urban planning, and national security. While diffusion models have emerged as the dominant paradigm for image generation, their impact on remote sensing forgery detection remains underexplored. Existing benchmarks primarily target GAN-based forgeries or focus on natural images, limiting progress in this critical domain. To address this gap, we introduce RSFAKE-1M, a large-scale dataset of 500K forged and 500K real remote sensing images. The fake images are generated by ten diffusion models fine-tuned on remote sensing data, covering six generation conditions such as text prompts, structural guidance, and inpainting. This paper presents the construction of RSFAKE-1M along with a comprehensive experimental evaluation using both existing detectors and unified baselines. The results reveal that diffusion-based remote sensing forgeries remain challenging for current methods, and that models trained on RSFAKE-1M exhibit notably improved generalization and robustness. Our findings underscore the importance of RSFAKE-1M as a foundation for developing and evaluating next-generation forgery detection approaches in the remote sensing domain. The dataset and other supplementary materials are available at this https URL.

Title: GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation

Authors: Chikaha Tsuji, Enrique Flores Medina, Harshit Gupta, Md Ferdous Alam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23287
Pdf URL: https://arxiv.org/pdf/2505.23287
Copy Paste: [[2505.23287]] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation(https://arxiv.org/abs/2505.23287)
Keywords: generation, generative
Abstract: With the advancement of generative AI, research on its application to 3D model generation has gained traction, particularly in automating the creation of Computer-Aided Design (CAD) files from images. GenCAD is a notable model in this domain, leveraging an autoregressive transformer-based architecture with a contrastive learning framework to generate CAD programs. However, a major limitation of GenCAD is its inability to consistently produce feasible boundary representations (B-reps), with approximately 10% of generated designs being infeasible. To address this, we propose GenCAD-Self-Repairing, a framework that enhances the feasibility of generative CAD models through diffusion guidance and a self-repairing pipeline. This framework integrates a guided diffusion denoising process in the latent space and a regression-based correction mechanism to refine infeasible CAD command sequences while preserving geometric accuracy. Our approach successfully converted two-thirds of infeasible designs in the baseline method into feasible ones, significantly improving the feasibility rate while simultaneously maintaining a reasonable level of geometric accuracy between the point clouds of ground truth models and generated models. By significantly improving the feasibility rate of generating CAD models, our approach helps expand the availability of high-quality training data and enhances the applicability of AI-driven CAD generation in manufacturing, architecture, and product design.

Title: Score-based Generative Modeling for Conditional Independence Testing

Authors: Yixin Ren, Chenghou Jin, Yewei Xia, Li Ke, Longtao Huang, Hui Xue, Hao Zhang, Jihong Guan, Shuigeng Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23309
Pdf URL: https://arxiv.org/pdf/2505.23309
Copy Paste: [[2505.23309]] Score-based Generative Modeling for Conditional Independence Testing(https://arxiv.org/abs/2505.23309)
Keywords: generative
Abstract: Determining conditional independence (CI) relationships between random variables is a fundamental yet challenging task in machine learning and statistics, especially in high-dimensional settings. Existing generative model-based CI testing methods, such as those utilizing generative adversarial networks (GANs), often struggle with undesirable modeling of conditional distributions and training instability, resulting in subpar performance. To address these issues, we propose a novel CI testing method via score-based generative modeling, which achieves precise Type I error control and strong testing power. Concretely, we first employ a sliced conditional score matching scheme to accurately estimate conditional score and use Langevin dynamics conditional sampling to generate null hypothesis samples, ensuring precise Type I error control. Then, we incorporate a goodness-of-fit stage into the method to verify generated samples and enhance interpretability in practice. We theoretically establish the error bound of conditional distributions modeled by score-based generative models and prove the validity of our CI tests. Extensive experiments on both synthetic and real-world datasets show that our method significantly outperforms existing state-of-the-art methods, providing a promising way to revitalize generative model-based CI testing.

Title: TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models

Authors: Finn Carter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23312
Pdf URL: https://arxiv.org/pdf/2505.23312
Copy Paste: [[2505.23312]] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models(https://arxiv.org/abs/2505.23312)
Keywords: generative
Abstract: Text-to-image diffusion models have shown unprecedented generative capability, but their ability to produce undesirable concepts (e.g., pornographic content, sensitive identities, copyrighted styles) poses serious concerns for privacy, fairness, and safety. Concept erasure aims to remove or suppress specific concept information in a generative model. In this paper, we introduce TRACE (Trajectory-Constrained Attentional Concept Erasure), a novel method to erase targeted concepts from diffusion models while preserving overall generative quality. Our approach combines a rigorous theoretical framework, establishing formal conditions under which a concept can be provably suppressed in the diffusion process, with an effective fine-tuning procedure compatible with both conventional latent diffusion (Stable Diffusion) and emerging rectified flow models (e.g., FLUX). We first derive a closed-form update to the model's cross-attention layers that removes hidden representations of the target concept. We then introduce a trajectory-aware fine-tuning objective that steers the denoising process away from the concept only in the late sampling stages, thus maintaining the model's fidelity on unrelated content. Empirically, we evaluate TRACE on multiple benchmarks used in prior concept erasure studies (object classes, celebrity faces, artistic styles, and explicit content from the I2P dataset). TRACE achieves state-of-the-art performance, outperforming recent methods such as ANT, EraseAnything, and MACE in terms of removal efficacy and output quality.

Title: Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

Authors: Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, Bin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23325
Pdf URL: https://arxiv.org/pdf/2505.23325
Copy Paste: [[2505.23325]] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis(https://arxiv.org/abs/2505.23325)
Keywords: generation, generative
Abstract: Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is this https URL.

Title: Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

Authors: Matteo Gallici, Haitz Sáez de Ocáriz Borde
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23331
Pdf URL: https://arxiv.org/pdf/2505.23331
Copy Paste: [[2505.23331]] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization(https://arxiv.org/abs/2505.23331)
Keywords: generation, generative
Abstract: Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
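The group-relative idea behind GRPO can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each sample's reward is normalized against the other samples drawn for the same prompt; the function name, the reward values, and the `eps` stabilizer are illustrative and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: one group = several images sampled for the same prompt,
    each scored by some reward model (e.g. an aesthetic predictor).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward scores for four images sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and are reinforced; no separate value network is needed to form the baseline.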

Title: Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering

Authors: Sixian Wang, Zhiwei Tang, Tsung-Hui Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23343
Pdf URL: https://arxiv.org/pdf/2505.23343
Copy Paste: [[2505.23343]] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering(https://arxiv.org/abs/2505.23343)
Keywords: generation, generative
Abstract: Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning (e.g. DDPO [1]) and inference-time alignment techniques[2] aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD)--the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
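As a rough illustration of the Accumulated Score Differences (ASD) described above, the sketch below sums the gap between conditional and unconditional score predictions over the denoising steps seen so far and then ranks samples by that value. Several details are assumptions, since the abstract does not specify them: the L2 norm as the per-step gap, the `keep_fraction` threshold, and the direction of the filter (exposed here as the hypothetical `keep_largest` flag).

```python
import math

def accumulated_score_difference(cond_scores, uncond_scores):
    """Sum over steps of the L2 norm of (conditional - unconditional) score.

    Each argument is a list of per-step score vectors (lists of floats).
    The choice of L2 norm is an assumption for illustration.
    """
    total = 0.0
    for c, u in zip(cond_scores, uncond_scores):
        total += math.sqrt(sum((ci - ui) ** 2 for ci, ui in zip(c, u)))
    return total

def filter_by_asd(asd_values, keep_fraction=0.5, keep_largest=True):
    """Return sorted indices of the samples that continue denoising."""
    n_keep = max(1, int(len(asd_values) * keep_fraction))
    order = sorted(range(len(asd_values)),
                   key=lambda i: asd_values[i],
                   reverse=keep_largest)
    return sorted(order[:n_keep])

# Toy example: two samples, two early denoising steps, 2-D scores.
zeros = [[0.0, 0.0], [0.0, 0.0]]
asd_a = accumulated_score_difference([[1.0, 0.0], [0.0, 2.0]], zeros)
asd_b = accumulated_score_difference([[0.3, 0.4], [0.0, 0.0]], zeros)
kept = filter_by_asd([asd_a, asd_b], keep_fraction=0.5)
```

Because both score predictions are already computed at every step of classifier-free guidance, a statistic of this kind adds almost no cost and needs no external reward model.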

Title: Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching

Authors: Yexiong Lin, Yu Yao, Tongliang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23346
Pdf URL: https://arxiv.org/pdf/2505.23346
Copy Paste: [[2505.23346]] Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching(https://arxiv.org/abs/2505.23346)
Keywords: generation
Abstract: Flow Matching (FM) is an effective framework for training a model to learn a vector field that transports samples from a source distribution to a target distribution. To train the model, early FM methods use random couplings, which often result in crossing paths and lead the model to learn non-straight trajectories that require many integration steps to generate high-quality samples. To address this, recent methods adopt Optimal Transport (OT) to construct couplings by minimizing geometric distances, which helps reduce path crossings. However, we observe that such geometry-based couplings do not necessarily align with the model's preferred trajectories, making it difficult to learn the vector field induced by these couplings, which prevents the model from learning straight trajectories. Motivated by this, we propose Model-Aligned Coupling (MAC), an effective method that matches training couplings based not only on geometric distance but also on alignment with the model's preferred transport directions based on its prediction error. To avoid the time-costly match process, MAC proposes to select the top-$k$ fraction of couplings with the lowest error for training. Extensive experiments show that MAC significantly improves generation quality and efficiency in few-step settings compared to existing methods. Project page: this https URL
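The selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual linear interpolation path and straight-line velocity target of flow matching, and the names `select_low_error_couplings`, `predict_velocity`, and `keep_fraction` are hypothetical.

```python
def select_low_error_couplings(pairs, predict_velocity, t, keep_fraction=0.5):
    """Keep the fraction of (x0, x1) couplings with the lowest prediction error."""
    scored = []
    for x0, x1 in pairs:
        # Point on the straight path between source and target (assumed path).
        x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
        # Straight-line velocity target for this coupling.
        target = [b - a for a, b in zip(x0, x1)]
        pred = predict_velocity(x_t, t)
        err = sum((p - g) ** 2 for p, g in zip(pred, target))
        scored.append((err, (x0, x1)))
    scored.sort(key=lambda item: item[0])  # stable sort by error only
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [pair for _, pair in scored[:n_keep]]

# Toy model that always predicts a unit velocity along the first axis.
def toy_velocity(x_t, t):
    return [1.0, 0.0]

pairs = [
    ([0.0, 0.0], [1.0, 0.0]),  # error 0
    ([0.0, 0.0], [0.0, 1.0]),  # error 2
    ([0.0, 0.0], [2.0, 0.0]),  # error 1
    ([1.0, 1.0], [2.0, 1.0]),  # error 0
]
kept = select_low_error_couplings(pairs, toy_velocity, t=0.5)
```

Only the retained couplings would contribute to the training loss for that batch, so the model trains on pairs whose transport direction it already predicts well.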

Title: UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Authors: Weijia Mao, Zhenheng Yang, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23380
Pdf URL: https://arxiv.org/pdf/2505.23380
Copy Paste: [[2505.23380]] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning(https://arxiv.org/abs/2505.23380)
Keywords: generation
Abstract: Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released in this https URL.

Title: Automated Modeling Method for Pathloss Model Discovery

Authors: Ahmad Anaqreh, Shih-Kai Chou, Mihael Mohorčič, Carolina Fortuna
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.23383
Pdf URL: https://arxiv.org/pdf/2505.23383
Copy Paste: [[2505.23383]] Automated Modeling Method for Pathloss Model Discovery(https://arxiv.org/abs/2505.23383)
Keywords: generation
Abstract: Modeling propagation is the cornerstone for designing and optimizing next-generation wireless systems, with particular emphasis on the 5G-and-beyond era. Traditional modeling methods have long relied on statistics-based techniques to characterize propagation behavior across different environments. With the expansion of wireless communication systems, there is a growing demand for methods that guarantee the accuracy and interoperability of modeling. Artificial intelligence (AI)-based techniques, in particular, are increasingly being adopted to overcome this challenge, although interpretability is not assured with most of these methods. Inspired by recent advancements in AI, this paper proposes a novel approach that accelerates the discovery of path loss models while maintaining interpretability. The proposed method automates model formulation, evaluation, and refinement, facilitating model discovery. We evaluate two techniques: one based on Deep Symbolic Regression, offering full interpretability, and the second based on Kolmogorov-Arnold Networks, providing two levels of interpretability. Both approaches are evaluated on two synthetic and two real-world datasets. Our results show that Kolmogorov-Arnold Networks achieve R^2 values close to 1 with minimal prediction error, while Deep Symbolic Regression generates compact models with moderate accuracy. Moreover, on the selected examples, we demonstrate that automated methods outperform traditional methods, achieving up to a 75% reduction in prediction errors and offering accurate and explainable solutions with the potential to increase the efficiency of discovering next-generation path loss models.

Title: Video Editing for Audio-Visual Dubbing

Authors: Binyamin Manela, Sharon Gannot, Ethan Fetyaya
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23406
Pdf URL: https://arxiv.org/pdf/2505.23406
Copy Paste: [[2505.23406]] Video Editing for Audio-Visual Dubbing(https://arxiv.org/abs/2505.23406)
Keywords: generation
Abstract: Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing approaches often generate talking faces, hindering seamless integration into original scenes, or employ inpainting techniques that discard vital visual information like partial occlusions and lighting variations. This work introduces EdiDub, a novel framework that reformulates visual dubbing as a content-aware editing task. EdiDub preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications rather than mere copying. On multiple benchmarks, including a challenging occluded-lip dataset, EdiDub significantly improves identity preservation and synchronization. Human evaluations further confirm its superiority, achieving higher synchronization and visual naturalness scores compared to the leading methods. These results demonstrate that our content-aware editing approach outperforms traditional generation or inpainting, particularly in maintaining complex visual elements while ensuring accurate lip synchronization.

Title: Bidirectional predictive coding

Authors: Gaspard Oliviers, Mufeng Tang, Rafal Bogacz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23415
Pdf URL: https://arxiv.org/pdf/2505.23415
Copy Paste: [[2505.23415]] Bidirectional predictive coding(https://arxiv.org/abs/2505.23415)
Keywords: generative
Abstract: Predictive coding (PC) is an influential computational model of visual learning and inference in the brain. Classical PC was proposed as a top-down generative model, where the brain actively predicts upcoming visual inputs, and inference minimises the prediction errors. Recent studies have also shown that PC can be formulated as a discriminative model, where sensory inputs predict neural activities in a feedforward manner. However, experimental evidence suggests that the brain employs both generative and discriminative inference, while unidirectional PC models show degraded performance in tasks requiring bidirectional processing. In this work, we propose bidirectional PC (bPC), a PC model that incorporates both generative and discriminative inference while maintaining a biologically plausible circuit implementation. We show that bPC matches or outperforms unidirectional models in their specialised generative or discriminative tasks, by developing an energy landscape that simultaneously suits both tasks. We also demonstrate bPC's superior performance in two biologically relevant tasks including multimodal learning and inference with missing information, suggesting that bPC resembles biological visual inference more closely.

Title: CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Authors: Runmin Jiang, Genpei Zhang, Yuntian Yang, Siqi Wu, Yuheng Zhang, Wanyue Feng, Yizhou Zhao, Xi Xiao, Xiao Wang, Tianyang Wang, Xingjian Li, Min Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23444
Pdf URL: https://arxiv.org/pdf/2505.23444
Copy Paste: [[2505.23444]] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis(https://arxiv.org/abs/2505.23444)
Keywords: generation, generative
Abstract: Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.

Title: A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation

Authors: Shuzhou Sun, Li Liu, Tianpeng Liu, Shuaifeng Zhi, Ming-Ming Cheng, Janne Heikkilä, Yongxiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23451
Pdf URL: https://arxiv.org/pdf/2505.23451
Copy Paste: [[2505.23451]] A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation(https://arxiv.org/abs/2505.23451)
Keywords: generation
Abstract: Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector's inputs determine the classifier's inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector's inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier's inputs are treated as the confounder, and both the detector's inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier's inputs. Then, Maximum Information Sampling (MIS) is suggested to further enhance the reverse causality estimation by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show that RcSGG achieves a state-of-the-art mean recall rate.

Title: Diffusion Guidance Is a Controllable Policy Improvement Operator

Authors: Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.23458
Pdf URL: https://arxiv.org/pdf/2505.23458
Copy Paste: [[2505.23458]] Diffusion Guidance Is a Controllable Policy Improvement Operator(https://arxiv.org/abs/2505.23458)
Keywords: generative
Abstract: At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths, by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for "free" across the board.
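
The link between guidance weight and policy improvement can be illustrated with the standard classifier-free guidance combination. This is a minimal sketch, assuming the usual CFG form (unconditional output plus a weighted conditional-minus-unconditional term); the abstract does not state CFGRL's exact parameterization, and the function name `guided_prediction` and the toy vectors are hypothetical.

```python
import numpy as np

def guided_prediction(uncond, cond, weight):
    """Standard classifier-free guidance combination (assumed form).

    uncond: model output without the conditioning signal
    cond:   model output with the conditioning signal
    weight: guidance weight; 0 recovers uncond, 1 recovers cond,
            values above 1 extrapolate past the conditional output
    """
    uncond = np.asarray(uncond, dtype=float)
    cond = np.asarray(cond, dtype=float)
    return uncond + weight * (cond - uncond)

# Toy two-dimensional outputs, purely illustrative.
base = np.array([0.0, 1.0])
improved = np.array([1.0, 3.0])
for w in (0.0, 1.0, 2.0):
    print(w, guided_prediction(base, improved, w))
```

In this reading, raising `weight` moves the sampled behavior further from the data policy in the direction of the conditioned one, which is the knob the abstract reports as correlating with performance.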

Title: LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter

Authors: Runyi Li, Bin Chen, Jian Zhang, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23462
Pdf URL: https://arxiv.org/pdf/2505.23462
Copy Paste: [[2505.23462]] LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter(https://arxiv.org/abs/2505.23462)
Keywords: restoration
Abstract: Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Additionally, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of the diffusion prior on just 0.9% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods while reducing training time by 70%. Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.

Title: VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Authors: Shi-Xue Zhang, Hongfa Wang, Duojun Huang, Xin Li, Xiaobin Zhu, Xu-Cheng Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23484
Pdf URL: https://arxiv.org/pdf/2505.23484
Copy Paste: [[2505.23484]] VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation(https://arxiv.org/abs/2505.23484)
Keywords: generation
Abstract: Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), and an automated evaluation pipeline leveraging a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and codes are available at website: this https URL.

Title: R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation

Authors: Kaijie Chen, Zihao Lin, Zhiyang Xu, Ying Shen, Yuguang Yao, Joy Rimchala, Jiaxin Zhang, Lifu Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.23493
Pdf URL: https://arxiv.org/pdf/2505.23493
Copy Paste: [[2505.23493]] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation(https://arxiv.org/abs/2505.23493)
Keywords: generation
Abstract: Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: this https URL

Title: Maximum Likelihood Learning of Latent Dynamics Without Reconstruction

Authors: Samo Hromadka, Kai Biegun, Lior Fox, James Heald, Maneesh Sahani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.23569
Pdf URL: https://arxiv.org/pdf/2505.23569
Copy Paste: [[2505.23569]] Maximum Likelihood Learning of Latent Dynamics Without Reconstruction(https://arxiv.org/abs/2505.23569)
Keywords: generative
Abstract: We introduce a novel unsupervised learning method for time series data with latent dynamical structure: the recognition-parametrized Gaussian state space model (RP-GSSM). The RP-GSSM is a probabilistic model that learns Markovian Gaussian latents explaining statistical dependence between observations at different time steps, combining the intuition of contrastive methods with the flexible tools of probabilistic generative models. Unlike contrastive approaches, the RP-GSSM is a valid probabilistic model learned via maximum likelihood. Unlike generative approaches, the RP-GSSM has no need for an explicit network mapping from latents to observations, allowing it to focus model capacity on inference of latents. The model is both tractable and expressive: it admits exact inference thanks to its jointly Gaussian latent prior, while maintaining expressivity with an arbitrarily nonlinear neural network link between observations and latents. These qualities allow the RP-GSSM to learn task-relevant latents without ad-hoc regularization, auxiliary losses, or optimizer scheduling. We show how this approach outperforms alternatives on problems that include learning nonlinear stochastic dynamics from video, with or without background distractors. Our results position the RP-GSSM as a useful foundation model for a variety of downstream applications.

Title: BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

Authors: Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.23579
Pdf URL: https://arxiv.org/pdf/2505.23579
Copy Paste: [[2505.23579]] BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model(https://arxiv.org/abs/2505.23579)
Keywords: generation
Abstract: Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherently transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction (where accuracy improves from 88% to 97%) and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at this https URL

Title: LLM Performance for Code Generation on Noisy Tasks

Authors: Radzim Sendyka, Christian Cabrera, Andrei Paleyes, Diana Robinson, Neil Lawrence
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2505.23598
Pdf URL: https://arxiv.org/pdf/2505.23598
Copy Paste: [[2505.23598]] LLM Performance for Code Generation on Noisy Tasks(https://arxiv.org/abs/2505.23598)
Keywords: generation
Abstract: This paper investigates the ability of large language models (LLMs) to recognise and solve tasks which have been obfuscated beyond recognition. Focusing on competitive programming and benchmark tasks (LeetCode and MATH), we compare performance across multiple models and obfuscation methods, such as noise and redaction. We demonstrate that all evaluated LLMs can solve tasks obfuscated to a level where the text would be unintelligible to human readers, and does not contain key pieces of instruction or context. We introduce the concept of eager pattern matching to describe this behaviour, which is not observed in tasks published after the models' knowledge cutoff date, indicating strong memorisation or overfitting to training data, rather than legitimate reasoning about the presented problem. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We discuss the implications for benchmarking and evaluations of model behaviour, arguing for caution when designing experiments using standard datasets. We also propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination and highlighting potential safety risks and interpretability issues for automated software systems.
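
The two obfuscation families named in the abstract, noise and redaction, can be sketched as simple text perturbations. This is a minimal illustration, assuming character-level replacement for noise and word-level masking for redaction; the paper's actual procedures are not described in the abstract, and the function names, the `[REDACTED]` mask, and the sample prompt are hypothetical.

```python
import random

def add_char_noise(text, rate, seed=0):
    """Replace each non-space character with a random lowercase letter
    with probability `rate` (assumed noise model)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if (not ch.isspace() and rng.random() < rate) else ch
        for ch in text
    )

def redact_words(text, rate, seed=0, mask="[REDACTED]"):
    """Replace each whitespace-separated word with `mask`
    with probability `rate` (assumed redaction model)."""
    rng = random.Random(seed)
    return " ".join(mask if rng.random() < rate else word for word in text.split())

# Sweep the obfuscation level; a model's solve rate would be measured at each level.
prompt = "Return the indices of two numbers that add up to the target."
for rate in (0.0, 0.3, 0.6):
    print(rate, "|", add_char_noise(prompt, rate), "|", redact_words(prompt, rate))
```

Plotting solve rate against `rate` for tasks published before and after a model's knowledge cutoff would give the kind of decay curve the abstract proposes as a contamination signal.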

Title: Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Authors: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.23606
Pdf URL: https://arxiv.org/pdf/2505.23606
Copy Paste: [[2505.23606]] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model(https://arxiv.org/abs/2505.23606)
Keywords: generation
Abstract: Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

Title: Inference-time Scaling of Diffusion Models through Classical Search

Authors: Xiangcheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, Yilun Du
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.23614
Pdf URL: https://arxiv.org/pdf/2505.23614
Copy Paste: [[2505.23614]] Inference-time Scaling of Diffusion Models through Classical Search(https://arxiv.org/abs/2505.23614)
Keywords: generation, generative
Abstract: Classical search algorithms have long underpinned modern artificial intelligence. In this work, we tackle the challenge of inference-time control in diffusion models -- adapting generated outputs to meet diverse test-time objectives -- using principles from classical search. We propose a general framework that orchestrates local and global search to efficiently navigate the generative space. It employs a theoretically grounded local search via annealed Langevin MCMC and performs compute-efficient global exploration using breadth-first and depth-first tree search. We evaluate our approach on a range of challenging domains, including planning, offline reinforcement learning, and image generation. Across all tasks, we observe significant gains in both performance and efficiency. These results show that classical search provides a principled and practical foundation for inference-time scaling in diffusion models. Project page at this http URL.
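
The local-search component mentioned above, annealed Langevin MCMC, can be sketched on a toy target. This is a minimal illustration, assuming the textbook tempered Langevin update and a one-dimensional Gaussian target; the paper's actual target distribution, temperature schedule, and its combination with tree search are not given in the abstract, and all names below are hypothetical.

```python
import numpy as np

def annealed_langevin(x0, grad_log_p, step_size, temperatures, rng):
    """Run one Langevin update per temperature in the schedule.

    Update rule (textbook form, assumed here):
        x <- x + step_size * grad_log_p(x) + sqrt(2 * step_size * T) * noise
    A temperature of 0 reduces the update to plain gradient ascent on log p.
    """
    x = np.asarray(x0, dtype=float)
    for temp in temperatures:
        noise = rng.standard_normal(x.shape)
        x = x + step_size * grad_log_p(x) + np.sqrt(2.0 * step_size * temp) * noise
    return x

# Toy target: unit-variance Gaussian centred at 3, so grad log p(x) = -(x - 3).
grad = lambda x: -(x - 3.0)
rng = np.random.default_rng(0)
schedule = np.linspace(1.0, 0.0, 200)  # anneal from exploratory to greedy
print(annealed_langevin(np.array([0.0]), grad, 0.1, schedule, rng))
```

Early high-temperature steps explore around the mode, while the late low-temperature steps settle onto it.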

Title: MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment

Authors: John Halloran
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.23634
Pdf URL: https://arxiv.org/pdf/2505.23634
Copy Paste: [[2505.23634]] MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment(https://arxiv.org/abs/2505.23634)
Keywords: generation, generative
Abstract: The model context protocol (MCP) has been widely adopted as an open standard enabling the seamless integration of generative AI agents. However, recent work has shown the MCP is susceptible to retrieval-based "falsely benign" attacks (FBAs), allowing malicious system access and credential theft, but requiring that users download compromised files directly to their systems. Herein, we show that the threat model of MCP-based attacks is significantly broader than previously thought, i.e., attackers need only post malicious content online to deceive MCP agents into carrying out their attacks on unsuspecting victims' systems. To improve alignment guardrails against such attacks, we introduce a new MCP dataset of FBAs and (truly) benign samples to explore the effectiveness of direct preference optimization (DPO) for the refusal training of large language models (LLMs). While DPO improves model guardrails against such attacks, we show that the efficacy of refusal learning varies drastically depending on the model's original post-training alignment scheme; e.g., GRPO-based LLMs learn to refuse extremely poorly. Thus, to further improve FBA refusals, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a novel preference alignment strategy based on RAG. We show that RAG-Pref significantly improves the ability of LLMs to refuse FBAs, particularly when combined with DPO alignment, thus drastically improving guardrails against MCP-based attacks.
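
The direct preference optimization (DPO) objective used for refusal training can be written compactly. The following is a minimal sketch of the standard per-pair DPO loss, assuming the "chosen" response is a refusal of an FBA prompt and the "rejected" response is compliance with it; the function name, argument names, and the value of `beta` are illustrative and not taken from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trained policy or the frozen reference model.
    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy that already prefers the refusal more than the reference does
# incurs a smaller loss than one that prefers compliance.
print(dpo_loss(-10.0, -30.0, -20.0, -20.0))
print(dpo_loss(-30.0, -10.0, -20.0, -20.0))
```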

Title: VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Authors: Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23656
Pdf URL: https://arxiv.org/pdf/2505.23656
Copy Paste: [[2505.23656]] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models(https://arxiv.org/abs/2505.23656)
Keywords: generation
Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at this https URL.

Title: D-AR: Diffusion via Autoregressive Models

Authors: Ziteng Gao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23660
Pdf URL: https://arxiv.org/pdf/2505.23660
Copy Paste: [[2505.23660]] D-AR: Diffusion via Autoregressive Models(https://arxiv.org/abs/2505.23660)
Keywords: generation
Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties: for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at this https URL

Title: OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Authors: Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23661
Pdf URL: https://arxiv.org/pdf/2505.23661
Copy Paste: [[2505.23661]] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2505.23661)
Keywords: generation
Abstract: In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at this https URL.

Title: AMBER: Adaptive Mesh Generation by Iterative Mesh Resolution Prediction

Authors: Niklas Freymuth, Tobias Würth, Nicolas Schreiber, Balazs Gyenes, Andreas Boltres, Johannes Mitsch, Aleksandar Taranovic, Tai Hoang, Philipp Dahlinger, Philipp Becker, Luise Kärger, Gerhard Neumann
Subjects: cs.LG, cs.CG
Abstract URL: https://arxiv.org/abs/2505.23663
Pdf URL: https://arxiv.org/pdf/2505.23663
Copy Paste: [[2505.23663]] AMBER: Adaptive Mesh Generation by Iterative Mesh Resolution Prediction(https://arxiv.org/abs/2505.23663)
Keywords: generation
Abstract: The cost and accuracy of simulating complex physical systems using the Finite Element Method (FEM) scale with the resolution of the underlying mesh. Adaptive meshes improve computational efficiency by refining resolution in critical regions, but typically require task-specific heuristics or cumbersome manual design by a human expert. We propose Adaptive Meshing By Expert Reconstruction (AMBER), a supervised learning approach to mesh adaptation. Starting from a coarse mesh, AMBER iteratively predicts the sizing field, i.e., a function mapping from the geometry to the local element size of the target mesh, and uses this prediction to produce a new intermediate mesh using an out-of-the-box mesh generator. This process is enabled through a hierarchical graph neural network, and relies on data augmentation by automatically projecting expert labels onto AMBER-generated data during training. We evaluate AMBER on 2D and 3D datasets, including classical physics problems, mechanical components, and real-world industrial designs with human expert meshes. AMBER generalizes to unseen geometries and consistently outperforms multiple recent baselines, including ones using Graph and Convolutional Neural Networks, and Reinforcement Learning-based approaches.
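
The predict-then-remesh loop described above can be illustrated in one dimension. This is a toy sketch only, assuming a 1D domain, a trivial node-placement rule standing in for the out-of-the-box mesh generator, and a hand-written function standing in for the learned graph neural network; none of these names or choices come from the paper.

```python
def mesh_from_sizing_field(sizing_field, start=0.0, end=1.0, min_size=1e-3):
    """Toy 1D mesh generator: each element length follows the sizing
    field evaluated at the element's left node."""
    nodes = [start]
    while nodes[-1] < end:
        h = max(sizing_field(nodes[-1]), min_size)  # guard against zero-size elements
        nodes.append(min(nodes[-1] + h, end))
    return nodes

def iterative_meshing(predict_sizing_field, num_iterations=3):
    """Start from a coarse mesh, then alternate prediction and remeshing."""
    mesh = mesh_from_sizing_field(lambda x: 0.5)
    for _ in range(num_iterations):
        field = predict_sizing_field(mesh)  # hypothetical stand-in for the learned model
        mesh = mesh_from_sizing_field(field)
    return mesh

def toy_predictor(mesh):
    """Hypothetical predictor: request small elements near x = 0.5."""
    return lambda x: 0.02 + 0.3 * abs(x - 0.5)

mesh = iterative_meshing(toy_predictor)
print(len(mesh), mesh[:4])
```

The sizing field is simply a function from position to target element size, so a finer mesh near a critical region corresponds to smaller values of the field there.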

Title: ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer

Authors: Moinak Bhattacharya, Judy Huang, Amna F. Sher, Gagandeep Singh, Chao Chen, Prateek Prasanna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23675
Pdf URL: https://arxiv.org/pdf/2505.23675
Copy Paste: [[2505.23675]] ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer(https://arxiv.org/abs/2505.23675)
Keywords: generative
Abstract: Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. Additionally, we introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, to refine the generative process. Furthermore, a clinical variable conditioning mechanism is introduced that leverages demographic data, blood-based biomarkers, and PD-L1 expression. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.
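Summary: Accurately predicting immunotherapy response in non-small cell lung cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning predictive models rely mainly on pre-treatment imaging to predict categorical response outcomes, which limits their ability to capture the complex morphological and textural changes induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model that synthesizes post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The framework integrates anatomical priors, specifically lobar and vascular structures, to improve the fidelity of CT synthesis. It also introduces the cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, and a clinical variable conditioning mechanism that uses demographic data, blood-based biomarkers, and PD-L1 expression to refine the generative process. Evaluation on an in-house NSCLC cohort treated with immune checkpoint inhibitors shows a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.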

Title: VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.23693
Pdf URL: https://arxiv.org/pdf/2505.23693
Copy Paste: [[2505.23693]] VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos(https://arxiv.org/abs/2505.23693)
Keywords: generation
Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
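Summary: MLLMs have recently been studied extensively for video question answering. However, most existing assessments focus on natural videos and overlook synthetic videos such as AI-generated content (AIGC). Meanwhile, some video generation work relies on MLLMs to judge the quality of generated videos, yet the ability of MLLMs to interpret AIGC videos remains largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to perform consistently well across all tasks, which highlights how challenging the benchmark is. In addition, to investigate the practical use of VF-Eval for improving video generation, we conduct an experiment, RePrompt, showing that aligning MLLMs more closely with human feedback can benefit video generation.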

Title: DiCoFlex: Model-agnostic diverse counterfactuals with flexible control

Authors: Oleksii Furman, Ulvi Movsum-zada, Patryk Marszalek, Maciej Zięba, Marek Śmieja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.23700
Pdf URL: https://arxiv.org/pdf/2505.23700
Copy Paste: [[2505.23700]] DiCoFlex: Model-agnostic diverse counterfactuals with flexible control(https://arxiv.org/abs/2505.23700)
Keywords: generation, generative
Abstract: Counterfactual explanations play a pivotal role in explainable artificial intelligence (XAI) by offering intuitive, human-understandable alternatives that elucidate machine learning model decisions. Despite their significance, existing methods for generating counterfactuals often require constant access to the predictive model, involve computationally intensive optimization for each instance and lack the flexibility to adapt to new user-defined constraints without retraining. In this paper, we propose DiCoFlex, a novel model-agnostic, conditional generative framework that produces multiple diverse counterfactuals in a single forward pass. Leveraging conditional normalizing flows trained solely on labeled data, DiCoFlex addresses key limitations by enabling real-time user-driven customization of constraints such as sparsity and actionability at inference time. Extensive experiments on standard benchmark datasets show that DiCoFlex outperforms existing methods in terms of validity, diversity, proximity, and constraint adherence, making it a practical and scalable solution for counterfactual generation in sensitive decision-making domains.
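Summary: Counterfactual explanations play a pivotal role in explainable artificial intelligence (XAI) by offering intuitive, human-understandable alternatives that clarify the decisions of machine learning models. Despite their importance, existing methods for generating counterfactuals often require constant access to the predictive model, involve computationally intensive optimization for each instance, and lack the flexibility to adapt to new user-defined constraints without retraining. In this paper, we propose DiCoFlex, a novel model-agnostic, conditional generative framework that produces multiple diverse counterfactuals in a single forward pass. Using conditional normalizing flows trained solely on labeled data, DiCoFlex addresses these key limitations by enabling real-time, user-driven customization of constraints such as sparsity and actionability at inference time. Extensive experiments on standard benchmark datasets show that DiCoFlex outperforms existing methods in validity, diversity, proximity, and constraint adherence, making it a practical and scalable solution for counterfactual generation in sensitive decision-making domains.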

Title: PixelThink: Towards Efficient Chain-of-Pixel Reasoning

Authors: Song Wang, Gongfan Fang, Lingdong Kong, Xiangtai Li, Jianyun Xu, Sheng Yang, Qiang Li, Jianke Zhu, Xinchao Wang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.23727
Pdf URL: https://arxiv.org/pdf/2505.23727
Copy Paste: [[2505.23727]] PixelThink: Towards Efficient Chain-of-Pixel Reasoning(https://arxiv.org/abs/2505.23727)
Keywords: generation
Abstract: Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through group-relative policy optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking - producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.
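Summary: Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, without an explicit reasoning process they generalize poorly to out-of-distribution scenarios. Although recent efforts use reinforcement learning through group-relative policy optimization (GRPO) to strengthen reasoning ability, they often suffer from overthinking: they produce uniformly verbose reasoning chains regardless of task complexity. This raises computational cost and limits control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length according to scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics that jointly assess segmentation accuracy, reasoning quality, and efficiency. Experimental results show that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.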

Title: How Animals Dance (When You're Not Looking)

Authors: Xiaojuan Wang, Aleksander Holynski, Brian Curless, Ira Kemelmacher, Steve Seitz
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2505.23738
Pdf URL: https://arxiv.org/pdf/2505.23738
Copy Paste: [[2505.23738]] How Animals Dance (When You're Not Looking)(https://arxiv.org/abs/2505.23738)
Keywords: generation
Abstract: We present a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos. Starting from a few keyframes representing distinct animal poses -- generated via text-to-image prompting or GPT-4o -- we formulate dance synthesis as a graph optimization problem: find the optimal keyframe structure that satisfies a specified choreography pattern of beats, which can be automatically estimated from a reference dance video. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce up to 30-second dance videos across a wide range of animals and music tracks.
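Summary: We present a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos. Starting from a few keyframes that represent distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem: find the optimal keyframe structure that satisfies a specified choreography pattern of beats, which can be estimated automatically from a reference dance video. We also introduce an approach for mirrored pose image generation, which is essential for capturing symmetry in dance. The in-between frames are synthesized with a video diffusion model. With as few as six input keyframes, our method can produce dance videos of up to 30 seconds across a wide range of animals and music tracks.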

Title: MAGREF: Masked Guidance for Any-Reference Video Generation

Authors: Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23742
Pdf URL: https://arxiv.org/pdf/2505.23742
Copy Paste: [[2505.23742]] MAGREF: Masked Guidance for Any-Reference Video Generation(https://arxiv.org/abs/2505.23742)
Keywords: generation, generative
Abstract: Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: this https URL
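Summary: Video generation has made substantial progress with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that lets a single model flexibly handle inference with various subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, and outperforms existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: this https URL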

Title: DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP

Authors: Amber Yijia Zheng, Yu Zhang, Jun Hu, Raymond A. Yeh, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23743
Pdf URL: https://arxiv.org/pdf/2505.23743
Copy Paste: [[2505.23743]] DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP(https://arxiv.org/abs/2505.23743)
Keywords: generative
Abstract: High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models often minimize pixel errors and result in oversmoothing of low-light photos or deep shadows. Recent work has attempted to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework to enhance low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments demonstrate that our method outperforms the state-of-the-art in perceptual quality across three challenging low-light raw image benchmarks.
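Summary: High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models typically minimize pixel errors, which leads to oversmoothing of low-light photos or deep shadows. Recent work has tried to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework that enhances low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments show that our method outperforms the state of the art in perceptual quality across three challenging low-light raw image benchmarks.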

Title: To Trust Or Not To Trust Your Vision-Language Model's Prediction

Authors: Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23745
Pdf URL: https://arxiv.org/pdf/2505.23745
Copy Paste: [[2505.23745]] To Trust Or Not To Trust Your Vision-Language Model's Prediction(https://arxiv.org/abs/2505.23745)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at this https URL.
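Summary: Vision-Language Models (VLMs) have shown strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. Although they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification and often produce confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can have severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the modality gap observed in VLMs and by the insight that certain concepts are represented more distinctly in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach on 17 diverse datasets with 4 architectures and 2 VLMs and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 over existing baselines. By improving the reliability of the model without retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at this https URL.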

Title: LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

Authors: Yusuf Dalva, Hidir Yesiltepe, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23758
Pdf URL: https://arxiv.org/pdf/2505.23758
Copy Paste: [[2505.23758]] LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers(https://arxiv.org/abs/2505.23758)
Keywords: generation
Abstract: We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical `photoshop-with-LoRAs' tool and opens new avenues for compositional visual storytelling and rapid creative iteration.
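Summary: We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We use this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within the regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments show that LoRAShop delivers better identity preservation than the baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical "photoshop-with-LoRAs" tool and opens new avenues for compositional visual storytelling and rapid creative iteration.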