2025-05-27

Title: Model-Distributed Inference for Large Language Models at the Edge

Authors: Davide Macario, Hulya Seferoglu, Erdem Koyuncu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18164
Pdf URL: https://arxiv.org/pdf/2505.18164
Copy Paste: [[2505.18164]] Model-Distributed Inference for Large Language Models at the Edge(https://arxiv.org/abs/2505.18164)
Keywords: generation
Abstract: We introduce Model-Distributed Inference for Large-Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large-language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the "recurrent pipeline parallelism" technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.
摘要：我们介绍了大型语言模型（MDI-LLM）的模型分布的推断，这是一个新颖的框架，旨在促进在边缘的低功率设备上部署最先进的大语言模型（LLMS）。这是通过将模型分为多个分区来完成的，然后将模型分配给网络中的不同设备/节点。这些节点通过设备到设备链接交换中间激活向量，从而实现协作计算。为了提高此过程的效率，我们提出了“经常性管道并行性”技术，该技术降低了每个设备上的空闲时间，并促进了在多个文本序列的生成过程中并行推断。通过利用多个边缘设备的组合计算资源，MDI-LLM可以部署超过单个设备内存能力的LLM，从而可以对低成本硬件进行推断。此外，随着参与设备的数量增加，MDI-LLM会增加令牌生成的吞吐量，并减少每个设备的内存消耗。
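
Illustration (not from the paper): the abstract describes splitting the model into partitions on different nodes and keeping every node busy by generating several sequences at once. The sketch below assumes contiguous, near-equal layer chunks and a simple ring in which each sequence advances one device per step; the names `partition_layers` and `ring_schedule` are hypothetical and this is only one way to read "recurrent pipeline parallelism".

```python
def partition_layers(num_layers, num_devices):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal chunks."""
    base, extra = divmod(num_layers, num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def ring_schedule(num_devices, num_sequences, num_steps):
    """For each time step, list the sequence each device works on (None = idle).

    Assumption: sequence s enters device 0 at step s, moves one device forward
    per step, and wraps back to device 0 to start its next token.
    """
    if num_sequences > num_devices:
        raise ValueError("this toy schedule needs num_sequences <= num_devices")
    schedule = []
    for t in range(num_steps):
        row = [None] * num_devices
        for s in range(num_sequences):
            if t >= s:
                row[(t - s) % num_devices] = s
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    print(partition_layers(10, 3))
    for row in ring_schedule(3, 3, 6):   # three sequences: no idle device after warm-up
        print(row)
    for row in ring_schedule(3, 1, 6):   # one sequence: two of three devices idle each step
        print(row)
```

With a single sequence only one device is active per step; with as many sequences as devices, every device has work once the pipeline is full, which is the idle-time reduction the abstract refers to.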

Title: Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation

Authors: Feifan Wang, Tengfei Song, Minggui He, Chang Su, Zhanglin Wu, Hao Yang, Wenming Zheng, Osamu Yoshie
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2505.18168
Pdf URL: https://arxiv.org/pdf/2505.18168
Copy Paste: [[2505.18168]] Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation(https://arxiv.org/abs/2505.18168)
Keywords: generation
Abstract: Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which cost-effectively generates high-quality instruction data for multi-grained emotion analysis using a closed-source VLLM. This approach integrates prior human knowledge into VLLM inference, guided by the inherent correlations between three granularity levels of emotion description, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM's corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.
摘要：视觉中的面部情感感知大语言模型（VLLM）对于实现自然人机相互作用至关重要。但是，为粗粒和细粒度的面部情感分析创建高质量的注释需要昂贵的专业知识。缺乏这样的高质量指导数据限制了VLLM在面部情绪感知中的表现。为了解决这个问题，我们提出了一种自我验证的方法，并通过情绪知识增强（SEKE），该方法使用封闭源VLLM生成了高质量的教学数据，以成本效率地进行多元透明的情绪分析。这种方法将先前的人类知识与VLLM推论相结合，并在三个粒度的情绪描述之间的固有相关性（即离散表达，价值和行动单元）之间的固有相关性提供，以可靠地产生全面的注释。具有不确定性感知的蒙特卡洛采样（SV-UAMC）的自我验证策略进一步嵌入，以有效提取更准确的VLLM预测，从而进一步提高注释可靠性。因此，我们构建了一个面部情感教学数据集（FEID），其中包含三个综合描述，该描述为有效的模型培训提供了粗糙和细粒度的情感信息。此外，我们引入了面部情感分析基准（FEAB），以测量VLLM相应的能力。在三个下游面部情感分析任务上，我们的方法大大优于最先进的方法。

Title: Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry

Authors: Antoine Collas, Ce Ju, Nicolas Salvy, Bertrand Thirion
Subjects: cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18193
Pdf URL: https://arxiv.org/pdf/2505.18193
Copy Paste: [[2505.18193]] Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry(https://arxiv.org/abs/2505.18193)
Keywords: generative
Abstract: Generating realistic brain connectivity matrices is key to analyzing population heterogeneity in brain organization, understanding disease, and augmenting data in challenging classification problems. Functional connectivity matrices lie in constrained spaces--such as the set of symmetric positive definite or correlation matrices--that can be modeled as Riemannian manifolds. However, using Riemannian tools typically requires redefining core operations (geodesics, norms, integration), making generative modeling computationally inefficient. In this work, we propose DiffeoCFM, an approach that enables conditional flow matching (CFM) on matrix manifolds by exploiting pullback metrics induced by global diffeomorphisms on Euclidean spaces. We show that Riemannian CFM with such metrics is equivalent to applying standard CFM after data transformation. This equivalence allows efficient vector field learning, and fast sampling with standard ODE solvers. We instantiate DiffeoCFM with two different settings: the matrix logarithm for covariance matrices and the normalized Cholesky decomposition for correlation matrices. We evaluate DiffeoCFM on three large-scale fMRI datasets with more than 4600 scans from 2800 subjects (ADNI, ABIDE, OASIS-3) and two EEG motor imagery datasets with over 30000 trials from 26 subjects (BNCI2014-002 and BNCI2015-001). It enables fast training and achieves state-of-the-art performance, all while preserving manifold constraints.
摘要：产生现实的大脑连通性矩阵是分析大脑组织中人口异质性，了解疾病并增加挑战性分类问题中数据的关键。功能连通性矩阵位于受约束的空间中，例如一组对称正定或相关矩阵 - 可以将其建模为Riemannian歧管。但是，使用Riemannian工具通常需要重新定义核心操作（地球学，规范，集成），从而使生成建模的计算效率低下。在这项工作中，我们提出了DIFFEOCFM，该方法通过利用由欧几里得空间上全局差异性诱导的回调指标来实现基质歧管上有条件流量匹配（CFM）的方法。我们表明，具有此类指标的Riemannian CFM等于在数据转换后应用标准CFM。这种等效性允许有效的矢量场学习，并使用标准ODE求解器进行快速采样。我们使用两种不同的设置实例化diffeOCFM：用于协方差矩阵的矩阵对数和用于相关矩阵的归一化cholesky分解。我们评估了三个大型fMRI数据集的DIFFEOCFM，其中2800名受试者（ADNI，ABIDE，OASIS，OASIS-3）和两个EEG Motor Imagery DataSets进行了4600多次扫描，并具有来自26个受试者的30000多个试验（BNCI2014-002和BNCI20152015-20122015--001）。它可以快速培训并实现最先进的性能，同时保留多种多样的约束。
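
Illustration (not from the paper): the abstract states that Riemannian CFM under a pullback metric is equivalent to standard CFM after a data transformation, with the matrix logarithm as the transformation for covariance matrices. The NumPy sketch below assumes the common straight-line CFM path with target velocity `x1 - x0`; the helper names and the toy 4x4 data are hypothetical.

```python
import numpy as np


def spd_log(S):
    """Matrix logarithm of a symmetric positive definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T


def sym_exp(X):
    """Matrix exponential of a symmetric matrix; the result is always SPD."""
    w, V = np.linalg.eigh(X)
    return (V * np.exp(w)) @ V.T


def cfm_pair(x0, x1, t):
    """Euclidean CFM training pair: point on the straight path and its target velocity."""
    return (1.0 - t) * x0 + t * x1, x1 - x0


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = A @ A.T + 4.0 * np.eye(4)        # toy SPD "data" matrix
noise = rng.normal(size=(4, 4))
x0 = 0.5 * (noise + noise.T)         # symmetric noise sample in the log (Euclidean) domain
x1 = spd_log(S)                      # data mapped through the diffeomorphism

xt, target_velocity = cfm_pair(x0, x1, t=0.3)
St = sym_exp(xt)                     # mapped back to the manifold: SPD by construction
```

Because all learning and ODE integration happen on symmetric matrices in a flat space, no manifold-specific geodesic or norm is needed, and mapping back with `sym_exp` keeps every sample positive definite.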

Title: Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18227
Pdf URL: https://arxiv.org/pdf/2505.18227
Copy Paste: [[2505.18227]] Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality(https://arxiv.org/abs/2505.18227)
Keywords: generative
Abstract: In Transformer architectures, tokens, discrete units derived from raw data, are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
摘要：在变压器体系结构中，代币\ textemdash离散单元从原始数据\ textemdash衍生而成，是通过将输入分割为固定长度的块来形成的。然后将每个令牌映射到一个嵌入式中，从而可以在保留输入的基本信息的同时进行并行注意计算。由于变压器自发机制的二次计算复杂性，令牌还原主要被用作效率策略。在单一视觉和语言领域中尤其如此，在这种视觉和语言领域中，它有助于平衡计算成本，内存使用和推论延迟。尽管取得了这些进步，但本文认为，在大型生成模型时代，降低代币应超越其传统面向效率的作用。取而代之的是，我们将其定位为生成建模的基本原则，严重影响模型架构和更广泛的应用。具体而言，我们认为，跨视觉，语言和多模式系统，降低令牌可以：（i）促进更深层的多模式整合和对齐，（ii）减轻“过度思考”和幻觉，（iii）在长期输入中保持连贯性，以及（iv）增强训练稳定性，等等。通过这样做，我们概述了有希望的未来方向，包括算法设计，增强学习引导的令牌减少，对文化学习的代币优化以及更广泛的ML和科学领域。我们强调了它推动新的模型体系结构和学习策略的潜力，以提高鲁棒性，提高可解释性并更好地与生成建模的目标保持一致。
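
Illustration (not from the paper): the efficiency argument in the abstract rests on self-attention scoring every token pair, so halving the token count cuts that cost to a quarter. The sketch below shows this arithmetic and one simple reduction strategy, score-based pruning; the per-token importance scores are a hypothetical input (for example, attention received), and the function names are made up for this example.

```python
def attention_pair_count(num_tokens):
    """Number of query-key pairs scored by full self-attention (quadratic in tokens)."""
    return num_tokens * num_tokens


def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    full = attention_pair_count(1024)
    half = attention_pair_count(512)
    print(full // half)  # ratio of pair counts after halving the sequence
    print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.7], keep=2))
```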

Title: Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models

Authors: Louis Béthune, David Vigouroux, Yilun Du, Rufin VanRullen, Thomas Serre, Victor Boutin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18230
Pdf URL: https://arxiv.org/pdf/2505.18230
Copy Paste: [[2505.18230]] Follow the Energy, Find the Path: Riemannian Metrics from Energy-Based Models(https://arxiv.org/abs/2505.18230)
Keywords: generative
Abstract: What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold -- requiring a Riemannian metric to describe the space's local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) -- a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics -- shortest paths that follow the data manifold's intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation.
摘要：位于高维空间中的两个数据点之间的最短路径是什么？虽然答案在欧几里得的几何形状中是微不足道的，但当数据位于弯曲的歧管上时，它变得更加复杂 - 需要一个Riemannian指标来描述该空间的局部曲率。但是，估计这种度量仍然是高维度的主要挑战。在这项工作中，我们提出了一种直接从验证的基于能量的模型（EBM）的方法来推导Riemannian指标的方法，这是一类生成模型，这些模型将低能分配给高密度区域。这些度量标准定义了空间变化的距离，从而实现了测量学的计算 - 遵循数据歧管固有几何形状的最短路径。我们介绍了两个来自EBM的新型指标，并表明它们产生的大地测量学可以保持靠近数据歧管并显示出较低的曲率失真，如与地面轨迹的对齐方式所测量。我们在日益复杂的数据集上评估了我们的方法：具有已知数据密度的合成数据集，具有可解释的几何形状的旋转字符图像以及嵌入预识别的VAE潜在空间中的高分辨率自然图像。我们的结果表明，EBM衍生的指标始终超过建立的基线，尤其是在高维环境中。我们的工作是第一个从EBM中得出Riemannian指标的工作，从而实现了数据感知的测量学和解锁可扩展的，几何学驱动的学习，以生成建模和仿真。
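
Illustration (not from the paper): the abstract does not give the two proposed metrics, so the sketch below assumes a generic conformal metric G(x) = exp(beta * E(x)) * I built from a toy energy that is low on the unit circle. The energy function, `beta`, and the polyline discretization are all assumptions, chosen only to show how an energy-derived metric can make a path along the data manifold shorter than the Euclidean straight line.

```python
import math


def energy(p):
    """Toy energy: zero on the unit circle (the 'data manifold'), larger elsewhere."""
    return (math.hypot(p[0], p[1]) - 1.0) ** 2


def path_length(points, beta=5.0):
    """Length of a polyline under the assumed metric exp(beta * E(x)) * I.

    Each segment's Euclidean length is scaled by exp(beta * E / 2) at its midpoint.
    """
    total = 0.0
    for p, q in zip(points, points[1:]):
        mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
        total += math.exp(0.5 * beta * energy(mid)) * math.dist(p, q)
    return total


n = 100
# Straight chord from (1, 0) to (-1, 0), passing through the high-energy centre.
chord = [(1.0 - 2.0 * i / n, 0.0) for i in range(n + 1)]
# Half circle between the same endpoints, staying in the low-energy region.
arc = [(math.cos(math.pi * i / n), math.sin(math.pi * i / n)) for i in range(n + 1)]

euclid_chord, euclid_arc = path_length(chord, beta=0.0), path_length(arc, beta=0.0)
metric_chord, metric_arc = path_length(chord), path_length(arc)
```

With `beta=0.0` the metric is Euclidean and the chord is shorter; with the energy-weighted metric the arc becomes the shorter path, which is the qualitative behaviour the abstract describes for geodesics that follow the data manifold.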

Title: Decomposition of Water Demand Patterns Using Skewed Gaussian Distributions for Behavioral Insights and Operational Planning

Authors: Roy Elkayam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18245
Pdf URL: https://arxiv.org/pdf/2505.18245
Copy Paste: [[2505.18245]] Decomposition of Water Demand Patterns Using Skewed Gaussian Distributions for Behavioral Insights and Operational Planning(https://arxiv.org/abs/2505.18245)
Keywords: generation
Abstract: This study presents a novel approach for decomposing urban water demand patterns using Skewed Gaussian Distributions (SGD) to derive behavioral insights and support operational planning. Hourly demand profiles contain critical information for both long-term infrastructure design and daily operations, influencing network pressures, water quality, energy consumption, and overall reliability. By breaking down each daily demand curve into a baseline component and distinct peak components, the proposed SGD method characterizes each peak with interpretable parameters, including peak amplitude, timing (mean), spread (duration), and skewness (asymmetry), thereby reconstructing the observed pattern and uncovering latent usage dynamics. This detailed peak-level decomposition enables both operational applications, e.g. anomaly and leakage detection, real-time demand management, and strategic analyses, e.g. identifying behavioral shifts, seasonal influences, or policy impacts on consumption patterns. Unlike traditional symmetric Gaussian or purely statistical time-series models, SGDs explicitly capture asymmetric peak shapes such as sharp morning surges followed by gradual declines, improving the fidelity of synthetic pattern generation and enhancing the detection of irregular consumption behavior. The method is demonstrated on several real-world datasets, showing that SGD outperforms symmetric Gaussian models in reconstruction accuracy, reducing root-mean-square error by over 50% on average, while maintaining physical interpretability. The SGD framework can also be used to construct synthetic demand scenarios by designing daily peak profiles with chosen characteristics. All implementation code is publicly available at: this https URL
摘要：这项研究提出了一种新的方法，用于使用偏斜的高斯分布（SGD）分解城市用水的需求模式，以获得行为见解并支持运营计划。小时需求概况包含长期基础设施设计和日常操作的关键信息，影响网络压力，水质，能源消耗和整体可靠性。通过将每日需求曲线分解为基线成分和不同的峰分量，提出的SGD方法将每个峰都用可解释的参数来表征，包括峰值振幅，时机（平均），差异（持续时间）和偏度（不对称），从而重建了观察到的模式并揭示了潜在的使用动态。这种详细的峰级分解可以实现两个操作应用，例如异常和泄漏检测，实时需求管理和战略分析，例如确定行为转变，季节性影响或政策对消费模式的影响。与传统的对称高斯或纯粹的统计时间序列模型不同，SGD明确捕获了不对称的峰形状，例如急剧的早晨潮流，随后逐渐下降，改善了合成模式产生的保真度并增强了不规则消耗行为的检测。该方法在几个现实世界中的数据集上进行了证明，表明SGD在重建精度中的表现优于对称高斯模型，使根平方误差平均减少了50％以上，同时保持物理可解释性。 SGD框架也可用于通过设计具有选定特征的每日峰值轮廓来构建合成需求方案。所有实施代码均可公开可用：此HTTPS URL
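
Illustration (not the authors' released code): the abstract describes a daily demand curve as a baseline plus peaks, each with an amplitude, timing, spread, and skewness. The sketch below assumes the standard skew-normal shape (a Gaussian bump multiplied by an error-function factor); the parameter values are invented, and `location` is the skew-normal location parameter, which equals the peak time only when `skew` is zero.

```python
import math


def skewed_peak(t, amplitude, location, spread, skew):
    """One skewed-Gaussian peak, proportional to a skew-normal density.

    skew = 0 gives a symmetric Gaussian of height `amplitude`;
    skew > 0 puts the long tail after `location` (sharp rise, gradual decline).
    """
    z = (t - location) / spread
    return amplitude * math.exp(-0.5 * z * z) * (1.0 + math.erf(skew * z / math.sqrt(2.0)))


def demand(t, baseline, peaks):
    """Baseline plus the sum of all peak components at hour t."""
    return baseline + sum(skewed_peak(t, *p) for p in peaks)


# Hypothetical morning and evening peaks: (amplitude, location, spread, skew).
peaks = [(2.0, 6.5, 1.5, 3.0), (1.5, 18.0, 2.0, 1.0)]
hourly_profile = [demand(h, baseline=1.0, peaks=peaks) for h in range(24)]
```

Fitting these parameters to observed data would need an optimizer, which is outside this sketch; choosing them by hand, as here, corresponds to the synthetic-scenario use mentioned at the end of the abstract.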

Title: CONCORD: Concept-Informed Diffusion for Dataset Distillation

Authors: Jianyang Gu, Haonan Wang, Ruoxi Jia, Saeed Vahidian, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18358
Pdf URL: https://arxiv.org/pdf/2505.18358
Copy Paste: [[2505.18358]] CONCORD: Concept-Informed Diffusion for Dataset Distillation(https://arxiv.org/abs/2505.18358)
Keywords: generation, generative
Abstract: Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, while overlooking concept completeness at the instance level. Missing or incorrectly represented object details cannot be efficiently compensated for, because of the small sample budget typical in DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released in this https URL.
摘要：数据集蒸馏（DD）在创建小型数据集方面取得了重大进展，这些数据集封装了来自大型原始信息的丰富信息。特别是，基于生成先验的方法表现出令人鼓舞的性能，同时保持计算效率和跨体积概括。但是，生成过程缺乏每个样本的明确可控性。先前的蒸馏方法主要从整个数据集的角度匹配实际分布，而在实例级别忽略了概念的完整性。由于DD设置中典型的样本数量，由于丢失或错误表示的对象细节无法有效补偿。为此，我们建议将大语言模型（LLMS）的概念理解纳入数据集蒸馏的概念信息扩散（Concord）。具体而言，根据类别标签检索了可区分和细粒度的概念，以告知转换过程并完善基本对象细节。通过整合这些概念，提出的方法显着增强了蒸馏图像生成的可控性和解释性，而无需依赖于预训练的分类器。我们通过在ImageNet-1K及其子集上实现最新性能来证明协和的功效。代码实现在此HTTPS URL中发布。

Title: Applications of Modular Co-Design for De Novo 3D Molecule Generation

Authors: Danny Reidenbach, Filipp Nikitin, Olexandr Isayev, Saee Paliwal
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.18392
Pdf URL: https://arxiv.org/pdf/2505.18392
Copy Paste: [[2505.18392]] Applications of Modular Co-Design for De Novo 3D Molecule Generation(https://arxiv.org/abs/2505.18392)
Keywords: generation, generative
Abstract: De novo 3D molecule generation is a pivotal task in drug discovery. However, many recent geometric generative models struggle to produce high-quality 3D structures, even if they maintain 2D validity and topological stability. To tackle this issue and enhance the learning of effective molecular generation dynamics, we present Megalodon, a family of scalable transformer models. These models are enhanced with basic equivariant layers and trained using a joint continuous and discrete denoising co-design objective. We assess Megalodon's performance on established molecule generation benchmarks and introduce new 3D structure benchmarks that evaluate a model's capability to generate realistic molecular structures, particularly focusing on energetics. We show that Megalodon achieves state-of-the-art results in 3D molecule generation, conditional structure generation, and structure energy benchmarks using diffusion and flow matching. Furthermore, doubling the number of parameters in Megalodon to 40M significantly enhances its performance, generating up to 49x more valid large molecules and achieving energy levels that are 2-10x lower than those of the best prior generative models.
摘要：从头3D分子产生是药物发现中的关键任务。但是，即使它们保持2D有效性和拓扑稳定性，许多最近的几何生成模型即使它们保持高质量的3D结构。为了解决这个问题并增强有效分子产生动力学的学习，我们提出了Megalodon-a可扩展变压器模型家族。这些模型通过基本的模棱两可的层增强，并使用关节连续和离散的denoising共同设计目标进行了训练。我们评估了Megalodon在既定分子生成基准上的性能，并引入了新的3D结构基准，这些基准评估了模型生成逼真的分子结构的能力，尤其是专注于能量学。我们表明，Megalodon实现了最先进的结果，从而可以使用扩散和流量匹配来产生3D分子的产生，条件结构的产生和结构能量基准。此外，将Megalodon的参数数量增加一倍，从而显着提高其性能，产生高达49倍的有效大分子，并达到比最佳先前生成型模型低2-10倍的能量水平。

Title: Taming Diffusion for Dataset Distillation with High Representativeness

Authors: Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18399
Pdf URL: https://arxiv.org/pdf/2505.18399
Copy Paste: [[2505.18399]] Taming Diffusion for Dataset Distillation with High Representativeness(https://arxiv.org/abs/2505.18399)
Keywords: generation
Abstract: Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: this https URL.
摘要：最近的深度学习模型需要更大的数据集，从而促进了数据集蒸馏的需求，以在保持性能的同时创建紧凑的，具有成本效益的数据集。由于扩散的强大图像生成能力，已将其引入该领域以生成蒸馏图像。在本文中，我们系统地研究了基于当前扩散的数据集蒸馏方法中存在的问题，包括不准确的分布匹配，随机噪声和单独的采样。在此基础上，我们提出了D^3HR，这是一种基于扩散的新型框架，以生成具有高代表性的蒸馏数据集。具体而言，我们采用DDIM反转来映射整个数据集的潜在潜伏区域的潜在域到高正常的高斯域，保留信息并确保结构性一致性为蒸馏数据集生成代表性潜伏期。此外，我们提出了一个有效的抽样方案，以更好地使代表性潜伏期与高均衡性高斯分布对齐。我们的全面实验表明，与数据集蒸馏中的最新基准相比，D^3HR可以在不同模型架构上实现更高的精度。源代码：此HTTPS URL。

Title: Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering

Authors: Jessica Tang, Ali Abedi, Tracey J.F. Colella, Shehroz S. Khan
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.18412
Pdf URL: https://arxiv.org/pdf/2505.18412
Copy Paste: [[2505.18412]] Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering(https://arxiv.org/abs/2505.18412)
Keywords: generation, quality assessment
Abstract: Exercise-based rehabilitation improves quality of life and reduces morbidity, mortality, and rehospitalization, though transportation constraints and staff shortages lead to high dropout rates from rehabilitation programs. Virtual platforms enable patients to complete prescribed exercises at home, while AI algorithms analyze performance, deliver feedback, and update clinicians. Although many studies have developed machine learning and deep learning models for exercise quality assessment, few have explored the use of large language models (LLMs) for feedback and are limited by the lack of rehabilitation datasets containing textual feedback. In this paper, we propose a new method in which exercise-specific features are extracted from the skeletal joints of patients performing rehabilitation exercises and fed into pre-trained LLMs. Using a range of prompting techniques, such as zero-shot, few-shot, chain-of-thought, and role-play prompting, LLMs are leveraged to evaluate exercise quality and provide feedback in natural language to help patients improve their movements. The method was evaluated through extensive experiments on two publicly available rehabilitation exercise assessment datasets (UI-PRMD and REHAB24-6) and showed promising results in exercise assessment, reasoning, and feedback generation. This approach can be integrated into virtual rehabilitation platforms to help patients perform exercises correctly, support recovery, and improve health outcomes.
摘要：基于运动的康复改善了生活质量，并降低了发病率，死亡率和重新住院，尽管运输限制和员工短缺导致康复计划的辍学率很高。虚拟平台使患者能够在家中完成规定的练习，而AI算法分析了绩效，提供反馈和更新临床医生。尽管许多研究已经开发了机器学习和锻炼质量评估的深度学习模型，但很少有人探索使用大型语言模型（LLM）进行反馈，并且由于缺乏包含文本反馈的康复数据集而受到限制。在本文中，我们提出了一种新方法，其中从进行康复运动的患者的骨骼关节中提取了运动特定的特征，并将其送入预先训练的LLMS中。使用一系列提示技术，例如零射，很少的，经过思考链和角色扮演提示，LLMS被利用以评估运动质量并提供自然语言的反馈，以帮助患者改善其运动。通过对两个公开可用的康复锻炼评估数据集（UI-PRMD和REBHAB24-6）进行广泛的实验，对该方法进行了评估，并在锻炼评估，推理和反馈生成方面显示出令人鼓舞的结果。该方法可以集成到虚拟康复平台中，以帮助患者正确地进行锻炼，支持康复并改善健康状况。
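
Illustration (not from the paper): the abstract describes extracting exercise-specific features from skeletal joints and feeding them to a pre-trained LLM with zero-shot, few-shot, chain-of-thought, and role-play prompting. The sketch below assumes one such feature, the angle at a joint, and a hypothetical prompt wording; the joint coordinates are invented and no LLM is called.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    cos = max(-1.0, min(1.0, dot / norm))   # clamp against rounding error
    return math.degrees(math.acos(cos))


def build_prompt(exercise, features, examples=()):
    """Assemble a role-play, optionally few-shot, step-by-step prompt (hypothetical wording)."""
    lines = ["You are a physiotherapist reviewing a patient's " + exercise + "."]
    for feats, feedback in examples:        # few-shot demonstrations, if any
        lines.append("Example features: " + feats)
        lines.append("Example feedback: " + feedback)
    lines.append("Features: " + ", ".join(
        "{} = {:.1f} deg".format(name, value) for name, value in features.items()))
    lines.append("Think step by step, rate the repetition as correct or incorrect, "
                 "and give one sentence of feedback.")
    return "\n".join(lines)


hip, knee, ankle = (0.0, 1.0, 0.0), (0.0, 0.5, 0.1), (0.0, 0.0, 0.0)
knee_angle = joint_angle(hip, knee, ankle)
prompt = build_prompt("squat", {"knee angle": knee_angle})
```

Passing an empty `examples` tuple gives the zero-shot variant; supplying (features, feedback) pairs turns the same template into a few-shot prompt.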

Title: TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP

Authors: Yuliang Cai, Jesse Thomason, Mohammad Rostami
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18434
Pdf URL: https://arxiv.org/pdf/2505.18434
Copy Paste: [[2505.18434]] TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP(https://arxiv.org/abs/2505.18434)
Keywords: generation
Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning of CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline in which negation captions are generated during the training stage, adding only 2.5% extra training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
摘要：视觉语言模型（VLM）（例如剪辑）在一系列下游任务中表现出强劲的性能。但是，剪辑仍然受到否定理解的限制：识别概念缺失或排除的能力。现有方法通过使用大型语言模型（LLM）来解决该问题，以生成包含否定的图像标题的大规模数据，以进一步进行微调剪辑。但是，这些方法既有时间和计算密集型，它们的评估通常仅限于图像文本匹配任务。为了扩大视野，我们（1）介绍了培训时间否定数据生成管道，以便在训练阶段生成否定字幕，这仅增加了2.5％的额外训练时间，（2）我们提出了第一个基准测试标准，用于评估包含否定的提示的文本到图像生成模型，以评估模型的模型，评估模型的能力，以评估模型的能力，以评估模型精确的图像图像。我们表明，我们提出的方法TNG-CLIP在图像到文本匹配，文本对图像检索和图像生成的各种否定基准上实现了SOTA性能。

Title: Performance and Generalizability Impacts of Incorporating Geolocation into Deep Learning for Dynamic PM2.5 Estimation

Authors: Morteza Karimzadeh, Zhongying Wang, James L. Crooks
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18461
Pdf URL: https://arxiv.org/pdf/2505.18461
Copy Paste: [[2505.18461]] Performance and Generalizability Impacts of Incorporating Geolocation into Deep Learning for Dynamic PM2.5 Estimation(https://arxiv.org/abs/2505.18461)
Keywords: generation
Abstract: Deep learning models have demonstrated success in geospatial applications, yet quantifying the role of geolocation information in enhancing model performance and geographic generalizability remains underexplored. A new generation of location encoders has emerged with the goal of capturing attributes present at any given location for downstream use in predictive modeling. Because this is a nascent area of research, their evaluation has remained largely limited to static tasks such as species distributions or average temperature mapping. In this paper, we discuss and quantify the impact of incorporating geolocation into deep learning for a real-world application domain that is characteristically dynamic (with fast temporal change) and spatially heterogeneous at high resolutions: estimating surface-level daily PM2.5 levels using remotely sensed and ground-level data. We build on a recently published deep learning-based PM2.5 estimation model that achieves state-of-the-art performance on data observed in the contiguous United States. We examine three approaches for incorporating geolocation: excluding geolocation as a baseline, using raw geographic coordinates, and leveraging pretrained location encoders. We evaluate each approach under within-region (WR) and out-of-region (OoR) evaluation scenarios. Aggregate performance metrics indicate that while naïve incorporation of raw geographic coordinates improves within-region performance by retaining the interpolative value of geographic location, it can hinder generalizability across regions. In contrast, pretrained location encoders like GeoCLIP enhance predictive performance and geographic generalizability for both WR and OoR scenarios. However, qualitative analysis reveals artifact patterns caused by high-degree basis functions and sparse upstream samples in certain areas, and ablation results indicate varying performance among location encoders...
摘要：深度学习模型已经在地理空间应用中表现出成功，但量化地理位置信息在增强模型性能和地理可推广性中的作用仍然没有得到充实。出现了新一代的位置编码器，目的是捕获在预测建模中下游使用的任何给定位置存在的属性。作为一个新生的研究领域，他们的评估仍然在很大程度上仅限于静态任务，例如物种分布或平均温度映射。在本文中，我们讨论并量化了将地理位置纳入深度学习中的影响，以在高分辨率下具有动态性（随时间变化的快速变化）和空间异质性的真实世界应用领域：估计表面级别的PM2.5水平，使用远程感知和地面数据估算表面级别的PM2.5水平。我们建立在最近发表的基于深度学习的PM2.5估计模型的基础上，该模型在连续美国观察到的数据上实现了最新的性能。我们研究了三种用于合并地理位置的方法：使用原始地理坐标排除地理位置作为基线，并利用预审预测的位置编码器。我们在区域内（WR）和区域外（OOR）评估方案下评估每种方法。总体绩效指标表明，虽然原始地理坐标的幼稚结合通过保留地理位置的插值来改善区域内绩效，但它可能会阻碍各个地区的普遍性。相比之下，诸如Geoclip之类的预处理位置编码器可增强WR和OOR场景的预测性能和地理概括性。但是，定性分析揭示了某些领域的高度基础功能和稀疏上游样本引起的伪影模式，消融结果表明位置编码器之间的性能不同。

Title: HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model

Authors: Jingkai Wang, Wu Miao, Jue Gong, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18469
Pdf URL: https://arxiv.org/pdf/2505.18469
Copy Paste: [[2505.18469]] HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model(https://arxiv.org/abs/2505.18469)
Keywords: restoration, generation
Abstract: Face restoration has achieved remarkable advancements over years of development. However, ensuring that restored facial images exhibit high fidelity, preserve authentic features, and avoid introducing artifacts or biases remains a significant challenge. This highlights the need for models that are more "honest" in their reconstruction from low-quality inputs, accurately reflecting original characteristics. In this work, we propose HonestFace, a novel approach designed to restore faces with a strong emphasis on such honesty, particularly concerning identity consistency and texture realism. To achieve this, HonestFace incorporates several key components. First, we propose an identity embedder to effectively capture and preserve crucial identity features from both the low-quality input and multiple reference faces. Second, a masked face alignment method is presented to enhance fine-grained details and textural authenticity, thereby preventing the generation of patterned or overly synthetic textures and improving overall clarity. Furthermore, we present a new landmark-based evaluation metric. Based on affine transformation principles, this metric improves accuracy over conventional L2 distance calculations for facial feature alignment. Leveraging these contributions within a one-step diffusion model framework, HonestFace delivers exceptional restoration results in terms of facial fidelity and realism. Extensive experiments demonstrate that our approach surpasses existing state-of-the-art methods, achieving superior performance in both visual quality and quantitative assessments. The code and pre-trained models will be made publicly available at this https URL.
摘要：在多年来的发展中，面部恢复取得了显着的进步。但是，确保恢复的面部图像表现出高忠诚，保留真实的特征，避免引入人工制品或偏见仍然是一个重大挑战。这凸显了需要从低质量投入的重建中更“诚实”的模型，从而准确地反映了原始特征。在这项工作中，我们提出了诚实的表情，这是一种旨在恢复面孔的新方法，以非常强调这种诚实，尤其是关于身份一致性和纹理现实主义。为了实现这一目标，Hextface合并了几个关键组件。首先，我们提出一个身份嵌入器，以有效地捕获和保留低质量输入和多个参考面的关键身份特征。其次，提出了一种蒙面的面部对准方法，以增强细粒细节和纹理真实性，从而防止产生图案化或过度合成的纹理并提高整体清晰度。此外，我们提出了一个新的基于里程碑的评估指标。基于仿射转化原理，与常规的L2距离计算相比，该指标提高了准确性。在一步扩散模型框架内利用这些贡献，Hextface在面部忠诚和现实主义方面提供了出色的恢复结果。广泛的实验表明，我们的方法超过了现有的最新方法，在视觉质量和定量评估中都取得了卓越的性能。该代码和预培训模型将在此HTTPS URL上公开可用。
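
Illustration (not from the paper): the abstract says the new landmark metric is based on affine transformation principles and improves on a plain L2 distance, but does not define it. The NumPy sketch below shows one plausible reading, assumed here for illustration only: fit the best least-squares affine map from predicted to reference landmarks, then measure the remaining error. The landmark coordinates and function names are hypothetical.

```python
import numpy as np


def raw_l2_error(pred, ref):
    """Mean Euclidean distance between corresponding landmarks, with no alignment."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    return float(np.mean(np.linalg.norm(pred - ref, axis=1)))


def affine_aligned_error(pred, ref):
    """Mean landmark distance after the best least-squares affine map pred -> ref."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    X = np.hstack([pred, np.ones((len(pred), 1))])   # homogeneous coordinates, (N, 3)
    M, *_ = np.linalg.lstsq(X, ref, rcond=None)      # (3, 2) affine parameters
    return float(np.mean(np.linalg.norm(X @ M - ref, axis=1)))


# Toy landmarks: the reference is the prediction rotated by 10 degrees and shifted.
pred = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]])
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
ref = pred @ R.T + np.array([0.3, -0.2])

pose_only_raw = raw_l2_error(pred, ref)              # penalizes the pose change
pose_only_affine = affine_aligned_error(pred, ref)   # close to zero: same face shape
```

Under this reading, a global pose or scale difference no longer counts as a landmark error, so the score reflects shape mismatch rather than misalignment.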

Title: Syn3DTxt: Embedding 3D Cues for Scene Text Generation

Authors: Li-Syun Hsiung, Jun-Kai Tu, Kuan-Wu Chu, Yu-Hsuan Chiu, Yan-Tsung Peng, Sheng-Luen Chung, Gee-Sern Jison Hsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18479
Pdf URL: https://arxiv.org/pdf/2505.18479
Copy Paste: [[2505.18479]] Syn3DTxt: Embedding 3D Cues for Scene Text Generation(https://arxiv.org/abs/2505.18479)
Keywords: generation
Abstract: This study aims to investigate the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches continue to rely on 2D data, sourcing authentic training examples from movie posters and book covers, which limits their ability to capture the complex interactions between spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the necessary geometric cues for accurately embedding text into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristics. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advancements in text rendering under complex 3D-spatial conditions.
摘要：这项研究旨在调查合成数据集中三维环境不足的挑战，以进行场景文本渲染。尽管扩散模型和相关技术的最新进展改善了场景文本生成的某些方面，但大多数现有方法继续依赖2D数据，从电影海报和书籍封面中采购了真实的培训示例，这限制了它们在现实世界中捕获空间布局和视觉效果之间复杂相互作用的能力。特别是，传统的2D数据集没有提供必要的几何提示，以将文本准确地嵌入到不同的背景中。为了解决这一限制，我们提出了一个新的标准，用于构建合成数据集，该数据集结合了表面正态以丰富三维场景特征。通过将表面正常添加到常规2D数据中，我们的方法旨在增强空间关系的表示，并为未来的场景文本渲染方法提供更强大的基础。广泛的实验表明，在此新标准下构建的数据集提供了改进的几何环境，从而促进了在复杂的3D空间条件下文本渲染的进一步进步。

Title: The Prompt is Mightier than the Example

Authors: Shengzhe Xu, Nikhil Muralidhar, Naren Ramakrishnan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18485
Pdf URL: https://arxiv.org/pdf/2505.18485
Copy Paste: [[2505.18485]] The Prompt is Mightier than the Example(https://arxiv.org/abs/2505.18485)
Keywords: generation
Abstract: Numerous recent prompt optimization approaches, such as chain-of-thought, have been demonstrated to significantly improve the quality of content generated by large language models (LLMs). In-context learning (ICL), a recent paradigm where a few representative examples guide content generation, has also led to strong improvements in the quality of LLM-generated content. This idea has been applied to great effect in synthetic tabular data generation, where LLMs, through effective use of ICL and prompt optimization, can generate data that approximate samples from complex, heterogeneous distributions based on representative examples. However, ensuring high-fidelity synthetic data often requires a very large number of ICL examples, which may be unavailable or costly to obtain. At the same time, as LLMs get larger and larger, their in-built prior knowledge becomes vast and can potentially substitute for specific data examples. In this paper, we introduce Knowledge-Guided Prompting (KGP) as a new knob in prompt optimization and explore the ability of KGP-based prompt optimization to offset the cost of ICL. Specifically, we explore the question "how many examples can a prompt substitute for?" and study KGP, where domain knowledge, either inferred or available, is explicitly injected into the prompt, reducing dependence on ICL examples. Our experiments systematically explore the trade-off between ICL and KGP, revealing an empirical scaling law that quantifies how the quality of generated synthetic data varies with increasing domain knowledge and decreasing example count. Our results demonstrate that knowledge-guided prompting can be a scalable alternative, or addition, to in-context examples, unlocking new approaches to synthetic data generation.
摘要：已经证明了许多最近的迅速优化方法，例如思考链，可以显着提高大语模型（LLMS）产生的内容质量。在最近的范式中，在文化学习（ICL）中，一些代表性的示例指南的内容也导致了LLM生成的内容的发电质量的强烈提高。这个想法已应用于合成表格数据生成中的巨大效果，其中LLM通过有效地使用ICL和迅速优化，可以生成数据，这些数据近似于基于代表性示例的复杂，异质分布的样本。但是，确保高保真综合数据通常需要大量的ICL示例，这些示例可能无法获得或昂贵。同时，随着LLM越来越大，其内置的先验知识变得庞大，并且可以代替特定的数据示例。在本文中，我们将知识引导提示（KGP）作为迅速优化的新旋钮介绍，并探索基于KGP的及时及时优化以抵消ICL成本的能力。具体来说，我们探讨了一个问题：“迅速替代了多少个例子？”并探索知识引导的提示（kgp），其中域知识（推断或可用）被明确注入提示中，从而减少了对ICL示例的依赖。我们的实验系统地探索了ICL和KGP之间的权衡，揭示了一种经验缩放定律，该法律量化了生成的合成数据的质量如何随着域知识的增加和减少示例计数而变化。我们的结果表明，知识引导的提示可以是对秘密示例的可扩展替代方案，也可以是添加的替代方案，从而解锁了合成数据生成的新方法。

Title: Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking

Authors: Chen-Hao Chao, Wei-Fang Sun, Hanwen Liang, Chun-Yi Lee, Rahul G. Krishnan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18495
Pdf URL: https://arxiv.org/pdf/2505.18495
Copy Paste: [[2505.18495]] Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking(https://arxiv.org/abs/2505.18495)
Keywords: generative
Abstract: Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.
摘要：蒙版扩散模型（MDM）是用于离散数据的强大生成模型，该模型通过逐渐揭示序列来生成样品的生成样品。每个令牌都可以采用两个状态之一：蒙面或揭露。我们观察到令牌序列在连续采样步骤之间通常保持不变。因此，该模型反复处理相同的输入，从而导致冗余计算。为了解决此效率低下，我们提出了部分掩盖方案（PRIME），该方案通过允许令牌允许将MDM提高MDM，从而将中间状态插入掩盖状态和未掩盖状态之间。该设计使该模型能够根据部分观察到的令牌信息进行预测，并促进细粒度的转化过程。我们得出一个变异训练目标，并引入了简单的建筑设计，以适应中等状态的输入。我们的方法证明了各种生成建模任务的卓越性能。在文本数据上，它在OpenWebText上达到了15.36的困惑，表现优于先前的MDM（21.52），自回旋模型（17.54）及其混合变体（17.58），而无需依赖自动性表述。在图像数据上，它在Imagenet-32上获得了CIFAR-10的竞争性FID得分为3.26，与领先的连续生成模型相当。
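
Note: the abstract above describes tokens that can sit in intermediate states between masked and unmasked. One possible way to picture such a state is sketched below. The abstract does not specify the encoding, so the base-b digit decomposition and the helper names (to_subtokens, partially_mask, compatible_tokens) are assumptions made for illustration only, not the authors' implementation.

```python
from itertools import product

MASK = None  # placeholder for a hidden sub-token (illustrative choice)


def to_subtokens(token_id, base, length):
    """Write token_id as `length` base-`base` digits, most significant first."""
    digits = []
    for _ in range(length):
        digits.append(token_id % base)
        token_id //= base
    if token_id:
        raise ValueError("token_id does not fit in the requested number of digits")
    return digits[::-1]


def partially_mask(subtokens, keep):
    """Hide the digits whose keep flag is False; the rest stay observed."""
    return [d if k else MASK for d, k in zip(subtokens, keep)]


def compatible_tokens(partial, base):
    """List every token id that agrees with the observed digits."""
    choices = [range(base) if d is MASK else [d] for d in partial]
    result = []
    for digits in product(*choices):
        value = 0
        for d in digits:
            value = value * base + d
        result.append(value)
    return result


digits = to_subtokens(5, base=2, length=3)           # [1, 0, 1]
partial = partially_mask(digits, [True, False, False])
candidates = compatible_tokens(partial, base=2)      # ids still possible
```

In this toy picture, a fully masked token leaves every id possible, a fully unmasked token leaves exactly one, and a partially masked token narrows the candidates without fixing them.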

Title: Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

Authors: Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Fenglong Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18503
Pdf URL: https://arxiv.org/pdf/2505.18503
Copy Paste: [[2505.18503]] Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning(https://arxiv.org/abs/2505.18503)
Keywords: generation
Abstract: Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
摘要：医学大型视觉模型（MED-LVLM）经常在视觉输入上表现出次优的注意力分布，从而导致幻觉或不准确的输出。现有的缓解方法主要依赖于推理时间干预措施，这些干预措施受到关注适应或需要其他监督。为了解决这个问题，我们提出了一个$^3 $ Tune，这是一个新颖的微调框架，用于自动注意调整。 $^3 $ TUNE利用SAM的零拍弱标签，使用BiomedClip将其改进到及时感知的标签中，然后选择性地修改视觉关注的注意力头，以改善对齐方式，同时最大程度地减少干扰。此外，我们引入了一个$^3 $ MOE模块，从而为各种提示和图像的注意调整提供了自适应参数选择。关于医疗VQA和报告生成基准测试的广泛实验表明，$^3 $ tune的表现优于最先进的基线，从而在Med-LVLMS中实现了增强的注意力分布和性能。

Title: Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility

Authors: Yiheng Li, Feng Liang, Dan Kondratyuk, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18521
Pdf URL: https://arxiv.org/pdf/2505.18521
Copy Paste: [[2505.18521]] Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility(https://arxiv.org/abs/2505.18521)
Keywords: generation, generative
Abstract: The substantial training cost of diffusion models hinders their deployment. Immiscible Diffusion recently showed that reducing diffusion trajectory mixing in the noise space via linear assignment accelerates training by simplifying denoising. To extend immiscible diffusion beyond linear assignment, which is inefficient at large batch sizes and high dimensions, we refine this concept to a broader miscibility reduction at any layer and by any implementation. Specifically, we empirically demonstrate the bijective nature of the denoising process with respect to immiscible diffusion, ensuring its preservation of generative diversity. Moreover, we provide thorough analysis and show step-by-step how immiscibility eases denoising and improves efficiency. Extending beyond linear assignment, we propose a family of implementations including K-nearest neighbor (KNN) noise selection and image scaling to reduce miscibility, achieving more than 4x faster training across diverse models and tasks including unconditional/conditional generation, image editing, and robotics planning. Furthermore, our analysis of immiscibility offers a novel perspective on how optimal transport (OT) enhances diffusion training. By identifying trajectory miscibility as a fundamental bottleneck, we believe this work establishes a potentially new direction for future research into high-efficiency diffusion training. The code is available at this https URL.
摘要：扩散模型的大量培训成本阻碍了他们的部署。不混溶的扩散最近表明，通过线性分配减少噪声空间中的扩散轨迹混合，通过简化DeNOSIS来加速训练。为了将不混溶的扩散扩散到高批量和高维度下的线性分配效率低下，我们将此概念完善，以在任何层和任何实施中降低更广泛的混乱性。具体而言，我们从经验上证明了在不可分割的扩散方面的denoising过程的界限性质，从而确保了其生成多样性的保存。此外，我们提供了彻底的分析，并逐步展示了不可便当度如何减轻并提高效率。延伸超越线性分配，我们提出了一系列实施，包括K-Neartimen Neighbor（KNN）噪声选择和图像缩放，以降低混溶性，在各种模型和任务中实现高达4倍的培训，包括无条件/条件/有条件的生成，图像编辑和机器人技术计划。此外，我们对不信用性的分析为最佳运输（OT）如何增强扩散训练提供了新的观点。通过将轨迹混杂性确定为一种基本瓶颈，我们认为这项工作为未来研究高效扩散训练的研究提供了一个新的方向。该代码可在此HTTPS URL上找到。
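
Note: the KNN noise selection mentioned in the abstract above can be pictured with a small sketch. It assumes the simplest reading of the idea: draw k Gaussian noise candidates per image and train on the candidate closest to that image. The function names, the squared-Euclidean distance, and the flat-list image representation are illustrative assumptions, not taken from the paper's code.

```python
import random


def sample_candidates(dim, k, rng):
    """Draw k standard-Gaussian noise vectors of length dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def select_nearest(image, candidates):
    """Return the candidate with the smallest squared distance to image."""
    def sq_dist(noise):
        return sum((a - b) ** 2 for a, b in zip(image, noise))
    return min(candidates, key=sq_dist)


def knn_noise_select(image, k, rng):
    """Pick the noise that will be paired with image during training."""
    return select_nearest(image, sample_candidates(len(image), k, rng))


rng = random.Random(0)
image = [0.5, -0.2, 0.1, 0.9]            # toy flattened "image"
noise = knn_noise_select(image, k=8, rng=rng)
```

With k=1 this reduces to ordinary independent noise sampling; larger k biases each image toward nearby noise, which is the intuition behind reducing trajectory mixing.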

Title: Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning

Authors: Wenbo He, Zhijian Ou
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18558
Pdf URL: https://arxiv.org/pdf/2505.18558
Copy Paste: [[2505.18558]] Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning(https://arxiv.org/abs/2505.18558)
Keywords: generative
Abstract: Our examination of existing deep generative models (DGMs), including VAEs and GANs, reveals two problems. First, their capability in handling discrete observations and latent codes is unsatisfactory, though there are interesting efforts. Second, both VAEs and GANs optimize criteria that are only indirectly related to the data likelihood. To address these problems, we formally present Joint-stochastic-approximation (JSA) autoencoders, a new family of algorithms for building deep directed generative models, with application to semi-supervised learning. The JSA learning algorithm directly maximizes the data log-likelihood and simultaneously minimizes the inclusive KL divergence between the posterior and the inference model. We provide theoretical results and conduct a series of experiments to show its superiority, such as robustness to structure mismatch between encoder and decoder and consistent handling of both discrete and continuous variables. In particular, we empirically show that JSA autoencoders with discrete latent space achieve comparable performance to other state-of-the-art DGMs with continuous latent space in semi-supervised tasks over the widely adopted datasets MNIST and SVHN. To the best of our knowledge, this is the first demonstration that discrete latent variable models are successfully applied in the challenging semi-supervised tasks.
摘要：我们对包括VAE和GAN在内的现有深层生成模型（DGM）的检查发现了两个问题。首先，尽管有有趣的努力，但它们在处理离散观测和潜在代码方面的能力并不令人满意。其次，VAE和GAN都优化了与数据可能性间接相关的一些标准。为了解决这些问题，我们正式呈现联合传播应用程序（JSA）自动编码器 - 一种用于构建深层定向生成模型的新算法系列，并应用于半监督学习。 JSA学习算法直接最大化数据对数的样本，并同时最大程度地减少了包含性的KL差异，后者和推理模型之间的差异。我们提供理论结果并进行一系列实验，以显示其优越性，例如在编码器和解码器之间结构不匹配，同时处理离散变量和连续变量。尤其是我们从经验上表明，具有离散潜在空间的JSA自动编码器与其他最先进的DGM相当，在广泛采用的数据集（MNIST和SVHN）上，在半监督任务中具有连续的潜在空间可比。据我们所知，这是第一次证明离散的潜在变量模型成功地应用于具有挑战性的半监督任务中。

Title: On Denoising Walking Videos for Gait Recognition

Authors: Dongyang Jin, Chao Fan, Jingzhe Ma, Jingkai Zhou, Weihua Chen, Shiqi Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18582
Pdf URL: https://arxiv.org/pdf/2505.18582
Copy Paste: [[2505.18582]] On Denoising Walking Videos for Gait Recognition(https://arxiv.org/abs/2505.18582)
Keywords: generative
Abstract: To capture individual gait patterns, excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations. Code is available at this https URL.
摘要：为了捕捉单个步态模式，在步行视频中不包括身份 - 艾尔特尔略有线索（例如服装纹理和颜色）仍然是基于视觉步态识别的持续挑战。传统的剪影和基于姿势的方法虽然在理论上有效地消除了这种干扰，但由于其稀疏且信息性较低的输入，通常不准确。新兴的端到端方法通过直接使用人类先验来直接降低RGB视频来解决此问题。在这一趋势的基础上，我们提出了一种新型步态denoising方法DeNoisingGait。受到“我无法创造的东西，我不了解”的哲学启发，我们转向生成的扩散模型，发现它们如何部分过滤出无关紧要的步态理解因素。此外，我们引入了一个几何驱动的特征匹配模块，该模块与通过人类轮廓的背景拆除结合使用，将每个前景像素的多通道扩散特征凝结成两个通道方向向量。具体而言，所提出的内部和跨框架匹配分别捕获了步态外观和运动的局部矢量化结构，从而产生了一种新型的流体样步态表示，称为步态特征场，从而进一步降低了扩散特征中的残留噪声。 CCPG，CASIA-B*和Sustech1K数据集的实验表明，在大多数情况下，对于内部和跨域评估，DeNoisingGait在大多数情况下都能达到新的SOTA性能。代码可在此HTTPS URL上找到。

Title: EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

Authors: GuangHao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang, Rui Zhang, Yong Jiang
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2505.18594
Pdf URL: https://arxiv.org/pdf/2505.18594
Copy Paste: [[2505.18594]] EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models(https://arxiv.org/abs/2505.18594)
Keywords: generative
Abstract: Vision-language retrieval (VLR), which involves using text (or images) as queries to retrieve corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
摘要：Vision语言检索（VLR）在学术界和行业中都引起了极大的关注，其中涉及使用文本（或图像）作为查询来检索相应的图像（或文本）。但是，现有方法经常忽略丰富的实体视觉语义知识，从而导致错误的检索结果。为了解决这个问题，我们提出了实体视觉描述增强剪辑（EVDCLIP），旨在利用实体的视觉知识来丰富查询。具体而言，由于人类通过视觉提示识别实体，因此我们采用大型语言模型（LLM）来生成实体视觉描述（EVD）作为对齐文本数据的一致性提示。然后将这些EVD集成到原始查询中，以创建视觉富含EVD的启动查询。此外，认识到EVD增强的查询可能会引入噪音或低质量的扩展，因此我们为视觉检索任务开发了一种新颖的，可训练的EVD感知重写器（EARW）。 Earw利用EVD知识和语言模型的生成能力来有效地重写查询。通过我们的专业培训策略，EARW可以产生高质量和低噪声EVD增强的查询。对图像文本检索基准的广泛定量和定性实验验证了EVDCLIP对视觉检索任务的优越性。

Title: Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Authors: Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18600
Pdf URL: https://arxiv.org/pdf/2505.18600
Copy Paste: [[2505.18600]] Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment(https://arxiv.org/abs/2505.18600)
Keywords: super-resolution
Abstract: Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity.
摘要：现代的单像超分辨率（SISR）模型以训练的规模因素提供了光现实的结果，但是当被要求放大远远超出该制度时，倒塌了。我们使用Zoom（COZ）（COZ）来解决这种可扩展性瓶颈，这是一个模型不合时宜的框架，将SISR分配到具有多尺度意识的提示的中间尺度状态的自动回归链中。 COZ反复重新使用骨干SR模型，将条件概率分解为可拖动的子问题，以实现极端的分辨率，而无需其他训练。由于视觉提示在高宏伟速度下会减少，因此我们通过视觉模型（VLM）生成的多尺度感知文本提示来增强每个变焦步骤。使用普通奖励政策优化（GRPO）和评论家VLM对及时提取器本身进行微调，从而使文本指导对人类的喜好保持一致。实验表明，包裹在COZ中的标准4倍扩散SR模型达到了256倍扩大，具有高感知质量和忠诚度。
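
Note: the arithmetic behind the abstract above is that reapplying a 4x model four times gives 4^4 = 256x. The sketch below shows that schedule and the generic shape of a zoom loop. The callables sr_step and make_prompt are hypothetical stand-ins for the backbone SR model and the VLM prompt extractor; their signatures are assumptions made for illustration.

```python
def zoom_schedule(base_scale, target_scale):
    """Cumulative magnifications reached by reapplying a base_scale model."""
    if base_scale <= 1:
        raise ValueError("base_scale must be greater than 1")
    schedule = []
    total = 1
    while total < target_scale:
        total *= base_scale
        schedule.append(total)
    return schedule


def chain_of_zoom(state, sr_step, make_prompt, num_steps):
    """Apply the same SR step repeatedly, with a fresh prompt at each scale."""
    for _ in range(num_steps):
        prompt = make_prompt(state)      # hypothetical scale-aware text prompt
        state = sr_step(state, prompt)   # hypothetical backbone SR call
    return state


steps = zoom_schedule(4, 256)

# Toy stand-ins: the "image" is just its magnification factor.
final = chain_of_zoom(
    1,
    sr_step=lambda s, prompt: s * 4,
    make_prompt=lambda s: "detail at %dx" % s,
    num_steps=len(steps),
)
```

The toy state tracks only the magnification; in a real pipeline each step would operate on image data, and the stand-in callables would be replaced by actual models.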

Title: Rethinking Causal Mask Attention for Vision-Language Inference

Authors: Xiaohuan Pei, Tao Huang, YanXiang Ma, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18605
Pdf URL: https://arxiv.org/pdf/2505.18605
Copy Paste: [[2505.18605]] Rethinking Causal Mask Attention for Vision-Language Inference(https://arxiv.org/abs/2505.18605)
Keywords: generative
Abstract: Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
摘要：因果关注已成为自回归视觉语言模型（VLM）的基础机制，在单个生成框架下统一了文本和视觉输入。但是，现有的基于因果面具的策略是从大型语言模型（LLMS）继承的，它们是针对仅文本解码的，在预填充阶段，它们对视觉令牌的适应性不足。严格掩盖视力查询的未来位置会引入过度严格的约束，这阻碍了模型利用通常包含基本语义提示的未来上下文的能力，以进行准确的推论。在这项工作中，我们从经验上研究了不同的因果掩蔽策略如何影响视觉推论，然后提出了针对这种环境量身定制的未来感知的关注家族。我们首先凭经验分析了预览未来令牌以获取视觉查询的效果，并证明刚性掩盖会破坏该模型捕获有用的上下文语义表示的能力。基于这些发现，我们提出了一个轻巧的关注家族，该家族通过合并，将未来的视觉上下文汇总到过去的表示形式中，有效地保留自回归结构，同时增强交叉依赖性。我们评估了各种视力语言推理环境中的一系列因果面具，并表明将未来的语义上下文选择性地压缩为过去表示有益于推理。

Title: Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Authors: Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18612
Pdf URL: https://arxiv.org/pdf/2505.18612
Copy Paste: [[2505.18612]] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter(https://arxiv.org/abs/2505.18612)
Keywords: generation
Abstract: Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most methods are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pretrained Diffusion Transformer (DiT) models, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation directions for the modulation process of concept-related text tokens. It incorporates vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pretraining strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.
摘要：个性化的文本到图像生成旨在综合不同情况下用户提供的概念的图像。尽管在多概念个性化方面取得了最新进展，但大多数人限于对象概念和努力自定义抽象概念（例如，姿势，照明）。一些方法已经开始探索支持抽象概念的多概念个性化，但是它们需要为每个新概念进行测试时间进行微调，这很耗时且容易过度适应有限的培训图像。在这项工作中，我们提出了一种用于多概念个性化的新颖无调方法，可以有效地自定义对象和抽象概念而无需测试时间进行微调。我们的方法建立在预处理扩散变压器（DITS）模型中的调制机制上，利用调制空间的局部和语义意义的特性。具体而言，我们提出了一个新型模块，以预测与概念相关的文本令牌调制过程的特定概念调制方向。它结合了视觉交叉注意，用于提取概念的视觉特征，以及将概念特征自适应地映射到调制空间中的特征的混合物（MOE）层。此外，为了减轻概念图像空间和调制空间之间较大差距造成的训练难度，我们引入了VLM引导的训练术策略，以利用视觉模型的强大图像理解能力来提供语义监督信号。为了进行全面的比较，我们通过合并抽象概念来扩展标准基准。我们的方法在多概念个性化中实现了最先进的表现，并得到了定量，定性和人类评估的支持。

Title: Flow Matching for Geometric Trajectory Simulation

Authors: Kiet Bennema ten Brinke, Koen Minartz, Vlado Menkovski
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18647
Pdf URL: https://arxiv.org/pdf/2505.18647
Copy Paste: [[2505.18647]] Flow Matching for Geometric Trajectory Simulation(https://arxiv.org/abs/2505.18647)
Keywords: generative
Abstract: The simulation of N-body systems is a fundamental problem with applications in a wide range of fields, such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances based on deep generative modeling and geometric deep learning have enabled probabilistic simulation by modeling complex distributions over trajectories while respecting the permutation symmetry that is fundamental to N-body systems. However, to generate realistic trajectories, existing methods must learn complex transformations starting from uninformed noise and do not allow for the exploitation of domain-informed priors. In this work, we propose STFlow to address this limitation. By leveraging flow matching and data-dependent couplings, STFlow facilitates physics-informed simulation of geometric trajectories without sacrificing model expressivity or scalability. Our evaluation on N-body dynamical systems, molecular dynamics, and pedestrian dynamics benchmarks shows that STFlow produces significantly lower prediction errors while enabling more efficient inference, highlighting the benefits of employing physics-informed prior distributions in probabilistic geometric trajectory modeling.
摘要：N体系统的仿真是在各个领域的应用，例如分子动力学，生物化学和行人动力学的基本问题。机器学习已成为用于扩展基于物理的模拟器并直接从实验数据开发模型的宝贵工具。特别是，基于深层生成建模和几何深度学习的最新进展通过对轨迹上的复杂分布进行建模，同时尊重对N体系基本的置换对称性，从而实现了概率模拟。但是，为了产生逼真的轨迹，现有方法必须学习从不知情的噪声开始的复杂变换，并且不允许对域信息的先验进行开发。在这项工作中，我们提出了STFLOW来解决此限制。通过利用流量匹配和数据依赖性耦合，STFlow促进了对几何轨迹的物理信息模拟，而无需牺牲模型表达性或可伸缩性。我们对N体动力学系统，分子动力学和行人动力学基准的评估表明，STFLOD会产生明显较低的预测误差，同时可以更有效地推断，从而强调了在概率几何轨迹轨迹建模中采用物理学知识的先验分布的益处。
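
Note: for readers unfamiliar with flow matching, the sketch below shows the standard linear-path construction together with one simple reading of a "data-dependent coupling", where the source sample is drawn around the last observed state instead of from uninformed noise. This is a generic illustration under those assumptions, not STFlow's actual formulation; the function names and the Gaussian perturbation are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, which a model would be regressed onto."""
    return [b - a for a, b in zip(x0, x1)]


def informed_prior_sample(last_observed, sigma, rng):
    """Assumed informed prior: the last observed state plus small Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in last_observed]


rng = random.Random(0)
last_observed = [0.0, 1.0]     # toy particle position at the current time
future = [0.4, 1.3]            # toy ground-truth position to be predicted

x0 = informed_prior_sample(last_observed, sigma=0.1, rng=rng)
t = rng.random()
x_t = interpolate(x0, future, t)
u_t = target_velocity(x0, future)
# A training step would minimise || model(x_t, t) - u_t ||^2 over many samples.
```

Starting from a sample near the last observed state means the path to the target is short, which is one intuition for why an informed prior can ease learning compared with starting from pure noise.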

Title: SuperGS: Consistent and Detailed 3D Super-Resolution Scene Reconstruction via Gaussian Splatting

Authors: Shiyun Xie, Zhiru Wang, Yinghao Zhu, Xu Wang, Chengwei Pan, Xiwang Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18649
Pdf URL: https://arxiv.org/pdf/2505.18649
Copy Paste: [[2505.18649]] SuperGS: Consistent and Detailed 3D Super-Resolution Scene Reconstruction via Gaussian Splatting(https://arxiv.org/abs/2505.18649)
Keywords: super-resolution
Abstract: Recently, 3D Gaussian Splatting (3DGS) has excelled in novel view synthesis (NVS) with its real-time rendering capabilities and superior quality. However, it encounters challenges for high-resolution novel view synthesis (HRNVS) due to the coarse nature of primitives derived from low-resolution input views. To address this issue, we propose SuperGS, an expansion of Scaffold-GS designed with a two-stage coarse-to-fine training framework. In the low-resolution stage, we introduce a latent feature field to represent the low-resolution scene, which serves as both the initialization and foundational information for super-resolution optimization. In the high-resolution stage, we propose a multi-view consistent densification strategy that backprojects high-resolution depth maps based on error maps and employs a multi-view voting mechanism, mitigating ambiguities caused by multi-view inconsistencies in the pseudo labels provided by 2D prior models while avoiding Gaussian redundancy. Furthermore, we model uncertainty through variational feature learning and use it to guide further scene representation refinement and adjust the supervisory effect of pseudo-labels, ensuring consistent and detailed scene reconstruction. Extensive experiments demonstrate that SuperGS outperforms state-of-the-art HRNVS methods on both forward-facing and 360-degree datasets.
摘要：最近，3D高斯脱落（3DGS）具有实时渲染能力和卓越的质量，在新型视图合成（NVS）方面表现出色。然而，由于低分辨率输入视图的原始性质的粗糙性质，它遇到了高分辨率小说合成（HRNV）的挑战。为了解决这个问题，我们提出了Supergs，这是使用两阶段的粗到精细训练框架设计的脚手架GS的扩展。在低分辨率阶段，我们引入了一个潜在特征字段来表示低分辨率场景，该场景既是超分辨率优化的初始化和基础信息。在高分辨率阶段，我们提出了一种多视图一致的致密策略，该策略基于误差图，基于误差图，采用了高分辨率深度图，并采用了多视图投票机制，从而减轻由2D先验模型提供的伪造型在避免使用的pseudo Lab中，避免了2D先验模型，避免避免竞技场。此外，我们通过各种特征学习对不确定性进行建模，并使用它来指导场景表示改进并调整伪标签的监督效果，从而确保一致且详细的场景重建。广泛的实验表明，在朝前和360度数据集上，Supergs的表现优于最先进的HRNV方法。

Title: ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos

Authors: Xiaodong Wang, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18650
Pdf URL: https://arxiv.org/pdf/2505.18650
Copy Paste: [[2505.18650]] ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos(https://arxiv.org/abs/2505.18650)
Keywords: generation
Abstract: Real-world driving requires people to observe the current environment, anticipate the future, and make appropriate driving decisions. This requirement aligns well with the capabilities of world models, which understand the environment and predict the future. However, recent world models in autonomous driving are built explicitly, where they could predict the future by controllable driving video generation. We argue that driving world models should have two additional abilities: action control and action prediction. Following this line, previous methods are limited because predicting the video requires given actions of the same length as the video, and because they ignore the dynamical action laws. To address these issues, we propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions. Our world model has an action module that learns latent actions from the present to the future period given the action sequence and observations, and a diffusion-model-based transition module that learns the state distribution. The model is jointly trained by learning latent actions given finite states and predicting action and video. The joint learning connects the action dynamics and states and enables long-term future prediction. We evaluate our method in video generation and action prediction tasks on the nuScenes dataset. Compared to the state-of-the-art methods, our method achieves the best video consistency and best action prediction accuracy, while also enabling high-quality long-term video and action generation.
摘要：现实世界驾驶要求人们观察当前的环境，预测未来并做出适当的驾驶决策。这一要求与世界模型的能力很好地保持一致，这些世界模型可以理解环境并预测未来。但是，自动驾驶中最近的世界模型是明确构建的，他们可以通过可控的驾驶视频生成来预测未来。我们认为，驾驶世界模型应该具有两个额外的能力：行动控制和行动预测。遵循这一行，以前的方法受到限制，因为它们预测视频需要给定与视频相同长度的动作，并忽略动态动作定律。为了解决这些问题，我们提出了ProphetDWM，这是一种新颖的端到端驾驶世界模型，共同预测未来的视频和行动。我们的世界模型具有一个动作模块，可以通过给出动作序列和观察来学习从当前到未来时期的潜在行动。以及基于扩散模型的过渡模块，以学习状态分布。该模型是通过学习有限状态并预测行动和视频的潜在行动共同训练的。联合学习将行动动态和状态联系起来，并实现了长期的未来预测。我们在Nuscenes数据集上评估了视频生成和动作预测任务的方法。与最先进的方法相比，我们的方法实现了最佳的视频一致性和最佳动作预测准确性，同时还可以实现高质量的长期视频和动作生成。

Title: So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection

Authors: Zhenglin Huang, Tianxiao Li, Xiangtai Li, Haiquan Wen, Yiwei He, Jiangning Zhang, Hao Fei, Xi Yang, Xiaowei Huang, Bei Peng, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18660
Pdf URL: https://arxiv.org/pdf/2505.18660
Copy Paste: [[2505.18660]] So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection(https://arxiv.org/abs/2505.18660)
Keywords: generative
Abstract: Recent advances in AI-powered generative models have enabled the creation of increasingly realistic synthetic images, posing significant risks to information integrity and public trust on social media platforms. While robust detection frameworks and diverse, large-scale datasets are essential to mitigate these risks, existing academic efforts remain limited in scope: current datasets lack the diversity, scale, and realism required for social media contexts, while detection methods struggle with generalization to unseen generative technologies. To bridge this gap, we introduce So-Fake-Set, a comprehensive social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and photorealistic imagery synthesized using 35 state-of-the-art generative models. To rigorously evaluate cross-domain robustness, we establish a novel and large-scale (100K) out-of-domain benchmark (So-Fake-OOD) featuring synthetic imagery from commercial models explicitly excluded from the training distribution, creating a realistic testbed for evaluating real-world performance. Leveraging these resources, we present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales. Extensive experiments show that So-Fake-R1 outperforms the second-best method, with a 1.3% gain in detection accuracy and a 4.5% increase in localization IoU. By integrating a scalable dataset, a challenging OOD benchmark, and an advanced detection framework, this work establishes a new foundation for social media-centric forgery detection research. The code, models, and datasets will be released publicly.

Title: DVD-Quant: Data-free Video Diffusion Transformers Quantization

Authors: Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18663
Pdf URL: https://arxiv.org/pdf/2505.18663
Copy Paste: [[2505.18663]] DVD-Quant: Data-free Video Diffusion Transformers Quantization(https://arxiv.org/abs/2505.18663)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at this https URL.
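
For readers unfamiliar with the W4A4 notation (4-bit weights, 4-bit activations), the following is a minimal sketch of a plain symmetric per-tensor 4-bit quantizer. It is not the PBQ, ARQ, or $\delta$-GBS procedure from the abstract, whose details are not given here; the function names `quantize_int4` and `dequantize` and the choice of the range [-7, 7] are assumptions made for illustration.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization: returns integer codes and the scale."""
    qmax = 7  # assumed symmetric signed 4-bit range [-7, 7]
    scale = float(np.max(np.abs(x))) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reproduces it exactly
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to floating point."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8)).astype(np.float32)  # toy weight tensor
codes, scale = quantize_int4(weights)
max_err = float(np.max(np.abs(weights - dequantize(codes, scale))))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

Because the scale is set from the largest magnitude, no value is clipped and the rounding error of each element is at most half a quantization step. A single outlier inflates the scale for the whole tensor, which is the kind of error that calibration-free schemes have to control by other means.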

Title: ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

Authors: Zhen Li, Yukai Guo, Duan Li, Xinyuan Guo, Bowen Li, Lanxi Xiao, Shenyu Qiao, Jiashu Chen, Zijian Wu, Hui Zhang, Xinhuan Shu, Shixia Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.18668
Pdf URL: https://arxiv.org/pdf/2505.18668
Copy Paste: [[2505.18668]] ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation(https://arxiv.org/abs/2505.18668)
Keywords: generation
Abstract: Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.

Title: Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?

Authors: Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18672
Pdf URL: https://arxiv.org/pdf/2505.18672
Copy Paste: [[2505.18672]] Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?(https://arxiv.org/abs/2505.18672)
Keywords: generation
Abstract: Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process, which first identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates while maintaining strong performance on regular tasks such as math and code generation.

Title: Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model

Authors: Peng Xiao, Hongbo Zhao, Yijun Wang, Jianxin Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18674
Pdf URL: https://arxiv.org/pdf/2505.18674
Copy Paste: [[2505.18674]] Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model(https://arxiv.org/abs/2505.18674)
Keywords: restoration, generative
Abstract: Restoring real-world degraded images, such as old photographs or low-resolution images, presents a significant challenge due to the complex, mixed degradations they exhibit, such as scratches, color fading, and noise. Recent data-driven approaches have struggled with two main challenges: achieving high-fidelity restoration and providing object-level control over colorization. While diffusion models have shown promise in generating high-quality images with specific controls, they often fail to fully preserve image details during restoration. In this work, we propose an internal detail-preserving diffusion model for high-fidelity restoration of real-world degraded images. Our method utilizes a pre-trained Stable Diffusion model as a generative prior, eliminating the need to train a model from scratch. Central to our approach is the Internal Image Detail Enhancement (IIDE) technique, which directs the diffusion model to preserve essential structural and textural information while mitigating degradation effects. The process starts by mapping the input image into a latent space, where we inject the diffusion denoising process with degradation operations that simulate the effects of various degradation factors. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art models in both qualitative assessments and perceptual quantitative evaluations. Additionally, our approach supports text-guided restoration, enabling object-level colorization control that mimics the expertise of professional photo editing.

Title: Manifold-aware Representation Learning for Degradation-agnostic Image Restoration

Authors: Bin Ren, Yawei Li, Xu Zheng, Yuqian Fu, Danda Pani Paudel, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18679
Pdf URL: https://arxiv.org/pdf/2505.18679
Copy Paste: [[2505.18679]] Manifold-aware Representation Learning for Degradation-agnostic Image Restoration(https://arxiv.org/abs/2505.18679)
Keywords: restoration
Abstract: Image Restoration (IR) aims to recover high-quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low-light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a unified and lightweight framework for all-in-one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches, each processed by a specialized module: attention for global context, convolution for local textures, and MLP for channel-wise statistics. This modular decomposition significantly improves generalization and efficiency across diverse degradations. Furthermore, we introduce a cross-layer contrastive learning scheme that aligns shallow and latent features to enhance the discriminability of shared representations. To better capture the underlying geometry of feature representations, we perform contrastive learning in a Symmetric Positive Definite (SPD) manifold space rather than the conventional Euclidean space. Extensive experiments show that MIRAGE not only achieves new state-of-the-art performance across a variety of degradation types but also offers a scalable solution for challenging all-in-one IR scenarios. Our code and models will be publicly available at this https URL.

Title: Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

Authors: Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18730
Pdf URL: https://arxiv.org/pdf/2505.18730
Copy Paste: [[2505.18730]] Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation(https://arxiv.org/abs/2505.18730)
Keywords: generation
Abstract: Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available in this https URL.

Title: Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation

Authors: Xiaolu Chen, Chenghao Huang, Yanru Zhang, Hao Wang
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2505.18755
Pdf URL: https://arxiv.org/pdf/2505.18755
Copy Paste: [[2505.18755]] Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation(https://arxiv.org/abs/2505.18755)
Keywords: generation
Abstract: With the proliferation of smart grids, smart cities face growing challenges due to cyber-attacks and sophisticated electricity theft behaviors, particularly in residential photovoltaic (PV) generation systems. Traditional Electricity Theft Detection (ETD) methods often struggle to capture complex temporal dependencies and integrate multi-source data, limiting their effectiveness. In this work, we propose an efficient ETD method that accurately identifies fraudulent behaviors in residential PV generation, thus ensuring the supply-demand balance in smart cities. Our hybrid deep learning model, combining multi-scale Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer, excels in capturing both short-term and long-term temporal dependencies. Additionally, we introduce a data embedding technique that seamlessly integrates time-series data with discrete temperature variables, enhancing detection robustness. Extensive simulation experiments using real-world data validate the effectiveness of our approach, demonstrating significant improvements in the accuracy of detecting sophisticated energy theft activities, thereby contributing to the stability and fairness of energy systems in smart cities.

Title: GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18763
Pdf URL: https://arxiv.org/pdf/2505.18763
Copy Paste: [[2505.18763]] GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning(https://arxiv.org/abs/2505.18763)
Keywords: generative
Abstract: Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.

Title: Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization

Authors: Dai Hai Nguyen, Hiroshi Mamitsuka, Atsuyoshi Nakamura
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.18765
Pdf URL: https://arxiv.org/pdf/2505.18765
Copy Paste: [[2505.18765]] Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization(https://arxiv.org/abs/2505.18765)
Keywords: generative
Abstract: We address the optimization problem of simultaneously minimizing multiple objective functionals over a family of probability distributions. This type of Multi-Objective Distributional Optimization commonly arises in machine learning and statistics, with applications in areas such as multiple target sampling, multi-task learning, and multi-objective generative modeling. To solve this problem, we propose an iterative particle-based algorithm, which we call Multiple Wasserstein Gradient Descent (MWGraD). It constructs a flow of intermediate empirical distributions, each represented by a set of particles, that gradually minimizes the multiple objective functionals simultaneously. Specifically, MWGraD consists of two key steps at each iteration. First, it estimates the Wasserstein gradient for each objective functional based on the current particles. Then, it aggregates these gradients into a single Wasserstein gradient using dynamically adjusted weights and updates the particles accordingly. In addition, we provide theoretical analysis and present experimental results on both synthetic and real-world datasets, demonstrating the effectiveness of MWGraD.
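
The two steps per iteration described in the abstract can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are not from the abstract: each objective is a potential-energy functional F_i(q) = E_q[0.5 * ||x - m_i||^2], whose Wasserstein gradient at a particle is simply x - m_i, so no estimation is needed; there are exactly two objectives; and the "dynamically adjusted weights" are taken to be the minimum-norm convex combination, which has a closed form for two gradients. The names `potential_grad`, `min_norm_weight`, and `mwgrad_step` are hypothetical.

```python
import numpy as np

def potential_grad(particles, center):
    # Wasserstein gradient of F(q) = E_q[0.5 * ||x - center||^2], evaluated at each particle.
    return particles - center

def min_norm_weight(g1, g2):
    # Weight w in [0, 1] minimizing the particle-averaged squared norm of w*g1 + (1-w)*g2.
    diff = g1 - g2
    denom = np.mean(np.sum(diff * diff, axis=1))
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same direction
    w = np.mean(np.sum((g2 - g1) * g2, axis=1)) / denom
    return float(np.clip(w, 0.0, 1.0))

def mwgrad_step(particles, centers, lr):
    # Step 1: one Wasserstein gradient per objective, evaluated on the current particles.
    g1 = potential_grad(particles, centers[0])
    g2 = potential_grad(particles, centers[1])
    # Step 2: aggregate with a dynamically chosen weight, then move the particles.
    w = min_norm_weight(g1, g2)
    return particles - lr * (w * g1 + (1.0 - w) * g2), w

rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])  # minimizers of the two toy objectives
particles = rng.normal(loc=[0.5, 3.0], scale=1.0, size=(100, 2))
for _ in range(300):
    particles, w = mwgrad_step(particles, centers, lr=0.1)
print("final weight:", round(w, 3), "particle mean:", particles.mean(axis=0))
```

In this toy setting the particles contract onto the segment joining the two centers, which is the Pareto set of the two objectives. They collapse to a single point only because the toy objectives contain no entropy or divergence term; objectives such as a KL divergence would require estimating the gradient from the particles, as the abstract indicates.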

Title: StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations

Authors: Yanjie Li, Wenxuan Zhang, Xinqi Lyu, Yihao Liu, Bin Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18766
Pdf URL: https://arxiv.org/pdf/2505.18766
Copy Paste: [[2505.18766]] StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations(https://arxiv.org/abs/2505.18766)
Keywords: generation
Abstract: Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation's ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion.

Title: Dual-Path Stable Soft Prompt Generation for Domain Generalization

Authors: Yuedi Zhang, Shuanghao Bai, Wanqi Zhou, Zhirong Luan, Badong Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18770
Pdf URL: https://arxiv.org/pdf/2505.18770
Copy Paste: [[2505.18770]] Dual-Path Stable Soft Prompt Generation for Domain Generalization(https://arxiv.org/abs/2505.18770)
Keywords: generation
Abstract: Domain generalization (DG) aims to learn a model using data from one or multiple related but distinct source domains that can generalize well to unseen out-of-distribution target domains. Inspired by the success of large pre-trained vision-language models (VLMs), prompt tuning has emerged as an effective generalization strategy. However, it often struggles to capture domain-specific features due to its reliance on manual or fixed prompt inputs. Recently, some prompt generation methods have addressed this limitation by dynamically generating instance-specific and domain-specific prompts for each input, enriching domain information and demonstrating potential for enhanced generalization. Through further investigation, we identify a notable issue in existing prompt generation methods: the same input often yields significantly different and suboptimal prompts across different random seeds, a phenomenon we term Prompt Variability. To address this, we introduce negative learning into the prompt generation process and propose Dual-Path Stable Soft Prompt Generation (DPSPG), a transformer-based framework designed to improve both the stability and generalization of prompts. Specifically, DPSPG incorporates a complementary prompt generator to produce negative prompts, thereby reducing the risk of introducing misleading information. Both theoretical and empirical analyses demonstrate that negative learning leads to more robust and effective prompts by increasing the effective margin and reducing the upper bound of the gradient norm. Extensive experiments on five DG benchmark datasets show that DPSPG consistently outperforms state-of-the-art methods while maintaining prompt stability.

Title: OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks

Authors: Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18775
Pdf URL: https://arxiv.org/pdf/2505.18775
Copy Paste: [[2505.18775]] OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks(https://arxiv.org/abs/2505.18775)
Keywords: generation, generative
Abstract: Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks and a powerful LLM-based judger for cognition-centric tasks to assess the alignment between generated images and user instructions. Using OmniGenBench, we evaluate mainstream generative models, including prevalent models like GPT-4o, Gemini-2.0-Flash, and Seedream, and provide in-depth comparisons and analyses of their performance. Code and data are available at this https URL.

Title: HD-PiSSA: High-Rank Distributed Orthogonal Adaptation

Authors: Yiding Wang, Fauxu Meng, Xuefeng Zhang, Fan Jiang, Pingzhi Tang, Muhan Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18777
Pdf URL: https://arxiv.org/pdf/2505.18777
Copy Paste: [[2505.18777]] HD-PiSSA: High-Rank Distributed Orthogonal Adaptation(https://arxiv.org/abs/2505.18777)
Keywords: generation
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.
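
The rank argument in the abstract can be illustrated numerically. The sketch below is not the authors' implementation: it uses hypothetical sizes (`d`, `r`, `num_devices`), builds PiSSA-style factors from an SVD, and replaces real gradient steps with random perturbations. The point it shows is only the linear algebra: a single rank-r adapter pair (A, B) that is updated to (A + dA, B + dB) changes the product by a matrix of rank at most 2r, and summing such deltas from adapters built on different principal components can reach rank 2r per device.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_devices = 64, 2, 8  # hypothetical sizes for illustration

W = rng.normal(size=(d, d))   # stand-in for a pre-trained weight matrix
U, S, Vt = np.linalg.svd(W)

def adapter_slice(k):
    # Device k receives the k-th block of r principal components of W.
    idx = slice(k * r, (k + 1) * r)
    root = np.sqrt(S[idx])
    A = U[:, idx] * root            # shape (d, r)
    B = root[:, None] * Vt[idx, :]  # shape (r, d)
    return A, B

def delta_after_update(A, B, step=0.1):
    # Stand-in for training: random perturbations take the place of gradient updates.
    dA = step * rng.normal(size=A.shape)
    dB = step * rng.normal(size=B.shape)
    return (A + dA) @ (B + dB) - A @ B  # change in the product, rank at most 2r

# Distinct principal-component slices per device, with deltas summed into W.
hd_delta = sum(delta_after_update(*adapter_slice(k)) for k in range(num_devices))

# Baseline: all devices share one adapter and one synchronized update,
# so the aggregate change is a single delta.
dp_delta = delta_after_update(*adapter_slice(0))

hd_rank = np.linalg.matrix_rank(hd_delta, tol=1e-8)
dp_rank = np.linalg.matrix_rank(dp_delta, tol=1e-8)
print("aggregated delta rank:", hd_rank, "| shared-adapter delta rank:", dp_rank)
```

With generic perturbations the aggregated delta reaches rank 2 * r * num_devices, which is 16r for 8 devices, while the shared-adapter delta stays at 2r. Reading the abstract's "over 16x" as a comparison against the per-device adapter rank r is an assumption of this sketch, not something the abstract spells out.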

Title: Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.18787
Pdf URL: https://arxiv.org/pdf/2505.18787
Copy Paste: [[2505.18787]] Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation(https://arxiv.org/abs/2505.18787)
Keywords: generation
Abstract: Deepfake (DF) detectors face significant challenges when deployed in real-world environments, particularly when encountering test samples that deviate from the training data through either postprocessing manipulations or distribution shifts. We demonstrate that postprocessing techniques can completely obscure the generation artifacts present in DF samples, leading to performance degradation of DF detectors. To address these challenges, we propose Think Twice before Adaptation (T$^2$A), a novel online test-time adaptation method that enhances the adaptability of detectors during inference without requiring access to source training data or labels. Our key idea is to enable the model to explore alternative options through an Uncertainty-aware Negative Learning objective rather than solely relying on its initial predictions as commonly seen in entropy minimization (EM)-based approaches. We also introduce an Uncertain Sample Prioritization strategy and a Gradients Masking technique to improve the adaptation by focusing on important samples and model parameters. Our theoretical analysis demonstrates that the proposed negative learning objective exhibits complementary behavior to EM, facilitating better adaptation capability. Empirically, our method achieves state-of-the-art results compared to existing test-time adaptation (TTA) approaches and significantly enhances the resilience and generalization of DF detectors during inference. Code is available at this https URL.

Title: VORTA: Efficient Video Diffusion via Routing Sparse Attention

Authors: Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18809
Pdf URL: https://arxiv.org/pdf/2505.18809
Copy Paste: [[2505.18809]] VORTA: Efficient Video Diffusion via Routing Sparse Attention(https://arxiv.org/abs/2505.18809)
Keywords: generation
Abstract: Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a $1.76\times$ end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.

Title: How to build a consistency model: Learning flow maps via self-distillation

Authors: Nicholas M. Boffi, Michael S. Albergo, Eric Vanden-Eijnden
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18825
Pdf URL: https://arxiv.org/pdf/2505.18825
Copy Paste: [[2505.18825]] How to build a consistency model: Learning flow maps via self-distillation(https://arxiv.org/abs/2505.18825)
Keywords: generative
Abstract: Building on the framework proposed in Boffi et al. (2024), we present a systematic approach for learning flow maps associated with flow and diffusion models. Flow map-based models, commonly known as consistency models, encompass recent efforts to improve the efficiency of generative models based on solutions to differential equations. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert existing distillation schemes into direct training algorithms via self-distillation, eliminating the need for pre-trained models. We empirically evaluate several instantiations of our framework, finding that high-dimensional tasks like image synthesis benefit from objective functions that avoid temporal and spatial derivatives of the flow map, while lower-dimensional tasks can benefit from objectives incorporating higher-order derivatives to capture sharp features.

Title: Localizing Knowledge in Diffusion Transformers

Authors: Arman Zarei, Samyadeep Basu, Keivan Rezaei, Zihao Lin, Sayan Nag, Soheil Feizi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18832
Pdf URL: https://arxiv.org/pdf/2505.18832
Copy Paste: [[2505.18832]] Localizing Knowledge in Diffusion Transformers(https://arxiv.org/abs/2505.18832)
Keywords: generative
Abstract: Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.

Title: Eye-See-You: Reverse Pass-Through VR and Head Avatars

Authors: Ankan Dash, Jingyi Gu, Guiling Wang, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18869
Pdf URL: https://arxiv.org/pdf/2505.18869
Copy Paste: [[2505.18869]] Eye-See-You: Reverse Pass-Through VR and Head Avatars(https://arxiv.org/abs/2505.18869)
Keywords: generation, generative
Abstract: Virtual Reality (VR) headsets, while integral to the evolving digital ecosystem, present a critical challenge: the occlusion of users' eyes and portions of their faces, which hinders visual communication and may contribute to social isolation. To address this, we introduce RevAvatar, an innovative framework that leverages AI methodologies to enable reverse pass-through technology, fundamentally transforming VR headset design and interaction paradigms. RevAvatar integrates state-of-the-art generative models and multimodal AI techniques to reconstruct high-fidelity 2D facial images and generate accurate 3D head avatars from partially observed eye and lower-face regions. This framework represents a significant advancement in AI4Tech by enabling seamless interaction between virtual and physical environments, fostering immersive experiences such as VR meetings and social engagements. Additionally, we present VR-Face, a novel dataset comprising 200,000 samples designed to emulate diverse VR-specific conditions, including occlusions, lighting variations, and distortions. By addressing fundamental limitations in current VR systems, RevAvatar exemplifies the transformative synergy between AI and next-generation technologies, offering a robust platform for enhancing human connection and interaction in virtual environments.

Title: Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18875
Pdf URL: https://arxiv.org/pdf/2505.18875
Copy Paste: [[2505.18875]] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation(https://arxiv.org/abs/2505.18875)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
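The semantic-aware permutation described in the SVG2 abstract (cluster tokens with k-means, then reorder them so each cluster is contiguous) can be illustrated with a minimal pure-Python sketch. The function names `kmeans_labels` and `semantic_permutation`, the first-k centroid initialization, and the toy 2D points are assumptions made for illustration; the abstract does not specify SVG2's actual kernels, features, or top-p budget control, so this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def kmeans_labels(points: Sequence[Sequence[float]], k: int, iters: int = 10) -> List[int]:
    """Plain k-means; centroids start at the first k points (an assumption, for determinism)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each token goes to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members (empty clusters are left as-is).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def semantic_permutation(labels: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (perm, inverse): perm groups same-cluster tokens contiguously, inverse undoes it."""
    perm = sorted(range(len(labels)), key=lambda i: labels[i])  # stable sort keeps in-cluster order
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return perm, inverse


# Toy example: two well-separated "semantic" groups interleaved by position.
tokens = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.0), (10.0, 10.1)]
labels = kmeans_labels(tokens, k=2)
perm, inverse = semantic_permutation(labels)
reordered = [tokens[i] for i in perm]                          # clusters are now contiguous
restored = [reordered[inverse[i]] for i in range(len(tokens))]  # original order recovered
```

After the permutation, tokens of one cluster occupy a contiguous range, which is the layout property the abstract credits with avoiding padding; the inverse permutation restores the original token order after attention.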

Title: REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Authors: Weihan Xu, Yimeng Ma, Jingyue Huang, Yang Li, Wenye Ma, Taylor Berg-Kirkpatrick, Julian McAuley, Paul Pu Liang, Hao-Wen Dong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18880
Pdf URL: https://arxiv.org/pdf/2505.18880
Copy Paste: [[2505.18880]] REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing(https://arxiv.org/abs/2505.18880)
Keywords: generation
Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.

Title: SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes

Authors: Dicong Qiu, Jiadi You, Zeying Gong, Ronghe Qiu, Hui Xiong, Junwei Liang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.18881
Pdf URL: https://arxiv.org/pdf/2505.18881
Copy Paste: [[2505.18881]] SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes(https://arxiv.org/abs/2505.18881)
Keywords: generation
Abstract: We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising about 3k and 10k episodes of the open-vocabulary object navigation task, respectively, derived from the SD-OVON-Scenes dataset with 2.5k photo-realistic scans of real-world environments and the SD-OVON-Objects dataset with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This approach enhances the realism of navigation tasks and of the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them along with state-of-the-art baselines on SD-OVON-3k. The datasets, benchmark and source code are publicly available.

Title: Partition Generative Modeling: Masked Modeling Without Masks

Authors: Justin Deschenaux, Lan Tran, Caglar Gulcehre
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18883
Pdf URL: https://arxiv.org/pdf/2505.18883
Copy Paste: [[2505.18883]] Partition Generative Modeling: Masked Modeling Without Masks(https://arxiv.org/abs/2505.18883)
Keywords: generation, generative
Abstract: We introduce "Partition Generative Models" (PGMs), a novel approach to masked generative modeling (MGMs), particularly effective for masked diffusion language modeling (MDLMs). PGM divides tokens into two distinct groups and employs sparse attention patterns to prevent cross-group information exchange. Hence, the model is trained to predict tokens in one group based solely on information from the other group. This partitioning strategy eliminates the need for MASK tokens entirely. While traditional MGMs inefficiently process MASK tokens during generation, PGMs achieve greater computational efficiency by operating exclusively on unmasked tokens. Our experiments on OpenWebText with a context length of 1024 tokens demonstrate that PGMs deliver at least 5x improvements in both latency and throughput compared to MDLM when using the same number of sampling steps, while generating samples with better generative perplexity than MDLM. Finally, we show that PGMs can be distilled with Self-Distillation Through Time (SDTT), a method originally devised for MDLM, in order to achieve further inference gains.
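The sparse attention pattern mentioned in the PGM abstract (two token groups with no cross-group information exchange) can be pictured as a block-structured boolean mask. The sketch below only illustrates that masking idea under stated assumptions: the names `random_partition` and `partition_attention_mask` are hypothetical, the 50/50 random split is an assumption, and the abstract does not say how the model wires predictions for one group from the other group's representations.

```python
import random
from typing import List, Sequence


def random_partition(num_tokens: int, seed: int = 0) -> List[int]:
    """Assign each position to group 0 or 1 (uniform random split is an assumption)."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(num_tokens)]


def partition_attention_mask(group_ids: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True only when positions i and j are in the same group,
    so attention under this mask never mixes information across groups."""
    return [[gi == gj for gj in group_ids] for gi in group_ids]


# Fixed toy partition so the result is easy to inspect by hand.
groups = [0, 1, 0, 1]
mask = partition_attention_mask(groups)
```

Because every position is a real token from one of the two groups, no MASK placeholder positions need to be processed, which is the efficiency argument the abstract makes.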

Title: PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models

Authors: Xiaoyan Hu, Lauren Pick, Ho-fung Leung, Farzan Farnia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18901
Pdf URL: https://arxiv.org/pdf/2505.18901
Copy Paste: [[2505.18901]] PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models(https://arxiv.org/abs/2505.18901)
Keywords: generation, generative
Abstract: The rapid advancement of generative AI models has provided users with numerous options to address their prompts. When selecting a generative AI model for a given prompt, users should consider not only the performance of the chosen model but also its associated service cost. The principle guiding such consideration is to select the least expensive model among the available satisfactory options. However, existing model-selection approaches typically prioritize performance, overlooking pricing differences between models. In this paper, we introduce PromptWise, an online learning framework designed to assign a sequence of prompts to a group of large language models (LLMs) in a cost-effective manner. PromptWise strategically queries cheaper models first, progressing to more expensive options only if the lower-cost models fail to adequately address a given prompt. Through numerical experiments, we demonstrate PromptWise's effectiveness across various tasks, including puzzles of varying complexity and code generation/translation tasks. The results highlight that PromptWise consistently outperforms cost-unaware baseline methods, emphasizing that directly assigning prompts to the most expensive models can lead to higher costs and potentially lower average performance.
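The escalation rule stated in the PromptWise abstract (query cheaper models first, move to more expensive ones only if the cheaper ones fail) can be sketched as a simple cascade. This is not the PromptWise algorithm itself, which the abstract describes as an online learning framework; the function `cheapest_first`, the `(name, cost, generate)` tuple layout, and the `is_satisfactory` checker are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model description: (name, cost per query, function mapping a prompt to an answer).
Model = Tuple[str, float, Callable[[str], str]]


def cheapest_first(
    prompt: str,
    models: List[Model],
    is_satisfactory: Callable[[str, str], bool],
) -> Tuple[Optional[str], Optional[str], float]:
    """Try models in order of increasing cost and stop at the first satisfactory answer.

    Returns (model name, answer, total cost paid); name and answer are None if every model fails.
    """
    total_cost = 0.0
    for name, cost, generate in sorted(models, key=lambda m: m[1]):
        total_cost += cost  # every query is paid for, including failed attempts
        answer = generate(prompt)
        if is_satisfactory(prompt, answer):
            return name, answer, total_cost
    return None, None, total_cost


# Toy stand-ins for real models: the cheap one fails, the expensive one succeeds.
toy_models: List[Model] = [
    ("large", 10.0, lambda p: "42"),
    ("small", 1.0, lambda p: "unsure"),
]
result = cheapest_first("6 * 7 = ?", toy_models, lambda p, a: a == "42")
```

In this toy run the total cost is 11.0 because the failed cheap query is still paid for; a learned policy such as the one the abstract describes would presumably aim to skip cheap models that are unlikely to succeed on a given prompt.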

Title: Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time

Authors: Jingxuan Xu, Hong Huang, Chuhang Zou, Manolis Savva, Yunchao Wei, Wuyang Chen
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2505.18926
Pdf URL: https://arxiv.org/pdf/2505.18926
Copy Paste: [[2505.18926]] Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time(https://arxiv.org/abs/2505.18926)
Keywords: generative
Abstract: We propose a neural physics system for real-time, interactive fluid simulations. Traditional physics-based methods, while accurate, are computationally intensive and suffer from latency issues. Recent machine-learning methods reduce computational costs while preserving fidelity; yet most still fail to satisfy the latency constraints for real-time use and lack support for interactive applications. To bridge this gap, we introduce a novel hybrid method that integrates numerical simulation, neural physics, and generative control. Our neural physics jointly pursues low-latency simulation and high physical fidelity by employing a fallback safeguard to classical numerical solvers. Furthermore, we develop a diffusion-based controller that is trained using a reverse modeling strategy to generate external dynamic force fields for fluid manipulation. Our system demonstrates robust performance across diverse 2D/3D scenarios, material types, and obstacle interactions, achieving real-time simulations at high frame rates (11~29% latency) while enabling fluid control guided by user-friendly freehand sketches. We present a significant step towards practical, controllable, and physically plausible fluid simulations for real-time interactive applications. We promise to release both models and data upon acceptance.

Title: How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation

Authors: Yining Pan, Qiongjie Cui, Xulei Yang, Na Zhao
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.18956
Pdf URL: https://arxiv.org/pdf/2505.18956
Copy Paste: [[2505.18956]] How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation(https://arxiv.org/abs/2505.18956)
Keywords: generation
Abstract: LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder's ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at

Title: CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

Authors: Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, Kuang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18958
Pdf URL: https://arxiv.org/pdf/2505.18958
Copy Paste: [[2505.18958]] CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation(https://arxiv.org/abs/2505.18958)
Keywords: generation
Abstract: Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: this https URL.

Title: MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models

Authors: Jeffrey A. Chan-Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18963
Pdf URL: https://arxiv.org/pdf/2505.18963
Copy Paste: [[2505.18963]] MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models(https://arxiv.org/abs/2505.18963)
Keywords: generative
Abstract: Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model leveraging a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. Our approach outperforms state-of-the-art methods, achieving accuracy gains of 4.4%, 2.9%, 1.6%, and 1.6% on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, respectively. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs. Our code is available on the project webpage: this https URL

Title: Protein Design with Dynamic Protein Vocabulary

Authors: Nuowei Liu, Jiahao Kuang, Yanting Liu, Changzhi Sun, Tao Ji, Yuanbin Wu, Man Lan
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.18966
Pdf URL: https://arxiv.org/pdf/2505.18966
Copy Paste: [[2505.18966]] Protein Design with Dynamic Protein Vocabulary(https://arxiv.org/abs/2505.18966)
Keywords: generative
Abstract: Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.

Title: GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization

Authors: Zixuan Chen, Hao Lin, Ke Xu, Xinghao Jiang, Tanfeng Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.18979
Pdf URL: https://arxiv.org/pdf/2505.18979
Copy Paste: [[2505.18979]] GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization(https://arxiv.org/abs/2505.18979)
Keywords: generation, generative
Abstract: Text-to-image (T2I) generation models can inadvertently produce not-safe-for-work (NSFW) content, prompting the integration of text and image safety filters. Recent advances employ large language models (LLMs) for semantic-level detection, rendering traditional token-level perturbation attacks largely ineffective. However, our evaluation shows that existing jailbreak methods are ineffective against these modern filters. We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) Dynamic Optimization, an iterative process that guides a large language model (LLM) using feedback from text safety filters and CLIP similarity scores to generate semantically aligned adversarial prompts; and (ii) Adaptive Safety Indicator Injection, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. GhostPrompt achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% (Sneakyprompt) to 99.0%, improving CLIP score from 0.2637 to 0.2762, and reducing the time cost by $4.2\times$. Moreover, it generalizes to unseen filters including GPT-4.1 and successfully jailbreaks DALLE 3 to generate NSFW images in our evaluation, revealing systemic vulnerabilities in current multimodal defenses. To support further research on AI safety and red-teaming, we will release code and adversarial prompts under a controlled-access protocol.

Title: STRICT: Stress Test of Rendering Images Containing Text

Authors: Tianyu Zhang, Xinyu Wang, Zhenghan Tai, Lu Li, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.18985
Pdf URL: https://arxiv.org/pdf/2505.18985
Copy Paste: [[2505.18985]] STRICT: Stress Test of Rendering Images Containing Text(https://arxiv.org/abs/2505.18985)
Keywords: generation, generative
Abstract: While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which instructions for generating text are not followed. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at this https URL.

Title: NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets, Methods and Results

Authors: Varun Jain, Zongwei Wu, Quan Zou, Louis Florentin, Henrik Turbell, Sandeep Siddhartha, Radu Timofte, others
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.18988
Pdf URL: https://arxiv.org/pdf/2505.18988
Copy Paste: [[2505.18988]] NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets, Methods and Results(https://arxiv.org/abs/2505.18988)
Keywords: quality assessment
Abstract: This paper presents a comprehensive review of the 1st Challenge on Video Quality Enhancement for Video Conferencing held at the NTIRE workshop at CVPR 2025, and highlights the problem statement, datasets, proposed solutions, and results. The aim of this challenge was to design a Video Quality Enhancement (VQE) model to enhance video quality in video conferencing scenarios by (a) improving lighting, (b) enhancing colors, (c) reducing noise, and (d) enhancing sharpness - giving a professional studio-like effect. Participants were given a differentiable Video Quality Assessment (VQA) model, training, and test videos. A total of 91 participants registered for the challenge. We received 10 valid submissions that were evaluated in a crowdsourced framework.

Title: Training-free Stylized Text-to-Image Generation with Fast Inference

Authors: Xin Ma, Yaohui Wang, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19063
Pdf URL: https://arxiv.org/pdf/2505.19063
Copy Paste: [[2505.19063]] Training-free Stylized Text-to-Image Generation with Fast Inference(https://arxiv.org/abs/2505.19063)
Keywords: generation, generative
Abstract: Although diffusion models exhibit impressive generative capabilities, existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a novel stylized image generation method, termed OmniPainter, that leverages a pre-trained large-scale diffusion model without requiring fine-tuning or any additional optimization. Specifically, we exploit the self-consistency property of latent consistency models to extract the representative style statistics from reference style images to guide the stylization process. We then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate output content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images. Our qualitative and quantitative experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.

Title: MMP-2K: A Benchmark Multi-Labeled Macro Photography Image Quality Assessment Database

Authors: Jiashuo Chang, Zhengyi Li, Jianxun Lou, Zhen Qiu, Hanhe Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19065
Pdf URL: https://arxiv.org/pdf/2505.19065
Copy Paste: [[2505.19065]] MMP-2K: A Benchmark Multi-Labeled Macro Photography Image Quality Assessment Database(https://arxiv.org/abs/2505.19065)
Keywords: quality assessment
Abstract: Macro photography (MP) is a specialized field of photography that captures objects at an extremely close range, revealing tiny details. Although an accurate macro photography image quality assessment (MPIQA) metric can benefit macro photograph capturing, which is vital in some domains such as scientific research and medical applications, the lack of MPIQA data limits the development of MPIQA metrics. To address this limitation, we conducted a large-scale MPIQA study. Specifically, to ensure diversity both in content and quality, we sampled 2,000 MP images from 15,700 MP images, collected from three public image websites. For each MP image, 17 (out of 21 after outlier removal) quality ratings and a detailed quality report of distortion magnitudes, types, and positions are gathered by a lab study. The images, quality ratings, and quality reports form our novel multi-labeled MPIQA database, MMP-2K. Experimental results showed that the state-of-the-art generic IQA metrics underperform on MP images. The database and supplementary materials are available at this https URL.

Title: Towards Generalized Proactive Defense against Face Swapping with Contour-Hybrid Watermark

Authors: Ruiyang Xia, Dawei Zhou, Decheng Liu, Lin Yuan, Jie Li, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19081
Pdf URL: https://arxiv.org/pdf/2505.19081
Copy Paste: [[2505.19081]] Towards Generalized Proactive Defense against Face Swapping with Contour-Hybrid Watermark(https://arxiv.org/abs/2505.19081)
Keywords: generation
Abstract: Face swapping, recognized as a privacy and security concern, has prompted considerable defensive research. With the advancements in AI-generated content, the discrepancies between the real and swapped faces have become nuanced. Considering the difficulty of forged traces detection, we shift the focus to the face swapping purpose and proactively embed elaborate watermarks against unknown face swapping techniques. Given that the constant purpose is to swap the original face identity while preserving the background, we concentrate on the regions surrounding the face to ensure robust watermark generation, while embedding the contour texture and face identity information to achieve progressive image determination. The watermark is located in the facial contour and contains hybrid messages, dubbed the contour-hybrid watermark (CMark). Our approach generalizes face swapping detection without requiring any swapping techniques during training and the storage of large-scale messages in advance. Experiments conducted across 8 face swapping techniques demonstrate the superiority of our approach compared with state-of-the-art passive and proactive detectors while achieving a favorable balance between the image quality and watermark robustness.

Title: Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Authors: Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19084
Pdf URL: https://arxiv.org/pdf/2505.19084
Copy Paste: [[2505.19084]] Jodi: Unification of Visual Generation and Understanding via Joint Modeling(https://arxiv.org/abs/2505.19084)
Keywords: generation
Abstract: Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at this https URL.

Title: Plug-and-Play Context Feature Reuse for Efficient Masked Generation

Authors: Xuejie Liu, Anji Liu, Guy Van den Broeck, Yitao Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19089
Pdf URL: https://arxiv.org/pdf/2505.19089
Copy Paste: [[2505.19089]] Plug-and-Play Context Feature Reuse for Efficient Masked Generation(https://arxiv.org/abs/2505.19089)
Keywords: generation, generative
Abstract: Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps via reusing feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.

Title: CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

Authors: Hui Zhang, Dexiang Hong, Maoke Yang, Yutao Chen, Zhao Zhang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19114
Pdf URL: https://arxiv.org/pdf/2505.19114
Copy Paste: [[2505.19114]] CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design(https://arxiv.org/abs/2505.19114)
Keywords: generation
Abstract: Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.

Title: Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition

Authors: Xiaoyang Liu, Bolin Qiu, Jiezhang Cao, Zheng Chen, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19120
Pdf URL: https://arxiv.org/pdf/2505.19120
Copy Paste: [[2505.19120]] Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition(https://arxiv.org/abs/2505.19120)
Keywords: restoration
Abstract: Image demoiréing remains a challenging task due to the complex interplay between texture corruption and color distortions caused by moiré patterns. Existing methods, especially those relying on direct image-to-image restoration, often fail to disentangle these intertwined artifacts effectively. While wavelet-based frequency-aware approaches offer a promising direction, their potential remains underexplored. In this paper, we present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions, which are then handled by a dual-branch architecture tailored to their distinct characteristics. We further propose a learnable Frequency Composition Transform (FCT) module to adaptively fuse the frequency-specific outputs, enabling consistent and high-fidelity reconstruction. To better aggregate the spatial dependencies and the inter-channel complementary information, we introduce a Spatial-Aware Channel Attention (SA-CA) module that refines moiré-sensitive regions without incurring high computational cost. Extensive experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size. The code is publicly available at this https URL.
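
Illustrative note on the Freqformer entry above: the abstract describes splitting an image into a low-frequency part (color distortions) and a high-frequency part (textures), but does not specify the decomposition. The sketch below is therefore only a stand-in, assuming a one-level Haar-style split in which each 2×2 block mean plays the role of the low-frequency band. The function name `split_frequencies` is hypothetical and not taken from the paper or its code.

```python
def split_frequencies(image):
    """Split a 2D grayscale image into low- and high-frequency parts.

    Assumption (not from the paper): `image` is a list of rows with even
    height and width, and the low band is the 2x2 block mean, as in a
    one-level Haar approximation upsampled back to full size.
    """
    h, w = len(image), len(image[0])
    low = [[0.0] * w for _ in range(h)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            block_mean = (image[i][j] + image[i][j + 1]
                          + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
            # Broadcast the block mean to all four pixels of the block.
            for di in (0, 1):
                for dj in (0, 1):
                    low[i + di][j + dj] = block_mean
    # The high band is the residual, so low + high reconstructs the input.
    high = [[image[i][j] - low[i][j] for j in range(w)] for i in range(h)]
    return low, high


image = [[1.0, 3.0, 2.0, 2.0],
         [5.0, 7.0, 2.0, 2.0]]
low, high = split_frequencies(image)
```

Because the high band is defined as a residual, the two parts always sum back to the input, which is the property a dual-branch design relies on when it processes the bands separately and then fuses them.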

Title: Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Authors: Eric Tillman Bill, Cristian Perez Jensen, Sotiris Anagnostidis, Dimitri von Rütte
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19122
Pdf URL: https://arxiv.org/pdf/2505.19122
Copy Paste: [[2505.19122]] Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers(https://arxiv.org/abs/2505.19122)
Keywords: generative
Abstract: Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps with stabilizing training in the U-net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, which is a novel conditioning method using learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by ~12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring ~5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.

Title: Benchmarking Laparoscopic Surgical Image Restoration and Beyond

Authors: Jialun Pei, Diandian Guo, Donghui Yang, Zhixi Li, Yuxin Feng, Long Ma, Bo Du, Pheng-Ann Heng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19161
Pdf URL: https://arxiv.org/pdf/2505.19161
Copy Paste: [[2505.19161]] Benchmarking Laparoscopic Surgical Image Restoration and Beyond(https://arxiv.org/abs/2505.19161)
Keywords: restoration, generation
Abstract: In laparoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degradations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks, e.g., desmoking, defogging, and desplashing. SurgClean comprises 1,020 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and provide performance for 22 representative image restoration approaches, including 12 generic and 10 task-specific approaches. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to strengthen restoration algorithms so that they improve surgical environments and the efficiency of clinical procedures.

Title: Towards Understanding the Mechanisms of Classifier-Free Guidance

Authors: Xiang Li, Rongrong Wang, Qing Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19210
Pdf URL: https://arxiv.org/pdf/2505.19210
Copy Paste: [[2505.19210]] Towards Understanding the Mechanisms of Classifier-Free Guidance(https://arxiv.org/abs/2505.19210)
Keywords: generation
Abstract: Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG's mechanism in the nonlinear regime.
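
Illustrative note on the classifier-free guidance entry above: the abstract analyzes CFG but does not restate the update rule. As a minimal sketch, the commonly used form combines an unconditional and a conditional noise prediction as `eps_uncond + w * (eps_cond - eps_uncond)`. The scale convention (where `w = 1` recovers the conditional prediction) and the name `cfg_combine` are assumptions for illustration, not details taken from the paper.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions with classifier-free guidance.

    Assumed convention: guidance_scale = 0 returns the unconditional
    prediction, 1 returns the conditional one, and values above 1
    extrapolate further along the conditional direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy three-dimensional predictions; real models emit image-shaped tensors.
eps_uncond = [0.10, -0.20, 0.05]
eps_cond = [0.30, -0.10, 0.00]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=3.0)
```

The difference `eps_cond - eps_uncond` is the direction that the abstract's three terms (mean shift, positive CPC, negative CPC) decompose in the linear setting.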

Title: RAISE: Realness Assessment for Image Synthesis and Evaluation

Authors: Aniruddha Mukherjee, Spriha Dubey, Somdyuti Paul
Subjects: cs.CV, cs.AI, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2505.19233
Pdf URL: https://arxiv.org/pdf/2505.19233
Copy Paste: [[2505.19233]] RAISE: Realness Assessment for Image Synthesis and Evaluation(https://arxiv.org/abs/2505.19233)
Keywords: generative
Abstract: The rapid advancement of generative AI has enabled the creation of highly photorealistic visual content, offering practical substitutes for real images and videos in scenarios where acquiring real data is difficult or expensive. However, reliably substituting real visual content with AI-generated counterparts requires robust assessment of the perceived realness of AI-generated visual content, a challenging task due to its inherent subjective nature. To address this, we conducted a comprehensive human study evaluating the perceptual realness of both real and AI-generated images, resulting in a new dataset, containing images paired with subjective realness scores, introduced as RAISE in this paper. Further, we develop and train multiple models on RAISE to establish baselines for realness prediction. Our experimental results demonstrate that features derived from deep foundation vision models can effectively capture the subjective realness. RAISE thus provides a valuable resource for developing robust, objective models of perceptual realness assessment.

Title: DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Authors: Chen Shi, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19239
Pdf URL: https://arxiv.org/pdf/2505.19239
Copy Paste: [[2505.19239]] DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving(https://arxiv.org/abs/2505.19239)
Keywords: generation
Abstract: Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.

Title: ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Authors: Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19241
Pdf URL: https://arxiv.org/pdf/2505.19241
Copy Paste: [[2505.19241]] ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment(https://arxiv.org/abs/2505.19241)
Keywords: generation
Abstract: The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.
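
Illustrative note on the ActiveDPO entry above: the abstract says the LLM itself parameterizes the reward model, which is the defining idea of Direct Preference Optimization. The sketch below shows only the standard per-pair DPO loss, assuming summed sequence log-probabilities are already available. It does not implement ActiveDPO's data selection criterion, which the abstract does not specify. The function name, argument names, and the default `beta` are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    The loss is -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy log-probabilities: the policy prefers the chosen response more
# strongly than the reference model does, so the loss falls below log(2).
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2. Training pushes the margin positive on annotated pairs, which is why the choice of pairs to annotate matters for sample efficiency.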

Title: Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Authors: Yu Zhang, Jialei Zhou, Xinchen Li, Qi Zhang, Zhongwei Wan, Tianyu Wang, Duoqian Miao, Changwei Wang, Longbing Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19261
Pdf URL: https://arxiv.org/pdf/2505.19261
Copy Paste: [[2505.19261]] Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning(https://arxiv.org/abs/2505.19261)
Keywords: generation
Abstract: Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.
摘要：当前的文本到图像扩散生成通常采用完整的文本调节。由于复杂的语法，扩散变压器（DIT）固有地遭受了完整文本字幕的理解缺陷。一fly完整的文本输入要么忽略关键语义细节，要么通过同时建模各种语义原始类型来引起语义混乱。为了减轻DIT的缺陷，我们提出了一个名为Dit-ST的新颖的分型调节框架。该框架将完整的文本字幕转换为一个拆分文字字幕（简化句子的集合），以明确表达各种语义基原始人及其互连。然后以分层和增量的方式将拆分文本字幕注入DIT-ST的不同变性阶段。具体而言，DIT-ST利用大型语言模型来解析字幕，提取不同的原始图，并分层整理并将这些原语构建为分式文本输入。此外，我们根据其差异敏感性将扩散剥离过程分配给各种语义原始类型，并确定适当的时间段，以通过交叉注意将各种语义原始类型的多种语义原始类型注入输入令牌。通过这种方式，DIT-ST会增强在不同阶段的特定语义原始类型的表示。广泛的实验验证了我们提出的DIT-ST在减轻完整理解缺陷方面的有效性。

Title: Hypercube-RAG: Hypercube-Based Retrieval-Augmented Generation for In-domain Scientific Question-Answering

Authors: Jimeng Shi, Sizhe Zhou, Bowen Jin, Wei Hu, Shaowen Wang, Giri Narasimhan, Jiawei Han
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19288
Pdf URL: https://arxiv.org/pdf/2505.19288
Copy Paste: [[2505.19288]] Hypercube-RAG: Hypercube-Based Retrieval-Augmented Generation for In-domain Scientific Question-Answering(https://arxiv.org/abs/2505.19288)
Keywords: generation
Abstract: Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG), which empowers LLMs to generate more qualified responses with retrieved external data and knowledge, has shown its high promise. However, traditional semantic similarity-based RAGs struggle to return concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). Built on a multi-dimensional (cube) structure called Hypercube, which can index documents in an application-driven, human-defined, multi-dimensional space, we introduce the Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities and topics and then retrieves relevant documents from cubes by aligning these decomposed components with hypercube dimensions. Experiments on three in-domain scientific QA datasets demonstrate that our method improves accuracy by 3.7% and boosts retrieval efficiency by 81.2%, measured as relative gains over the strongest RAG baseline. More importantly, our Hypercube-RAG inherently offers explainability by revealing the underlying predefined hypercube dimensions used for retrieval. The code and data sets are available at this https URL.
摘要：大型语言模型（LLM）通常需要合并外部知识来解决特定于主题的问题。检索增强生成（RAG）使LLM能够借助检索到的外部数据和知识生成更合格的响应，已显示出很高的前景。但是，传统的基于语义相似性的RAG难以为领域知识密集型任务（例如科学问答（QA））返回简洁但高度相关的信息。Hypercube是一种多维（立方体）结构，可以在应用程序驱动的，人为定义的多维空间中索引文档；在此基础上，我们介绍了Hypercube-RAG，这是一种用于精确且高效检索的新型RAG框架。给定查询，Hypercube-RAG首先根据其实体和主题对其进行分解，然后通过将这些分解的组件与超立方体维度对齐，从立方体中检索相关文档。在三个领域内科学QA数据集上的实验表明，我们的方法将准确性提高了3.7％，并将检索效率提高了81.2％，这是以相对于最强RAG基线的相对增益来衡量的。更重要的是，我们的Hypercube-RAG通过揭示用于检索的底层预定义超立方体维度，固有地提供了可解释性。代码和数据集可在此https URL上找到。
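Note on the Hypercube-RAG abstract above: a toy sketch can make the retrieval idea concrete. Documents are indexed under human-defined dimensions, and a decomposed query is matched dimension by dimension. Everything in the block is hypothetical: the class name `ToyHypercube`, the dimensions (`location`, `event`, `topic`), and the documents. The query decomposition is supplied by hand rather than extracted automatically. It is not the authors' implementation.

```python
from collections import defaultdict


class ToyHypercube:
    """Hypothetical multi-dimensional index: dimension -> value -> set of doc ids."""

    def __init__(self, dimensions):
        self.index = {dim: defaultdict(set) for dim in dimensions}

    def add(self, doc_id, **cells):
        # Register the document under every value it carries in each dimension.
        for dim, values in cells.items():
            for value in values:
                self.index[dim][value.lower()].add(doc_id)

    def retrieve(self, **components):
        """Intersect the documents matched in each queried dimension.

        Returns the matching doc ids plus the dimension values that fired,
        which is what makes the result explainable.
        """
        hits = None
        matched = {}
        for dim, values in components.items():
            if not values:
                continue
            docs = set()
            for value in values:
                found = self.index[dim].get(value.lower(), set())
                if found:
                    matched.setdefault(dim, []).append(value)
                docs |= found
            hits = docs if hits is None else hits & docs
        return sorted(hits or []), matched


cube = ToyHypercube(["location", "event", "topic"])
cube.add("d1", location=["Florida"], event=["hurricane"], topic=["flooding"])
cube.add("d2", location=["Florida"], event=["drought"], topic=["agriculture"])
cube.add("d3", location=["Texas"], event=["hurricane"], topic=["flooding"])

# Hand-written decomposition of a query such as "hurricane impacts in Florida".
docs, matched = cube.retrieve(location=["Florida"], event=["hurricane"])
print(docs, matched)
```

Returning the matched dimension values alongside the documents mirrors the explainability claim in the abstract: the reader can see which predefined dimensions produced each hit.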

Title: TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Authors: Kazi Mahathir Rahman, Showrin Rahman, Sharmin Sultana Srishty
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19291
Pdf URL: https://arxiv.org/pdf/2505.19291
Copy Paste: [[2505.19291]] TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis(https://arxiv.org/abs/2505.19291)
Keywords: generation
Abstract: Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% faster and requiring only 2MB of memory to run.
摘要：嵌入文本的图像生成在图形设计，广告和数字内容创建等行业中起着至关重要的作用。利用扩散模型（例如TextDiffuser-2）的文本到图像生成方法在生成带有嵌入文本的图像方面表现出了有希望的结果。TextDiffuser-2有效地生成了边界框布局，以指导视觉文本的渲染，从而实现高保真度和连贯性。但是，现有方法通常依赖于资源密集型流程，并且在CPU和GPU平台上有效运行的能力受到限制。为了应对这些挑战，我们提出了一条新型的两阶段管道，该管道将用于快速和优化文本布局生成的强化学习（RL）与基于扩散的图像合成模型相结合。我们基于RL的方法可以显着加速边界框的预测步骤，同时减少重叠，从而使系统在CPU和GPU上有效运行。广泛的评估表明，我们的框架保持或超过TextDiffuser-2在文本放置和图像合成方面的质量，并具有明显更快的运行时和更高的灵活性。我们的方法已在MARIOEval基准上进行了评估，取得了接近最新模型的OCR和CLIPScore指标，同时速度快了97.64％，并且只需要2MB的内存即可运行。

Title: Alchemist: Turning Public Text-to-Image Data into Generative Gold

Authors: Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19297
Pdf URL: https://arxiv.org/pdf/2505.19297
Copy Paste: [[2505.19297]] Alchemist: Turning Public Text-to-Image Data into Generative Gold(https://arxiv.org/abs/2505.19297)
Keywords: generative
Abstract: Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.
摘要：训练预训练将文本到图像（T2I）模型具有广泛的世界知识，但仅此模型通常不足以实现高审美质量和一致性。因此，监督的微调（SFT）对于进一步的完善至关重要。但是，其有效性在很大程度上取决于微调数据集的质量。现有的公共SFT数据集经常针对狭窄的域（例如动漫或特定艺术风格），而创建高质量的通用SFT数据集仍然是一个重大挑战。当前的策展方法通常是昂贵的，并且难以确定真正有影响力的样本。公共通用数据集的稀缺性使这一挑战更加复杂，因为领先的模型通常依赖于大型，专有且记录不足的内部数据，从而阻碍了更广泛的研究进度。本文介绍了一种新的方法，用于创建通用SFT数据集，通过利用预先训练的生成模型作为高影响力训练样本的估计器。我们将此方法应用于构建和释放炼金术士，这是一种紧凑型（3,350个样本）但高效的SFT数据集。实验表明，炼金术士大大提高了五种公共T2I模型的生成质量，同时保留了多样性和风格。此外，我们向公众发布了微调模型的权重。

Title: Concept Reachability in Diffusion Models: Beyond Dataset Constraints

Authors: Marta Aparicio Rodriguez, Xenia Miscouridou, Anastasia Borovykh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19313
Pdf URL: https://arxiv.org/pdf/2505.19313
Copy Paste: [[2505.19313]] Concept Reachability in Diffusion Models: Beyond Dataset Constraints(https://arxiv.org/abs/2505.19313)
Keywords: generation
Abstract: Despite significant advances in the quality and complexity of the generations in text-to-image models, prompting does not always lead to the desired outputs. Controlling model behaviour by directly steering intermediate model activations has emerged as a viable alternative, making it possible to reach concepts in latent space that may otherwise remain inaccessible through prompting. In this work, we introduce a set of experiments to deepen our understanding of concept reachability. We design a training data setup with three key obstacles: scarcity of concepts, underspecification of concepts in the captions, and data biases with tied concepts. Our results show: (i) concept reachability in latent space exhibits a distinct phase transition, with only a small number of samples being sufficient to enable reachability, (ii) where in the latent space the intervention is performed critically impacts reachability, showing that certain concepts are reachable only at certain stages of transformation, and (iii) while prompting ability rapidly diminishes with a decrease in quality of the dataset, concepts often remain reliably reachable through steering. Model providers can leverage this to bypass costly retraining and dataset curation and instead innovate with user-facing control mechanisms.
摘要：尽管文本到图像模型生成结果的质量和复杂性取得了重大进步，但提示并不总是会产生所需的输出。通过直接转向（steering）中间模型激活来控制模型行为已成为一种可行的替代方案，可以到达潜在空间中原本可能无法通过提示访问的概念。在这项工作中，我们介绍了一系列实验，以加深我们对概念可达性的理解。我们设计了一个具有三个关键障碍的训练数据设置：概念的稀缺性，标题中概念的描述不足以及带有绑定概念的数据偏差。我们的结果表明：（i）潜在空间中的概念可达性表现出明显的相变，只需少量样本就足以实现可达性；（ii）在潜在空间中进行干预的位置对可达性有关键影响，表明某些概念只有在变换的特定阶段才可达；（iii）虽然提示能力会随着数据集质量的下降而迅速减弱，但概念通常仍可通过转向可靠地到达。模型提供商可以利用这一点绕过昂贵的重新训练和数据集策划，转而通过面向用户的控制机制进行创新。

Title: Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales

Authors: Charles Godfrey, Ping Nie, Natalia Ostapuk, David Ken, Shang Gao, Souheil Inati
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2505.19334
Pdf URL: https://arxiv.org/pdf/2505.19334
Copy Paste: [[2505.19334]] Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales(https://arxiv.org/abs/2505.19334)
Keywords: generation
Abstract: Large language models (LLMs) obtain state-of-the-art zero-shot relevance ranking performance on a variety of information retrieval tasks. The two most common prompts to elicit LLM relevance judgments are pointwise scoring (a.k.a. relevance generation), where the LLM sees a single query-document pair and outputs a single relevance score, and listwise ranking (a.k.a. permutation generation), where the LLM sees a query and a list of documents and outputs a permutation, sorting the documents in decreasing order of relevance. The current research community consensus is that listwise ranking yields superior performance, and significant research effort has been devoted to crafting LLM listwise ranking algorithms. The underlying hypothesis is that LLMs are better at making relative relevance judgments than absolute ones. In tension with this hypothesis, we find that the gap between pointwise scoring and listwise ranking shrinks when pointwise scoring is implemented using a sufficiently large ordinal relevance label space, becoming statistically insignificant for many LLM-benchmark dataset combinations (where "significant" means "95% confidence that listwise ranking improves NDCG@10"). Our evaluations span four LLMs, eight benchmark datasets from the BEIR and TREC-DL suites, and two proprietary datasets with relevance labels collected after the training cut-off of all LLMs evaluated.
摘要：大型语言模型（LLM）在各种信息检索任务上取得了最先进的零样本相关性排序性能。引出LLM相关性判断的两种最常见的提示是逐点评分（又称相关性生成）和列表排序（又称排列生成）：前者中LLM看到单个查询-文档对并输出单个相关性分数，后者中LLM看到一个查询和一个文档列表并输出一个排列，按相关性递减的顺序对文档排序。当前研究社区的共识是，列表排序产生更优的性能，并且大量研究工作致力于设计LLM列表排序算法。其基本假设是，LLM更擅长做出相对相关性判断，而不是绝对判断。与这一假设相抵触的是，我们发现，当使用足够大的序数相关性标签空间实现逐点评分时，逐点评分和列表排序之间的差距会缩小，对于许多LLM-基准数据集组合而言在统计上变得不显著（其中“显著”指“以95％的置信度认为列表排序提升了NDCG@10”）。我们的评估涵盖四个LLM，来自BEIR和TREC-DL套件的八个基准数据集，以及两个专有数据集，其相关性标签是在所有被评估LLM的训练截止日期之后收集的。
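Note on the "Likert or Not" abstract above: the comparison is scored with NDCG@10, and its central claim concerns the size of the ordinal label space used for pointwise scoring. The sketch below computes NDCG@k from pointwise scores and shows one plausible mechanism, namely that a coarse scale produces ties that a finer scale can break. The exponential-gain form of DCG, the function names, and the scores are assumptions chosen for illustration. They are not the paper's evaluation code or data.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, labels, k=10):
    """Rank documents by descending predicted score, then normalize by the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ground-truth relevance for four documents of one query.
labels = [1, 3, 0, 2]

# Hypothetical pointwise scores from the same judge on two label spaces.
coarse_scores = [1, 1, 0, 1]   # binary scale: three documents tie
fine_scores = [4, 9, 1, 7]     # 0-10 scale: ties are broken

ndcg_coarse = ndcg_at_k(coarse_scores, labels)
ndcg_fine = ndcg_at_k(fine_scores, labels)
print(f"coarse: {ndcg_coarse:.3f}  fine: {ndcg_fine:.3f}")
```

Ties on the coarse scale are resolved by input order (Python's sort is stable), so the ranking carries less information than the judge may actually have.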

Title: Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions

Authors: Chenrui Ma, Xi Xiao, Tianyang Wang, Yanning Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19352
Pdf URL: https://arxiv.org/pdf/2505.19352
Copy Paste: [[2505.19352]] Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions(https://arxiv.org/abs/2505.19352)
Keywords: generative
Abstract: Current text-driven image editing methods typically follow one of two directions: relying on large-scale, high-quality editing pair datasets to improve editing precision and diversity, or exploring alternative dataset-free techniques. However, constructing large-scale editing datasets requires carefully designed pipelines, is time-consuming, and often results in unrealistic samples or unwanted artifacts. Meanwhile, dataset-free methods may suffer from limited instruction comprehension and restricted editing capabilities. Faced with these challenges, the present work develops a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs, instead of relying on editing pair datasets. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing. Extensive experiments demonstrate that the proposed approach attains state-of-the-art performance across various tasks and benchmarks, while exhibiting strong adaptability to various types of generative models.
摘要：当前的文本驱动图像编辑方法通常遵循以下两个方向之一：依靠大规模的高质量编辑对数据集来提高编辑精度和多样性，或探索替代数据集的无数据集技术。但是，构建大规模编辑数据集需要精心设计的管道，耗时，并且通常会导致不切实际的样品或不必要的人工制品。同时，无数据集方法可能会受到有限的指导理解和限制编辑功能的影响。面对这些挑战，目前的工作为指导驱动的图像编辑开发了一种新颖的范式，该范例利用了广泛的可用和巨大的文本图像对，而不是依靠编辑配对数据集。我们的方法引入了一个多尺度可学习区域，以本地化和指导编辑过程。通过将图像及其文本描述之间的一致性视为监督和学习生成特定任务的编辑区域，我们的方法可以实现高保真，准确和指令一致的图像编辑。广泛的实验表明，所提出的方法在各种任务和基准测试中达到了最先进的性能，同时表现出对各种生成模型的强大适应性。

Title: Absolute Coordinates Make Motion Generation Easy

Authors: Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, Huaizu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19377
Pdf URL: https://arxiv.org/pdf/2505.19377
Copy Paste: [[2505.19377]] Absolute Coordinates Make Motion Generation Easy(https://arxiv.org/abs/2505.19377)
Keywords: generation
Abstract: State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplifies training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without additional task-specific reengineering and costly classifier guidance generation from control signals. Finally, we demonstrate promising generalization to directly generate SMPL-H mesh vertices in motion from text, laying a strong foundation for future research and motion-related applications.
摘要：最先进的文本到动作生成模型依赖于由HumanML3D普及的运动学，局部相关运动表示，该运动代表相对于骨盆编码运动，并具有内置冗余的上一个框架。尽管此设计简化了对早期模型的培训，但它引入了扩散模型的关键限制，并阻碍了对下游任务的适用性。在这项工作中，我们重新审视运动表示，并提出了一种从文本到动作生成的根本简化和长期遗弃的替代方案：全球空间中的绝对关节坐标。通过对设计选择的系统分析，我们表明，即使有一个简单的变压器主链，并且没有辅助运动学感知的损失，这种配方也可以显着更高的运动保真度，改善文本对准和强大的可扩展性。此外，我们的公式自然支持下游任务，例如文本驱动的运动控制和时间/空间编辑，而无需从控制信号中进行其他特定于任务的重新设计和昂贵的分类器指导生成。最后，我们展示了有希望的概括，可以直接从文本中进行运动中的SMPL-H网格顶点，为未来的研究和运动相关应用奠定了坚实的基础。
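Note on the motion-representation abstract above: the two encodings can be contrasted with a deliberately simplified sketch. It models only translation. The root is stored as a per-frame velocity and the joints as offsets from the root. This omits the facing-rotation and redundant features of the actual HumanML3D representation. The function names and array shapes are assumptions for illustration.

```python
import numpy as np


def absolute_to_relative(absolute_joints, root_index=0):
    """Split absolute joints (T, J, 3) into root velocity (T, 3) and root-relative joints (T, J, 3)."""
    root_position = absolute_joints[:, root_index, :]
    # The first "velocity" is the offset from the origin, so a cumulative sum restores positions.
    root_velocity = np.diff(root_position, axis=0, prepend=np.zeros((1, 3)))
    local_joints = absolute_joints - root_position[:, None, :]
    return root_velocity, local_joints


def relative_to_absolute(root_velocity, local_joints):
    """Integrate root velocity over time and add the root-relative joint offsets."""
    root_position = np.cumsum(root_velocity, axis=0)
    return root_position[:, None, :] + local_joints


rng = np.random.default_rng(0)
motion = rng.normal(size=(5, 4, 3))           # 5 frames, 4 joints, xyz
velocity, local = absolute_to_relative(motion)
restored = relative_to_absolute(velocity, local)

# A single velocity error at frame 1 shifts every later frame after integration.
noisy_velocity = velocity.copy()
noisy_velocity[1] += 0.1
drift = relative_to_absolute(noisy_velocity, local) - motion
print(np.abs(drift).max(axis=(1, 2)))
```

In the relative form, recovering a global position requires integrating over all earlier frames, so a local error propagates forward. In absolute coordinates, each frame's positions are stored directly.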

Title: Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Authors: Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19386
Pdf URL: https://arxiv.org/pdf/2505.19386
Copy Paste: [[2505.19386]] Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals(https://arxiv.org/abs/2505.19386)
Keywords: generation
Abstract: Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.
摘要：视频生成模型的最新进展引发了人们对能够模拟现实环境的世界模型的兴趣。尽管导航已经得到了充分探索，但模仿现实世界作用力的，具有物理意义的交互在很大程度上仍未得到充分研究。在这项工作中，我们研究了使用物理力作为视频生成的控制信号，并提出了力提示，使用户能够通过局部点力（例如戳一株植物）和全局风力场（例如风吹在织物上）与图像进行交互。我们证明，这些力提示可以利用原始预训练模型中的视觉和运动先验，使视频对物理控制信号做出逼真的响应，而无需在推理时使用任何3D资产或物理模拟器。力提示的主要挑战是难以获得高质量的成对力-视频训练数据：在现实世界中是因为难以获得力信号，在合成数据中则是因为物理模拟器的视觉质量和领域多样性有限。我们的关键发现是，当使用Blender合成的视频使视频生成模型适应物理力条件时，即使只有少数物体的有限演示，模型也可以很好地泛化。我们的方法可以生成模拟各种几何形状，场景和材料上作用力的视频。我们还试图了解这种泛化的来源，并进行了消融实验，揭示了两个关键要素：视觉多样性和训练过程中特定文本关键字的使用。我们的方法仅用约15k个训练样本在四个A100 GPU上训练一天，并且在力的遵循程度和物理真实感方面优于现有方法，从而使世界模型更接近现实世界的物理交互。我们在我们的项目页面上发布所有数据集，代码，权重和交互式视频演示。

Title: Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression

Authors: Yiwei Xie, Ping Liu, Zheng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19398
Pdf URL: https://arxiv.org/pdf/2505.19398
Copy Paste: [[2505.19398]] Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression(https://arxiv.org/abs/2505.19398)
Keywords: generation, generative
Abstract: Text-to-Image (T2I) models have demonstrated impressive capabilities in generating high-quality and diverse visual content from natural language prompts. However, uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. To address these concerns, the concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models while preserving their overall utility. This survey provides a comprehensive overview and in-depth synthesis of concept erasure techniques in T2I diffusion models. We systematically categorize existing approaches along three key dimensions: intervention level, which identifies specific model components targeted for concept removal; optimization structure, referring to the algorithmic strategies employed to achieve suppression; and semantic scope, concerning the complexity and nature of the concepts addressed. This multi-dimensional taxonomy enables clear, structured comparisons across diverse methodologies, highlighting fundamental trade-offs between erasure specificity, generalization, and computational complexity. We further discuss current evaluation benchmarks, standardized metrics, and practical datasets, emphasizing gaps that limit comprehensive assessment, particularly regarding robustness and practical effectiveness. Finally, we outline major challenges and promising future directions, including disentanglement of concept representations, adaptive and incremental erasure strategies, adversarial robustness, and new generative architectures. This survey aims to guide researchers toward safer, more ethically aligned generative models, providing foundational knowledge and actionable recommendations to advance responsible development in generative AI.
摘要：文本对图像（T2I）模型在从自然语言提示中产生高质量和多样化的视觉内容方面表现出了令人印象深刻的功能。但是，敏感，版权或有害图像的不受控制的复制构成了严重的道德，法律和安全挑战。为了解决这些问题，概念擦除范式已成为一个有希望的方向，从而可以选择性地从生成模型中删除特定的语义概念，同时保留其整体效用。这项调查提供了T2I扩散模型中概念擦除技术的全面概述和深入的综合。我们将沿三个关键维度的现有方法系统地分类：干预水平，该方法确定了针对概念删除的特定模型组件；优化结构，指用于实现抑制的算法策略；和语义范围，涉及所解决概念的复杂性和性质。这种多维分类法实现了各种方法的清晰，结构化的比较，突出了擦除特异性，概括和计算复杂性之间的基本权衡。我们进一步讨论当前的评估基准，标准化指标和实际数据集，强调了限制全面评估的差距，尤其是关于鲁棒性和实际有效性的差距。最后，我们概述了主要挑战和有希望的未来方向，包括概念表示，适应性和增量擦除策略，对抗性鲁棒性和新的生成型体系结构的解开。这项调查旨在指导研究人员建立更安全，更具道德上一致的生成模型，提供基础知识和可行的建议，以推动生成AI中负责任的发展。

Title: MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Authors: Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19415
Pdf URL: https://arxiv.org/pdf/2505.19415
Copy Paste: [[2505.19415]] MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models(https://arxiv.org/abs/2505.19415)
Keywords: generation
Abstract: Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) novel Aspect Matching Score (AMS): a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.
摘要：最近的多模式图像发生器，例如GPT-4O，Gemini 2.0 Flash和Gemini 2.5 Pro Excel，以遵循复杂的说明，编辑图像并保持概念一致性。但是，它们仍然通过不交互式工具包进行评估：缺乏多模式调节的文本对图像（T2I）基准，以及忽略构成构图语义和常识知识的自定义图像生成基准。我们提出了Mmig-Bench，这是一种全面的多模式图像生成基准，通过将4,850个注释的文本提示与380个主题，人类，动物，物体和艺术风格的1,750个多视图参考图像配对，从而统一这些任务。 Mmig-Bench配备了三级评估框架：（1）视觉伪像的低级指标和对象的身份保存；（2）新颖的方面匹配分数（AMS）：基于VQA的中层度量标准，可提供精细的及时图像对齐，并显示与人类判断的密切相关；（3）美学和人类偏好的高级指标。使用MMIG板凳，我们基准了17个最先进的模型，包括Gemini 2.5 Pro，Flux，Dreambooth和IP-Aprapter，并以32K人类评级验证我们的指标，从而深入了解建筑和数据设计。我们将发布数据集和评估代码，以促进严格的，统一的评估，并加快多模式图像生成中的未来创新。

Title: LlamaSeg: Image Segmentation via Autoregressive Mask Generation

Authors: Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19422
Pdf URL: https://arxiv.org/pdf/2505.19422
Copy Paste: [[2505.19422]] LlamaSeg: Image Segmentation via Autoregressive Mask Generation(https://arxiv.org/abs/2505.19422)
Keywords: generation, generative
Abstract: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
摘要：我们提出了LlamaSeg，这是一个视觉自回归框架，可通过自然语言指令统一多个图像分割任务。我们将图像分割重新表述为视觉生成问题，将掩码表示为“视觉”令牌，并采用LLaMA风格的Transformer直接从图像输入中预测它们。通过遵循下一个令牌预测范式，我们的方法自然地将分割任务整合到自回归体系结构中。为了支持大规模训练，我们引入了数据注释管道并构建了SA-OVRS数据集，该数据集包含2M个分割掩码，标注了超过5,800个开放词汇标签或多样的文本描述，覆盖了广泛的现实场景。这使我们的模型能够根据文本提示在图像中定位对象，并生成细粒度的掩码。为了更准确地评估视觉生成模型产生的掩码的质量，我们进一步提出了一个复合度量标准，将交并比（IoU）与平均Hausdorff距离（AHD）相结合，提供了更精确的轮廓保真度评估。实验结果表明，我们的方法在多个数据集上超过了现有的生成模型，并产生了更详细的分割掩码。
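Note on the LlamaSeg abstract above: the two ingredients of its composite metric can be computed for binary masks as sketched below. The abstract does not say how IoU and AHD are combined, so the sketch reports them separately and does not guess at the formula. Computing AHD over all foreground pixels rather than extracted contours is a simplifying assumption. The function names are hypothetical.

```python
import numpy as np


def iou(pred, target):
    """Intersection over Union of two binary masks of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, target).sum() / union)


def average_hausdorff(pred, target):
    """Symmetric mean of nearest-neighbour distances between foreground pixel sets."""
    p = np.argwhere(pred)
    t = np.argwhere(target)
    if len(p) == 0 or len(t) == 0:
        return float("inf")  # undefined when either mask has no foreground
    # Pairwise Euclidean distances, shape (len(p), len(t)); fine for small masks.
    dists = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    return float(0.5 * (dists.min(axis=1).mean() + dists.min(axis=0).mean()))


target_mask = np.zeros((8, 8), dtype=bool)
target_mask[2:6, 2:6] = True              # 4x4 square
pred_mask = np.zeros((8, 8), dtype=bool)
pred_mask[2:6, 3:7] = True                # same square shifted one column

mask_iou = iou(pred_mask, target_mask)
mask_ahd = average_hausdorff(pred_mask, target_mask)
print(mask_iou, mask_ahd)
```

IoU measures area overlap, while AHD measures how far the two shapes lie from each other in pixels. This is why the abstract pairs the two to assess contour fidelity.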

Title: Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation

Authors: Yuhao He, Jinyu Tian, Haiwei Wu, Jianqing Li
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19425
Pdf URL: https://arxiv.org/pdf/2505.19425
Copy Paste: [[2505.19425]] Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation(https://arxiv.org/abs/2505.19425)
Keywords: generation
Abstract: The rapid advancement of diffusion models has enhanced their image inpainting and editing capabilities but also introduced significant societal risks. Adversaries can exploit user images from social media to generate misleading or harmful content. While adversarial perturbations can disrupt inpainting, global perturbation-based methods fail in mask-guided editing tasks due to spatial constraints. To address these challenges, we propose Structure Disruption Attack (SDA), a powerful protection framework for safeguarding sensitive image regions against inpainting-based editing. Building upon the contour-focused nature of self-attention mechanisms of diffusion models, SDA optimizes perturbations by disrupting queries in self-attention during the initial denoising step to destroy the contour generation process. This targeted interference directly disrupts the structural generation capability of diffusion models, effectively preventing them from producing coherent images. We validate our motivation through visualization techniques and extensive experiments on public datasets, demonstrating that SDA achieves state-of-the-art (SOTA) protection performance while maintaining strong robustness.
摘要：扩散模型的快速发展增强了他们的图像介绍和编辑功能，但也引入了重大的社会风险。对手可以从社交媒体中利用用户图像来产生误导性或有害内容。尽管对抗性扰动会破坏灌输，但由于空间约束，基于全局扰动的方法在掩模引导的编辑任务中失败。为了应对这些挑战，我们提出了结构破坏攻击（SDA），这是一个有力的保护框架，可保护敏感的图像区域免于基于内部的编辑。 SDA以自我注意力的自我注意机制的重点性质为基础，通过在最初的DeNoising步骤中破坏自我注意的查询来优化扰动，以破坏轮廓生成过程。该靶向干扰直接破坏了扩散模型的结构产生能力，从而有效地阻止了它们产生相干图像。我们通过可视化技术和公共数据集中的广泛实验来验证我们的动力，表明SDA在保持强大的鲁棒性的同时，实现了最先进的保护性能（SOTA）保护性能。

Title: Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Authors: Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19433
Pdf URL: https://arxiv.org/pdf/2505.19433
Copy Paste: [[2505.19433]] Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression(https://arxiv.org/abs/2505.19433)
Keywords: generation
Abstract: Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in this https URL.
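Summary (translated from Chinese): Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus only on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression affects the agentic abilities of LLMs. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression trade-offs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but reduces real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize the analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found at this https URL.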
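
To make the "4-bit quantization" setting in the abstract above concrete, here is a minimal sketch of symmetric round-to-nearest quantization on a short list of weights. It is only an illustration: it is not GPTQ or AWQ (the methods the benchmark evaluates), and the function names and sample weights are hypothetical.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to signed 4-bit codes in [-8, 7].

    Illustrative only; GPTQ and AWQ use more elaborate procedures.
    """
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto code 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.2, 0.03, 0.9]  # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
# For values inside the representable range, the rounding error
# is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as one of 16 integer codes plus a shared scale, which is where the memory saving comes from; the rounding error is what can degrade downstream accuracy.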

Title: Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation

Authors: Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu, Qi Chu, Yunfeng Diao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19459
Pdf URL: https://arxiv.org/pdf/2505.19459
Copy Paste: [[2505.19459]] Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation(https://arxiv.org/abs/2505.19459)
Keywords: generation, generative
Abstract: Joint Energy-based Models (JEMs), a class of hybrid generative-discriminative models, are well known for achieving both high classification accuracy and generative capability within a single model. However, their robustness still lags significantly behind that of classifiers trained with adversarial training (AT). Conversely, while AT is currently the most effective approach to improving a classifier's robustness, it typically sacrifices accuracy on clean data and lacks generative capability. The triple trade-off between classification accuracy, generative capability, and robustness raises a natural but rarely explored question: can a single model simultaneously achieve high classification accuracy, adversarial robustness, and generative performance? To address this question, we systematically analyze the energy distribution differences of clean, adversarial, and generated samples across various JEM variants and adversarially trained models. We observe that AT tends to reduce the energy gap between clean and adversarial samples, while JEMs reduce the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might unify the strengths of AT and JEMs, resolving their inherent trade-offs. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT) to jointly model the clean data distribution, the adversarial distribution, and the classifier by maximizing their joint probability. EB-JDAT is a general and flexible optimization method, compatible with various JEM variants. Extensive experimental results demonstrate that EB-JDAT not only maintains the near-original accuracy and generative capability of JEMs, but also significantly enhances robustness, even surpassing state-of-the-art AT methods.
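Summary (translated from Chinese): Joint Energy-based Models (JEMs) are a class of hybrid generative-discriminative models known for achieving high classification accuracy and generative capability within a single model. However, their robustness still lags significantly behind that of classifiers trained with adversarial training (AT). Conversely, although AT is currently the most effective way to improve a classifier's robustness, it usually sacrifices accuracy on clean data and lacks generative capability. The triple trade-off between classification accuracy, generative capability, and robustness raises a natural and rarely explored question: can a single model simultaneously achieve high classification accuracy, adversarial robustness, and generative performance? To answer it, we systematically analyze the energy distribution differences of clean, adversarial, and generated samples across various JEM variants and adversarially trained models. We observe that AT tends to reduce the energy gap between clean and adversarial samples, while JEMs reduce the gap between clean and synthetic samples. This suggests a key insight: if the energy distributions of all three data types can be aligned, the strengths of AT and JEMs might be unified, resolving their inherent trade-offs. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), which jointly models the clean data distribution, the adversarial distribution, and the classifier by maximizing their joint probability. EB-JDAT is a general and flexible optimization method compatible with various JEM variants. Extensive experiments show that EB-JDAT not only maintains the near-original accuracy and generative capability of JEMs but also significantly enhances robustness, even surpassing state-of-the-art AT methods.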
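
The "energy" analyzed in the abstract above can be illustrated with the standard JEM definition, in which a classifier's logits f(x) give the energy E(x) = -log Σ_y exp f(x)[y]. The sketch below computes that energy and a simple gap between two sets of samples. Measuring the gap as a difference of mean energies is an assumption made here for illustration, and the logit values are made up.

```python
import math


def energy(logits):
    """JEM-style energy of one input: E(x) = -logsumexp(f(x))."""
    m = max(logits)  # subtract the max for numerical stability
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def mean_energy_gap(logits_a, logits_b):
    """Absolute difference between the mean energies of two sample sets.

    Assumption: the abstract does not say how the gap is measured;
    a difference of means is used here purely as an illustration.
    """
    mean_a = sum(energy(l) for l in logits_a) / len(logits_a)
    mean_b = sum(energy(l) for l in logits_b) / len(logits_b)
    return abs(mean_a - mean_b)


# Hypothetical logits for clean and adversarial inputs.
clean = [[4.0, 0.5, -1.0], [3.5, 0.0, 0.2]]
adversarial = [[1.0, 0.9, 0.8], [0.5, 0.7, 0.6]]
gap = mean_energy_gap(clean, adversarial)
```

Confident predictions (one large logit) yield low energy, so a model that assigns similar energies to clean and adversarial inputs has a small gap in this sense.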

Title: Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory

Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, Miki Haseyama
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.19469
Pdf URL: https://arxiv.org/pdf/2505.19469
Copy Paste: [[2505.19469]] Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory(https://arxiv.org/abs/2505.19469)
Keywords: generative
Abstract: Dataset distillation enables the training of deep neural networks with comparable performance in significantly reduced time by compressing large datasets into small and representative ones. Although the introduction of generative models has made great achievements in this field, the distributions of their distilled datasets are not diverse enough to represent the original ones, leading to a decrease in downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness. The degree of alignment leads the diffusion model to generate more diverse datasets during the distillation process. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, proving its ability to tackle dataset distillation tasks.
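Summary (translated from Chinese): Dataset distillation compresses large datasets into small, representative ones, enabling deep neural networks to be trained to comparable performance in significantly less time. Although generative models have achieved a great deal in this field, the distributions of their distilled datasets are not diverse enough to represent the original data, which lowers downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distributions of the distilled and real datasets and to assess representativeness. The degree of alignment guides the diffusion model to generate more diverse datasets during distillation. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, demonstrating its ability to tackle dataset distillation tasks.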

Title: Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

Authors: Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
Subjects: cs.LG, cs.AI, cs.DC, cs.MA
Abstract URL: https://arxiv.org/abs/2505.19481
Pdf URL: https://arxiv.org/pdf/2505.19481
Copy Paste: [[2505.19481]] Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs(https://arxiv.org/abs/2505.19481)
Keywords: generation
Abstract: Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency-quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency-quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading. These results demonstrate the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.
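Summary (translated from Chinese): Large language models (LLMs) perform remarkably well across diverse reasoning and generation tasks and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses translate directly into higher rewards. Despite its importance, this latency-quality trade-off remains underexplored for LLM-based agents. In this work, we present the first systematic study of the trade-off in real-time decision-making tasks. To support the investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis shows that the optimal latency-quality balance varies by task and that sacrificing quality for lower latency can significantly improve downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading, which underscores the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.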
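
The idea of choosing a model size and quantization level under a latency constraint can be sketched as a simple lookup. This is not FPX itself: the abstract does not describe its selection rule, so the "highest quality within the budget" policy, the configuration table, and all numbers below are hypothetical.

```python
# Hypothetical configurations: latency and quality scores are invented.
CONFIGS = [
    {"name": "small-int4", "latency_ms": 40, "quality": 0.61},
    {"name": "small-fp16", "latency_ms": 70, "quality": 0.66},
    {"name": "large-int4", "latency_ms": 150, "quality": 0.78},
    {"name": "large-fp16", "latency_ms": 320, "quality": 0.83},
]


def pick_config(configs, latency_budget_ms):
    """Return the highest-quality configuration that fits the budget.

    If nothing fits, fall back to the fastest configuration.
    """
    feasible = [c for c in configs if c["latency_ms"] <= latency_budget_ms]
    if not feasible:
        return min(configs, key=lambda c: c["latency_ms"])
    return max(feasible, key=lambda c: c["quality"])


tight = pick_config(CONFIGS, latency_budget_ms=100)   # "small-fp16"
loose = pick_config(CONFIGS, latency_budget_ms=500)   # "large-fp16"
```

A tighter budget pushes the choice toward smaller or more aggressively quantized models, which is the latency-quality trade-off the abstract studies.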

Title: The Role of Video Generation in Enhancing Data-Limited Action Understanding

Authors: Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19495
Pdf URL: https://arxiv.org/pdf/2505.19495
Copy Paste: [[2505.19495]] The Role of Video Generation in Enhancing Data-Limited Action Understanding(https://arxiv.org/abs/2505.19495)
Keywords: generation
Abstract: Video action understanding tasks in real-world scenarios always suffer from data limitations. In this paper, we address the data-limited action understanding problem by tackling data scarcity directly. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We propose an information enhancement strategy and an uncertainty-based label smoothing strategy tailored to training on generated samples. Through quantitative and qualitative analysis, we observe that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy enriches the informative content of the generated samples in two respects: the environments and the characters. Furthermore, we observe that some low-quality generated samples might negatively affect model training. To address this, we devise the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
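Summary (translated from Chinese): Video action understanding tasks in real-world scenarios always suffer from data limitations. In this paper, we address the data-limited action understanding problem by tackling data scarcity. We propose a novel method that uses a text-to-video diffusion transformer to generate annotated data for model training. This paradigm allows realistic annotated data to be generated at unlimited scale without human intervention. We propose an information enhancement strategy and an uncertainty-based label smoothing strategy tailored to training on generated samples. Through quantitative and qualitative analysis, we observe that real samples generally contain richer information than generated ones. Based on this observation, the information enhancement strategy enriches the content of generated samples in two respects: the environments and the characters. We also observe that some low-quality generated samples may harm model training. To address this, we devise the uncertainty-based label smoothing strategy, which increases the smoothing applied to these samples and thereby reduces their impact. We demonstrate the effectiveness of the method on four datasets across five tasks and achieve state-of-the-art performance in zero-shot action recognition.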
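
The uncertainty-based label smoothing idea can be sketched with ordinary label smoothing whose strength grows with a per-sample uncertainty score. The abstract does not give the actual formula, so the linear mapping from uncertainty to smoothing strength, the `max_eps` parameter, and the function name are assumptions.

```python
def smooth_label(true_class, num_classes, uncertainty, max_eps=0.3):
    """Label smoothing whose strength scales with sample uncertainty.

    Assumption: eps = max_eps * uncertainty, with uncertainty in [0, 1].
    Standard label smoothing then mixes the one-hot target with a
    uniform distribution over the classes.
    """
    eps = max_eps * min(max(uncertainty, 0.0), 1.0)
    target = [eps / num_classes] * num_classes
    target[true_class] += 1.0 - eps
    return target


confident = smooth_label(true_class=0, num_classes=4, uncertainty=0.0)
doubtful = smooth_label(true_class=0, num_classes=4, uncertainty=1.0)
# confident keeps the one-hot target; doubtful spreads more mass
# over the other classes, so the sample pulls less on the model.
```

A low-quality generated sample with high uncertainty receives a softer target, which limits how strongly it can mislead training.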

Title: Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models

Authors: Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19498
Pdf URL: https://arxiv.org/pdf/2505.19498
Copy Paste: [[2505.19498]] Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models(https://arxiv.org/abs/2505.19498)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) usually generate texts that are coherent with the context but do not match the visual input. This hallucination issue hinders LVLMs' applicability in the real world. The key to solving hallucination in LVLMs is to make text generation rely more on the visual content. Most previous works choose to enhance or adjust the features or output of a specific modality (i.e., visual or textual) to alleviate hallucinations in LVLMs, which does not explicitly or systematically enhance visual reliance. In this paper, we comprehensively investigate, from a Bayesian perspective, the factors that may degrade visual reliance in LVLM text generation. Based on our observations, we propose to mitigate hallucination in LVLMs from three aspects. Firstly, we observe that not all visual tokens are informative for generating meaningful texts. We propose to evaluate and remove redundant visual tokens to avoid their disturbance. Secondly, an LVLM may encode inappropriate prior information, making it lean toward generating unexpected words. We propose a simple yet effective way to rectify the prior from a Bayesian perspective. Thirdly, we observe that starting from certain steps, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution that does not depend on any informative visual tokens at all. Thus, we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks, POPE, CHAIR, and MME, demonstrate that our method consistently mitigates the hallucination issue of LVLMs and performs favorably against previous state-of-the-art methods.
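Summary (translated from Chinese): Large Vision-Language Models (LVLMs) often generate text that is coherent with the context but does not match the visual input. This hallucination problem hinders the applicability of LVLMs in the real world. The key to solving it is to make text generation rely more on the visual content. Most previous work enhances or adjusts the features or output of one modality (visual or textual) to alleviate hallucination, which does not explicitly or systematically strengthen visual reliance. In this paper, we comprehensively investigate, from a Bayesian perspective, the factors that may weaken visual reliance in LVLM text generation. Based on our observations, we propose to mitigate hallucination from three aspects. First, not all visual tokens are informative for generating meaningful text, so we evaluate and remove redundant visual tokens to avoid their interference. Second, an LVLM may encode inappropriate prior information that biases it toward unexpected words, so we propose a simple yet effective way to rectify the prior from a Bayesian perspective. Third, from certain steps onward, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution that does not depend on any informative visual tokens, so we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks, POPE, CHAIR, and MME, show that our method consistently mitigates hallucination in LVLMs and compares favorably with previous state-of-the-art methods.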

Title: DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

Authors: Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19504
Pdf URL: https://arxiv.org/pdf/2505.19504
Copy Paste: [[2505.19504]] DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation(https://arxiv.org/abs/2505.19504)
Keywords: generation
Abstract: Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts at inference time. Our experiments show that, while the original performance of the teacher model is preserved or even improved, student models distilled from the defensively generated teacher outputs exhibit catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.
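Summary (translated from Chinese): Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation through knowledge distillation (KD). In practice, competitors can distill proprietary LLM capabilities simply by observing publicly accessible outputs, much like reverse-engineering a complex performance by observation alone. Existing protective methods such as watermarking only identify imitation after the fact, while other defenses assume that the student model mimics the teacher's internal logits, which makes them ineffective against distillation from observed output text alone. This paper addresses the challenge of actively protecting LLMs under the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users but are designed to mislead distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training anticipates and disrupts distillation attempts at inference time. Our experiments show that, while the teacher model's original performance is preserved or even improved, student models distilled from the defensively generated teacher outputs suffer catastrophically reduced performance, demonstrating the method's effectiveness as a practical safeguard against KD-based model imitation.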

Title: Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

Authors: Yifan Jia, Kailin Jiang, Yuyang Liang, Qihan Ren, Yi Xin, Rui Yang, Fenze Feng, Mingcai Chen, Hengyang Lu, Haozhe Wang, Xiaoye Qu, Dongrui Liu, Lizhen Cui, Yuntao Du
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19509
Pdf URL: https://arxiv.org/pdf/2505.19509
Copy Paste: [[2505.19509]] Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models(https://arxiv.org/abs/2505.19509)
Keywords: generation
Abstract: Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely underinvestigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at this https URL.
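Summary (translated from Chinese): Large Multimodal Models (LMMs) face notable challenges when they encounter multimodal knowledge conflicts, especially under retrieval-augmented generation (RAG) frameworks, where contextual information from external sources may contradict the model's internal parametric knowledge and lead to unreliable outputs. Existing benchmarks, however, fail to reflect such realistic conflict scenarios. Most focus only on intra-memory conflicts, while context-memory and inter-context conflicts remain largely underinvestigated. In addition, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation of conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench covers three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on model behavior analysis and conflict detection tasks. Our findings show that although current LMMs can recognize knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research on multimodal knowledge conflict and advance the development of multimodal RAG systems. The source code is available at this https URL.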

Title: Toward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions

Authors: Nakul Poudel, Zixin Yang, Kelly Merrell, Richard Simon, Cristian A. Linte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19518
Pdf URL: https://arxiv.org/pdf/2505.19518
Copy Paste: [[2505.19518]] Toward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions(https://arxiv.org/abs/2505.19518)
Keywords: generation
Abstract: Intra-operative data captured during image-guided surgery lacks sub-surface information, where key regions of interest, such as vessels and tumors, reside. Image-to-physical registration enables the fusion of pre-operative information and intra-operative data, typically represented as a point cloud. However, this registration process struggles due to partial visibility of the intra-operative point cloud. In this research, we propose a patient-specific point cloud completion approach to assist with the registration process. Specifically, we leverage VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud. The network is trained in a patient-specific manner, where simulated deformations from the pre-operative model are used to train the model. First, we conduct an in-depth analysis of VN-OccNet's rotation-equivariant property and its effectiveness in recovering complete surfaces from partial intra-operative surfaces. Next, we integrate the completed intra-operative surface into the Go-ICP registration algorithm to demonstrate its utility in improving initial rigid registration outcomes. Our results highlight the promise of this patient-specific completion approach in mitigating the challenges posed by partial intra-operative visibility. The rotation equivariant and surface generation capabilities of VN-OccNet hold strong promise for developing robust registration frameworks for variations of the intra-operative point cloud.
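Summary (translated from Chinese): Intra-operative data captured during image-guided surgery lacks sub-surface information, which is where key regions of interest such as vessels and tumors reside. Image-to-physical registration fuses pre-operative information with intra-operative data, typically represented as a point cloud. However, the registration process struggles because the intra-operative point cloud is only partially visible. In this research, we propose a patient-specific point cloud completion approach to assist registration. Specifically, we use VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud. The network is trained in a patient-specific manner, using simulated deformations of the pre-operative model. First, we analyze in depth VN-OccNet's rotation-equivariant property and its effectiveness in recovering complete surfaces from partial intra-operative surfaces. Next, we integrate the completed intra-operative surface into the Go-ICP registration algorithm to demonstrate its utility in improving initial rigid registration outcomes. Our results highlight the promise of this patient-specific completion approach for mitigating the challenges posed by partial intra-operative visibility. The rotation-equivariant and surface generation capabilities of VN-OccNet hold strong promise for developing registration frameworks that are robust to variations in the intra-operative point cloud.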

Title: Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift

Authors: Gihoon Kim, Hyungjin Park, Taesup Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19519
Pdf URL: https://arxiv.org/pdf/2505.19519
Copy Paste: [[2505.19519]] Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift(https://arxiv.org/abs/2505.19519)
Keywords: generative
Abstract: Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. This task presents a fundamental challenge, as the model must not only learn the new subject effectively but also preserve its ability to generate diverse and coherent outputs across a wide range of prompts. In other words, successful personalization requires integrating new concepts without forgetting previously learned generative capabilities. Forgetting denotes unintended distributional drift, where the model's output distribution deviates from that of the original pretrained model. In this paper, we provide an analysis of this issue and identify a mismatch between standard training objectives and the goals of personalization. To address this, we propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Our method provides improved control over distributional drift and performs well even in data-scarce scenarios. Experimental results demonstrate that our approach consistently outperforms existing personalization methods, achieving higher CLIP-T, CLIP-I, and DINO scores.
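Summary (translated from Chinese): Personalization with text-to-image diffusion models involves adapting a pretrained model to novel subjects from only a few example images. The task poses a fundamental challenge: the model must learn the new subject effectively while preserving its ability to generate diverse and coherent outputs across a wide range of prompts. In other words, successful personalization requires integrating new concepts without forgetting previously learned generative capabilities. Forgetting denotes unintended distributional drift, in which the model's output distribution deviates from that of the original pretrained model. In this paper, we analyze this issue and identify a mismatch between standard training objectives and the goals of personalization. To address it, we propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Our method provides improved control over distributional drift and performs well even in data-scarce scenarios. Experimental results show that our approach consistently outperforms existing personalization methods, achieving higher CLIP-T, CLIP-I, and DINO scores.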

Title: Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning

Authors: Jiyu Hu, Haijiang Zeng, Zhen Tian
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.19522
Pdf URL: https://arxiv.org/pdf/2505.19522
Copy Paste: [[2505.19522]] Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning(https://arxiv.org/abs/2505.19522)
Keywords: generation, generative
Abstract: Image classification, a core task in computer vision, relies on high-quality labelled data, which restricts the wide application of deep learning models in practical scenarios. To alleviate the shortage of labelled samples, semi-supervised learning has gradually become a research hotspot. In this paper, we construct a semi-supervised image classification model based on Generative Adversarial Networks (GANs). By introducing a collaborative training mechanism for the generator, discriminator, and classifier, we make effective use of limited labelled data together with a large amount of unlabelled data, improve image generation quality and classification accuracy, and provide an effective solution for image recognition in complex environments.
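Summary (translated from Chinese): In recent years, image classification, a core task in computer vision, has relied on high-quality labelled data, which limits the wide application of deep learning models in practical scenarios. To alleviate the shortage of labelled samples, semi-supervised learning has gradually become a research hotspot. In this paper, we build a semi-supervised image classification model based on Generative Adversarial Networks (GANs). By introducing a collaborative training mechanism for the generator, discriminator, and classifier, we make effective use of limited labelled data and a large amount of unlabelled data, improve image generation quality and classification accuracy, and provide an effective solution for image recognition in complex environments.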

Title: TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs

Authors: Juntong Wang, Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19535
Pdf URL: https://arxiv.org/pdf/2505.19535
Copy Paste: [[2505.19535]] TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs(https://arxiv.org/abs/2505.19535)
Keywords: quality assessment
Abstract: Text-driven video editing is rapidly advancing, yet its rigorous evaluation remains challenging due to the absence of dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated by 12 diverse models across 8 editing categories, and is annotated with 173,565 human subjective ratings along three crucial dimensions, i.e., edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first conduct a comprehensive evaluation of the 12 state-of-the-art editing models, revealing the strengths and weaknesses of current video techniques, and then benchmark existing VQA methods in the context of text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment. TDVE-Assessor integrates both spatial and temporal video features into a large language model (LLM) for rich contextual understanding to provide comprehensive quality assessment. Extensive experiments demonstrate that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state of the art. Both TDVE-DB and TDVE-Assessor will be released upon publication.
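Summary (translated from Chinese): Text-driven video editing is advancing rapidly, yet rigorous evaluation remains challenging because there are no dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated by 12 diverse models across 8 editing categories and is annotated with 173,565 human subjective ratings along three crucial dimensions: edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first comprehensively evaluate the 12 state-of-the-art editing models, revealing the strengths and weaknesses of current video techniques, and then benchmark existing VQA methods for text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model designed specifically for text-driven video editing assessment. TDVE-Assessor integrates spatial and temporal video features into a large language model (LLM) for rich contextual understanding and comprehensive quality assessment. Extensive experiments show that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state of the art. Both TDVE-DB and TDVE-Assessor will be released upon publication.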

Title: On scalable and efficient training of diffusion samplers

Authors: Minkyu Kim, Kiyoung Seong, Dongyeop Woo, Sungsoo Ahn, Minsu Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19552
Pdf URL: https://arxiv.org/pdf/2505.19552
Copy Paste: [[2505.19552]] On scalable and efficient training of diffusion samplers(https://arxiv.org/abs/2505.19552)
Keywords: generation
Abstract: We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale in more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes a powerful classical sampling method with the diffusion sampler. Specifically, we use Markov chain Monte Carlo (MCMC) samplers as a Searcher to collect off-policy samples, adding a novelty-based auxiliary energy function that encourages exploration of modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, thereby expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, i.e., the preference of samplers for early experience during training, as the main cause of mode collapse during training, and introduce a periodic re-initialization trick to resolve this issue. Our method significantly improves sample efficiency on standard benchmarks for diffusion samplers and also excels at higher-dimensional problems and real-world molecular conformer generation.
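Summary (translated from Chinese): We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale to more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes a powerful classical sampling method with the diffusion sampler. Specifically, we use Markov chain Monte Carlo (MCMC) samplers as a Searcher to collect off-policy samples, with a novelty-based auxiliary energy function that encourages exploration of modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, the preference of samplers for early experience during training, as the main cause of mode collapse, and introduce a periodic re-initialization trick to resolve it. Our method significantly improves sample efficiency on standard diffusion sampler benchmarks and also excels at higher-dimensional problems and real-world molecular conformer generation.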
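
As background for the MCMC Searcher mentioned above, the sketch below shows a basic random-walk Metropolis-Hastings sampler for a one-dimensional unnormalized density exp(-E(x)). It is a generic textbook sampler, not the paper's Searcher: there is no novelty-based auxiliary energy here, and the function name, step size, and quadratic test energy are assumptions.

```python
import math
import random


def metropolis_hastings(energy, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings targeting p(x) proportional to exp(-energy(x))."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        e_prop = energy(proposal)
        # Accept with probability min(1, exp(e - e_prop)); only the
        # energy difference matters, so no normalizing constant is needed.
        if e_prop <= e or rng.random() < math.exp(e - e_prop):
            x, e = proposal, e_prop
        samples.append(x)
    return samples


# Energy of a standard normal distribution, used only as a sanity check.
samples = metropolis_hastings(lambda x: 0.5 * x * x, x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because the sampler needs only energy evaluations, samples collected this way can serve as off-policy data for a learned sampler, which is the role the abstract assigns to the Searcher.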

Title: Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation

Authors: Jiongchao Jin, Shengchu Zhao, Dajun Chen, Wei Jiang, Yong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19554
Pdf URL: https://arxiv.org/pdf/2505.19554
Copy Paste: [[2505.19554]] Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation(https://arxiv.org/abs/2505.19554)
Keywords: generation, generative
Abstract: The time cost and complexity of manual layout design make automated layout generation a critical task, especially for multiple applications across different mobile devices. Existing graph-based layout generation approaches suffer from limited generative capability, often resulting in unreasonable and incompatible outputs. Meanwhile, vision-based generative models tend to overlook the original structural information, leading to component intersections and overlaps. To address these challenges, we propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. This novel pipeline uses graph features as hierarchical prior knowledge, replacing the traditional Vision Transformer (ViT) module in multimodal large language models (MLLMs) to predict full layout information for the first time. Moreover, the intermediate graph matrix used as input for the LLM is human-editable, enabling progressive, human-centric design generation. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively, using mean Intersection over Union (mIoU), and qualitatively, through a crowdsourced user study. Additionally, sampling on relational features ensures diverse layout generation, further enhancing the adaptability and creativity of the proposed approach.
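Summary (translated from Chinese): The time cost and complexity of manual layout design make automated layout generation a critical task, especially for multiple applications across different mobile devices. Existing graph-based layout generation approaches have limited generative capability and often produce unreasonable and incompatible outputs. Meanwhile, vision-based generative models tend to overlook the original structural information, leading to component intersections and overlaps. To address these challenges, we propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. This novel pipeline uses graph features as hierarchical prior knowledge, replacing the traditional Vision Transformer (ViT) module in multimodal large language models (MLLMs) to predict full layout information for the first time. In addition, the intermediate graph matrix used as input to the LLM is human-editable, enabling progressive, human-centric design generation. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively, using mean Intersection over Union (mIoU), and qualitatively, through a crowdsourced user study. Sampling on relational features also ensures diverse layout generation, further enhancing the adaptability and creativity of the approach.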
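
The mIoU metric named in the abstract above can be sketched for axis-aligned layout boxes as follows. The box format `(x1, y1, x2, y2)` and the pairing of predicted and ground-truth components by list position are assumptions; the abstract does not describe the matching protocol.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted, ground_truth):
    """Average IoU over components paired by position (an assumption)."""
    pairs = list(zip(predicted, ground_truth))
    return sum(box_iou(p, g) for p, g in pairs) / len(pairs)


# Hypothetical layout with two components.
predicted = [(0, 0, 2, 2), (5, 5, 6, 6)]
ground_truth = [(0, 0, 2, 2), (8, 8, 9, 9)]
score = mean_iou(predicted, ground_truth)  # one exact match, one miss: 0.5
```

A score of 1.0 means every predicted component coincides with its reference box, while overlapping or misplaced components pull the mean down.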

Title: What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

Authors: Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19569
Pdf URL: https://arxiv.org/pdf/2505.19569
Copy Paste: [[2505.19569]] What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation(https://arxiv.org/abs/2505.19569)
Keywords: generative
Abstract: Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150. It further attains 56.2, 28.2, 15.4, 59.2, 18.7, and 95.8 mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.
摘要：开放的词汇图像分割解决了通过利用视觉语言对齐方式在推理时间识别动态可调的，预定义的新类别的挑战。但是，现有的范例通常执行类不足的区域分割，然后进行类别匹配，这偏离了人类视觉系统基于语义概念识别对象的过程，从而导致区域细分和目标概念之间的不良对准。为了弥合这一差距，我们为开放词汇图像分割提出了一个新颖的认知风格框架，该框架模仿了人类的视觉识别过程：首先形成对对象的概念理解，然后才能感知其空间范围。该框架由三个核心组成部分组成：（1）一种生成视觉语言模型（G-VLM），该模型通过生成对象概念来为区域分割提供语义指导来模仿人类认知。（2）一种概念感知的视觉增强器模块，该模块将文本概念特征与全局视觉表示融合在一起，从而基于目标概念实现自适应视觉感知。（3）由认知启发的解码器将局部实例特征与G-VLM提供的语义提示集成在一起，从而可以在相关类别的子集上进行选择性分类。广泛的实验表明，我们的框架可取得重大改进，达到27.2美元的PQ，17.0美元的地图和A-150 $ 35.3 $ MIOU。它进一步达到$ 56.2 $，$ 28.2 $，$ 15.4 $，$ 59.2 $，$ 18.7 $和$ 95.8 $ MIOU，MAPILLARY VISTAS，A-847，A-847，PC-59，PC-59，PC-459和PAS-20。此外，我们的框架支持无词汇的细分，在识别看不见的类别方面具有增强的灵活性。代码将是公开的。

Title: VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models

Authors: Hu Xiaobin, Liang Yujie, Luo Donghao, Peng Xu, Zhang Jiangning, Zhu Junwei, Wang Chengjie, Fu Yanwei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19571
Pdf URL: https://arxiv.org/pdf/2505.19571
Copy Paste: [[2505.19571]] VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models(https://arxiv.org/abs/2505.19571)
Keywords: generation
Abstract: While virtual try-on has achieved significant progress, evaluating these models in real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) most existing test sets are limited to indoor scenarios, lacking complexity for real-world evaluation; and (3) an ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (e.g., overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics of corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To advance the field of virtual try-on towards challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
摘要：尽管虚拟试验取得了重大进展，但将这些模型评估到现实世界情景仍然是一个挑战。全面的基准是至关重要的三个关键原因：（1）当前的指标反映人类的看法不足，尤其是在未配对的尝试环境中；（2）大多数现有的测试集仅限于室内场景，缺乏现实世界评估的复杂性；（3）理想的系统应指导虚拟尝试生成中的未来进步。为了满足这些需求，我们介绍了VTBENCH，这是一个层次基准套件，系统地将虚拟图像试用为层次结构，分开的尺寸，每个尺寸都配备了量身定制的测试集和评估标准。 VTBENCH具有三个关键优势：1）多维评估框架：基准包括虚拟试验生成的五个关键维度（例如，整体图像质量，纹理保存，复杂的背景一致性，跨类别大小的适应性适应性和手工概括性处理）。相应测试集的粒状评估指标查明模型的功能和局限性，具有挑战性的场景。2）人类对齐：为每个测试组提供人类偏好注释，以确保基准测试在所有评估维度的知觉质量的一致性。（3）有价值的见解：除了标准室内设置之外，我们还分析了跨维度的模型性能变化，并研究了室内和现实世界中的试验场景之间的差异。为了促进虚拟尝试的领域，以挑战现实世界的情况，将开源VTBench，包括所有测试集，评估协议，生成的结果和人类注释。

Title: Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

Authors: Kaiqing Lin, Zhiyuan Yan, Ke-Yue Zhang, Li Hao, Yue Zhou, Yuzhen Lin, Weixiang Li, Taiping Yao, Shouhong Ding, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19582
Pdf URL: https://arxiv.org/pdf/2505.19582
Copy Paste: [[2505.19582]] Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes(https://arxiv.org/abs/2505.19582)
Keywords: generation
Abstract: Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose VIPGuard, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, we fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via the MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works: traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we built a comprehensive identity-aware benchmark called VIPBench for personalized deepfake detection, involving the latest 7 face-swapping and 7 entire-face synthesis techniques for generation.
摘要：在数字时代，确保个人身份抵抗深击攻击的越来越重要，尤其是对于易于访问面孔并经常针对性的名人和政治人物而言。大多数现有的DeepFake检测方法都集中在通用场景上，并且通常会忽略已知面部身份的有价值的先验知识，例如“ VIP个人”，其真实面部数据已经可用。在本文中，我们提出了\ textbf {Vipguard}，这是一个统一的多模式框架，旨在捕获给定身份的细粒度和全面的面部表征，将它们与潜在的假或外观外观的面孔进行比较，并在这些比较中进行理由，以做出准确和可解释的预测。具体而言，我们的框架包括三个主要阶段。首先，微调多模式大语言模型（MLLM），以学习详细的结构面部属性。其次，我们执行身份级别的歧视性学习，以使模型能够区分高度相似的面孔之间的细微差异，包括真实和假变化。最后，我们介绍了特定于用户的自定义，在该自定义上，我们在其中对目标面部身份的独特特征进行建模，并通过MLLM执行语义推理，以实现个性化和可解释的深层捕获检测。我们的框架比以前的探测工作具有明显的优势，在该作品中，传统探测器主要依赖于低级的视觉提示，并且没有提供人为理解的解释，而其他基于MLLM的模型通常缺乏对特定面部身份的详细理解。为了促进对我们方法的评估，我们建立了一个全面的身份感知的基准，称为\ textbf {vipbench}，以进行个性化的深击检测，涉及最新的7个面部交换和7种整个面部合成技术。

Title: Learning to Reason without External Rewards

Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19590
Pdf URL: https://arxiv.org/pdf/2505.19590
Copy Paste: [[2505.19590]] Learning to Reason without External Rewards(https://arxiv.org/abs/2505.19590)
Keywords: generation
Abstract: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at this https URL
摘要：通过可验证的奖励（RLVR）培训大型语言模型（LLMS）通过加强学习进行复杂的推理，但受到依赖昂贵的特定于领域的监督的限制。我们从内部反馈（RLIF）中探索强化学习，该框架使LLM可以从没有外部奖励或标记数据的固有信号中学习。我们提出了Intuitor，这是一种使用模型自信的RLIF方法，称为自我确定性，是其唯一的奖励信号。 Intuitor用自我确定得分代替了小组相对策略优化（GRPO）中的外部奖励，从而实现了完全无监督的学习。实验表明，直觉与GRPO在数学基准测试上的性能相匹配，同时在不需要金解决方案或测试案例的情况下实现了对诸如代码生成之类的跨域任务的卓越概括。我们的发现表明，固有的模型信号可以推动跨域的有效学习，为自主AI系统提供了可扩展的RLVR替代方案，在该系统中，可验证的奖励无法获得。代码可在此HTTPS URL上找到
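Illustrative sketch (not from the paper): the idea of replacing an external reward with the model's own confidence can be pictured as scoring each sampled response by a confidence measure and then standardizing the scores within the group, as in group-relative advantage estimation. The confidence measure below (mean KL divergence from a uniform distribution to each next-token distribution), the function names, and the toy distributions are all assumptions made for illustration; the abstract does not define self-certainty.

```python
import math

def self_certainty(token_dists):
    """Mean KL(U || p) over positions (assumed definition, for illustration).

    token_dists: list of next-token distributions, each a list of
    strictly positive probabilities that sum to 1.
    Higher values mean more peaked, i.e. more confident, predictions.
    """
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(U || p) = sum_j (1/v) * log((1/v) / p_j)
        total += sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p)
    return total / len(token_dists)

def group_relative_advantages(scores, eps=1e-8):
    """Standardize scores within one group of sampled responses."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var)
    return [(s - mean) / (std + eps) for s in scores]

# Two hypothetical responses over a 3-token vocabulary.
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

scores = [self_certainty(confident), self_certainty(uncertain)]
advantages = group_relative_advantages(scores)
```

In this toy setup the more peaked response receives the positive advantage, so a policy-gradient update would reinforce it without any labeled data or test cases.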

Title: Preference Optimization by Estimating the Ratio of the Data Distribution

Authors: Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19601
Pdf URL: https://arxiv.org/pdf/2505.19601
Copy Paste: [[2505.19601]] Preference Optimization by Estimating the Ratio of the Data Distribution(https://arxiv.org/abs/2505.19601)
Keywords: generation
Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2.
摘要：直接偏好优化（DPO）被广泛用作一种简单且稳定的方法，用于将大型语言模型（LLMS）与人类偏好保持一致。本文研究了广义DPO损失，从而使策略模型从似然比估计的角度匹配目标策略。目标策略的比率在不依赖奖励模型或分区功能的情况下提供了对策略分布的独特标识。这允许普遍的损失保留简单性和理论保证，例如$ f $ -po之类的工作未能同时实现。我们提出了Bregman偏好优化（BPO），这是一个比率匹配的广义框架，提供了实现目标策略最佳性的一系列客观功能。 BPO将DPO作为一种特殊情况，并为所有实例提供可拖动的表格，从而允许使用几行代码实现。我们进一步开发了缩放BASU的功率差异（SBA），这是一种可用于BPO实例的梯度缩放方法。 BPO框架补充了其他DPO变体，适用于这些变体定义的目标策略。在实验中，与其他概率损失扩展不同，例如$ f $ -dpo或$ f $ -po，它们在发电忠诚度和多样性之间表现出了权衡，而BPO的实例则提高了与DPO相比的获胜率和熵。当应用于Llama-3-Instruct-8B时，BPO在Llama-3-8B骨架之间达到了最先进的性能，在Alpacaeval2上具有55.9％的长度控制率。

Title: SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows

Authors: Janik Kreit, Dominic Schuh, Kim A. Nicoli, Lena Funcke
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2505.19619
Pdf URL: https://arxiv.org/pdf/2505.19619
Copy Paste: [[2505.19619]] SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows(https://arxiv.org/abs/2505.19619)
Keywords: generative
Abstract: Deep generative models have recently garnered significant attention across various fields, from physics to chemistry, where sampling from unnormalized Boltzmann-like distributions represents a fundamental challenge. In particular, autoregressive models and normalizing flows have become prominent due to their appealing ability to yield closed-form probability densities. Moreover, it is well established that incorporating prior knowledge, such as symmetries, into deep neural networks can substantially improve training performance. In this context, recent advances have focused on developing symmetry-equivariant generative models, achieving remarkable results. Building upon these foundations, this paper introduces Symmetry-Enforcing Stochastic Modulation (SESaMo). Similar to equivariant normalizing flows, SESaMo enables the incorporation of inductive biases (e.g., symmetries) into normalizing flows through a novel technique called stochastic modulation. This approach enhances the flexibility of the generative model, allowing it to effectively learn a variety of exact and broken symmetries. Our numerical experiments benchmark SESaMo in different scenarios, including an 8-Gaussian mixture model and physically relevant field theories, such as the $\phi^4$ theory and the Hubbard model.
摘要：深层生成模型最近在从物理学到化学的各个领域都引起了人们的重大关注，在这些领域中，来自非标准化玻尔兹曼的分布的采样代表了一个基本挑战。特别是，由于它们具有封闭形式概率密度的吸引力，自回旋模型和正常流量变得很突出。此外，良好的是，将先验知识（例如对称性）纳入深度神经网络可以大大改善训练性能。在这种情况下，最近的进步集中在开发对称性等值生成模型上，取得了显着的结果。本文以这些基础为基础，引入了对称性的随机调制（SESAMO）。类似于归一化流量，sesamo可以通过一种称为随机调制的新技术将电感偏差（例如对称性）融合到标准化流中。这种方法增强了生成模型的灵活性，从而有效地学习了各种精确和破裂的对称性。在不同情况下，我们的数值实验基准了sesamo，包括8高斯混合模型和物理相关的现场理论，例如$ \ phi^4 $理论和哈伯德模型。

Title: HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

Authors: Ming Meng, Qi Dong, Jiajie Li, Zhe Zhu, Xingyu Wang, Zhaoxin Fan, Wei Zhao, Wenjun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19638
Pdf URL: https://arxiv.org/pdf/2505.19638
Copy Paste: [[2505.19638]] HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment(https://arxiv.org/abs/2505.19638)
Keywords: generation
Abstract: Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.
摘要：虚拟的尝试技术在时尚和零售行业中变得越来越重要，从而使高保真服装图像的产生无缝地适应人类模型。尽管现有方法取得了显着的进步，但在维持不同姿势的一致性方面，它们仍然面临重大挑战。具体而言，几何变形导致缺乏空间一致性，服装结构的不匹配和跨姿势的质地导致语义上的不一致以及细粒细节的丢失或失真会减少视觉保真度。为了应对这些挑战，我们提出了HF-VTON，这是一个新颖的框架，可确保在各种姿势之间进行高保真虚拟的尝试性能。 HF-VTON由三个关键模块组成：（1）具有外观的经线对准模块（APWAM），该模块（APWAM）将衣服与人体姿势保持一致，解决几何变形并确保空间一致性；（2）语义表示和理解模块（SRCM），该模块捕获细粒的服装属性和多置数据以增强语义表示，维持结构，纹理和模式一致性；（3）多模式的先前引导的外观生成模块（MPAGM），该模块（MPAGM）整合了从预训练的模型中的多模式特征和先验知识，以优化外观生成，以确保语义和几何一致性。此外，为了克服现有基准的数据限制，我们介绍了Samp-Vtons数据集，其中包含多姿势对和丰富的文本注释，以进行更全面的评估。实验结果表明，HF-VTON在Viton-HD和Samp-Vtons上的最先进方法优于视觉忠诚度，语义一致性和细节保存方面的表现。

Title: Energy-based generator matching: A neural sampler for general state space

Authors: Dongyeop Woo, Minsu Kim, Minkyu Kim, Kiyoung Seong, Sungsoo Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19646
Pdf URL: https://arxiv.org/pdf/2505.19646
Copy Paste: [[2505.19646]] Energy-based generator matching: A neural sampler for general state space(https://arxiv.org/abs/2505.19646)
Keywords: generative
Abstract: We propose Energy-based generator matching (EGM), a modality-agnostic approach to training generative models from energy functions in the absence of data. Extending the recently proposed generator matching, EGM enables training of arbitrary continuous-time Markov processes, e.g., diffusion, flow, and jump, and can generate continuous data, discrete data, and data mixing the two modalities. To this end, we propose estimating the generator matching loss using self-normalized importance sampling with an additional bootstrapping trick to reduce variance in the importance weight. We validate EGM on both discrete and multimodal tasks up to 100 and 20 dimensions, respectively.
摘要：我们提出了基于能量的生成器匹配（EGM），这是一种在没有数据的情况下从能量函数中训练生成模型的一种模态反应方法。 EGM扩展了最近提出的发电机匹配，可以训练任意连续的马尔可夫过程，例如扩散，流和跳跃，并可以从连续，离散和两种模态的混合物中生成数据。为此，我们提出了使用自相应的重要性抽样估算发电机匹配损失，并具有额外的自举式技巧，以减少重要性权重的差异。我们在离散和多模式任务上分别验证了EGM，分别为100和20维度。
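Illustrative sketch (not from the paper): self-normalized importance sampling, which the abstract uses to estimate its loss, is shown here in its generic form, estimating an expectation under a target known only through an energy function. The toy energy, the Gaussian proposal, and all names are assumptions; the generator matching loss and the bootstrapping trick are not reproduced.

```python
import math
import random

def snis_expectation(f, energy, sample_q, log_q, n, rng):
    """Estimate E_p[f(x)] for p(x) proportional to exp(-energy(x)).

    Samples come from a proposal q; the weights exp(-energy(x)) / q(x)
    are normalized by their own sum, so the normalizing constant of p
    is never needed.
    """
    xs = [sample_q(rng) for _ in range(n)]
    log_w = [-energy(x) - log_q(x) for x in xs]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return sum(wi * f(x) for wi, x in zip(w, xs)) / z

def toy_energy(x):
    # Unnormalized negative log-density of a Gaussian with mean 1, std 1.
    return 0.5 * (x - 1.0) ** 2

def sample_proposal(rng):
    # Wider Gaussian proposal: mean 0, std 2.
    return rng.gauss(0.0, 2.0)

def log_proposal(x):
    return -0.5 * (x / 2.0) ** 2 - math.log(2.0 * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
estimate = snis_expectation(lambda x: x, toy_energy, sample_proposal,
                            log_proposal, 20000, rng)
```

Here the estimate approximates the target mean of 1. The estimator is biased for finite sample sizes, and its variance depends on how well the proposal covers the target, which is the quantity a variance-reduction trick would aim to control.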

Title: ReDDiT: Rehashing Noise for Discrete Visual Generation

Authors: Tianren Ma, Xiaosong Zhang, Boyu Yang, Junlan Feng, Qixiang Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19656
Pdf URL: https://arxiv.org/pdf/2505.19656
Copy Paste: [[2505.19656]] ReDDiT: Rehashing Noise for Discrete Visual Generation(https://arxiv.org/abs/2505.19656)
Keywords: generation, generative
Abstract: Discrete diffusion models are gaining traction in the visual generative area for their efficiency and compatibility. However, pioneering attempts still fall behind their continuous counterparts, which we attribute to the noise (absorbing state) design and sampling heuristics. In this study, we propose a rehashing noise framework for discrete diffusion transformers, termed ReDDiT, to extend absorbing states and improve the expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables can traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees the diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline (reducing gFID from 6.18 to 1.61) and is on par with its continuous counterparts while being more efficient.
摘要：离散扩散模型正在在视觉生成区域中获得其效率和兼容性的吸引力。但是，开创性的尝试仍然落后于连续的同行，我们将其归因于噪声（吸收状态）设计和采样启发式方法。在这项研究中，我们提出了称为Reddit的离散扩散变压器的重新噪声框架，以扩展吸收状态并提高离散扩散模型的表达能力。 Reddit丰富了潜在变量在训练过程中可以通过随机多数指数损坏遍历的潜在路径。衍生的Rehash采样器逆转了随机吸收路径，可确保生成过程的多样性和差异。这些重新策略可提高更一致和竞争性的发电质量，从而减轻对随机性进行重新调整的需求。实验表明，REDDIT显着胜过基线（将GFID从6.18降低到1.61），并且与效率更高的连续同行相当。

Title: Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding

Authors: Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19668
Pdf URL: https://arxiv.org/pdf/2505.19668
Copy Paste: [[2505.19668]] Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding(https://arxiv.org/abs/2505.19668)
Keywords: super-resolution
Abstract: Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Recent methods have increasingly adopted Transformers over convolutional neural networks (CNNs) in super-resolution tasks, due to their superior ability to capture both local and global context. However, most existing approaches still rely on fixed and narrow attention windows that restrict the perception of features beyond the local field. This limitation hampers alignment and feature aggregation, both of which are crucial for high-quality super-resolution. To address these limitations, we propose a novel feature extractor that incorporates two newly designed attention mechanisms: overlapping cross-window attention and cross-frame attention, enabling more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.
摘要：通过从多个空间移动的帧中汇总子像素信息，多图像超分辨率（MISR）可以比单像超级分辨率（SISR）获得更高的图像质量。在MISR任务中，由于其广泛的应用范围，爆发超分辨率（BUSTSR）引起了重大关注。最近的方法越来越多地采用了超分辨率任务中卷积神经网络（CNN）的变压器，因为它们既可以捕获本地和全球环境。但是，大多数现有方法仍然依赖固定和狭窄的注意力窗口，这些窗口限制了本地领域以外的特征的感知。这种限制阻碍了对齐和特征聚集，这两个对高质量的超分辨率至关重要。为了解决这些局限性，我们提出了一种新型的特征提取器，该提取器结合了两种新设计的注意机制：重叠的交叉窗口注意力和跨框架的注意力，从而更加精确，有效地提取了跨多个帧的子像素信息。此外，我们引入了一个具有跨帧注意机制的多扫描状态空间模块，以增强特征聚合。对合成和实际基准测试的广泛实验证明了我们方法的优势。对ISO 12233分辨率测试图表的其他评估进一步证实了其增强的超分辨率性能。

Title: Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling

Authors: Haiyang Sun, Shujie Hu, Shujie Liu, Lingwei Meng, Hui Wang, Bing Han, Yifan Yang, Yanqing Liu, Sheng Zhao, Yan Lu, Yanmin Qian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19669
Pdf URL: https://arxiv.org/pdf/2505.19669
Copy Paste: [[2505.19669]] Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling(https://arxiv.org/abs/2505.19669)
Keywords: generation
Abstract: Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces high processing latency. To address this issue, we propose SMLLE, a streaming framework for generating high-quality speech frame-by-frame. SMLLE employs a Transducer to convert text into semantic tokens in real time while simultaneously obtaining duration alignment information. The combined outputs are then fed into a fully autoregressive (AR) streaming model to reconstruct mel-spectrograms. To further stabilize the generation process, we design a Delete <Bos> Mechanism that allows the AR model to access future text while introducing as little delay as possible. Experimental results suggest that SMLLE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems. Samples are available on this https URL.
摘要：零拍传输文本到语音是人类计算机互动中的重要研究主题。现有方法主要使用lookahead机制，依靠未来的文本来实现自然流语音综合，从而引入了高处理延迟。为了解决这个问题，我们提出了SMLLE，这是一个流式传输框架，用于生成高质量的逐帧框架。 SMLLE使用换能器实时将文本转换为语义令牌，同时获得持续时间对齐信息。然后将组合的输出馈入完全自动回旋（AR）流模型以重建MEL光谱图。为了进一步稳定生成过程，我们设计了一种删除机制，该机制允许AR模型访问将来的文本，以尽可能最小的延迟。实验结果表明，SMLLE优于当前流tts方法，并且在句子级TTS系统上实现了可比的性能。该HTTPS URL上可用样品。

Title: Graph Guided Diffusion: Unified Guidance for Conditional Graph Generation

Authors: Victor M. Tenorio, Nicolas Zilberstein, Santiago Segarra, Antonio G. Marques
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19685
Pdf URL: https://arxiv.org/pdf/2505.19685
Copy Paste: [[2505.19685]] Graph Guided Diffusion: Unified Guidance for Conditional Graph Generation(https://arxiv.org/abs/2505.19685)
Keywords: generation, generative
Abstract: Diffusion models have emerged as powerful generative models for graph generation, yet their use for conditional graph generation remains a fundamental challenge. In particular, guiding diffusion models on graphs under arbitrary reward signals is difficult: gradient-based methods, while powerful, are often unsuitable due to the discrete and combinatorial nature of graphs, and non-differentiable rewards further complicate gradient-based guidance. We propose Graph Guided Diffusion (GGDiff), a novel guidance framework that interprets conditional diffusion on graphs as a stochastic control problem to address this challenge. GGDiff unifies multiple guidance strategies, including gradient-based guidance (for differentiable rewards), control-based guidance (using control signals from forward reward evaluations), and zero-order approximations (bridging gradient-based and gradient-free optimization). This comprehensive, plug-and-play framework enables zero-shot guidance of pre-trained diffusion models under both differentiable and non-differentiable reward functions, adapting well-established guidance techniques to graph generation, a direction that remains largely unexplored. Our formulation balances computational efficiency, reward alignment, and sample quality, enabling practical conditional generation across diverse reward types. We demonstrate the efficacy of GGDiff in various tasks, including constraints on graph motifs, fairness, and link prediction, achieving superior alignment with target rewards while maintaining diversity and fidelity.
摘要：扩散模型已成为图形生成的强大生成模型，但它们用于有条件的图形生成仍然是一个基本挑战。特别是，在任意奖励信号下图表上的指导扩散模型很困难：基于梯度的方法虽然强大，但由于图的离散和组合性质，通常不适合使用，而非可不同的奖励则使基于梯度的指导更加复杂。我们提出了图形引导扩散（GGDIFF），这是一个新型的指导框架，将图表上的条件扩散解释为应对这一挑战的随机控制问题。 GGDIFF统一了多种指导策略，包括基于梯度的指导（用于可区分的奖励），基于控制的指导（使用远期奖励评估中的控制信号）和零级近似值（基于BRIDGING基于梯度和无梯度的优化）。这个全面的，插件的框架可以在可区分和非差异奖励功能下对预训练的扩散模型进行零拍导的指导，从而将良好的指导技术调整到图形生成，这在很大程度上没有探索。我们的公式平衡了计算效率，奖励对准和样本质量，从而使各种奖励类型的有条件产生能够实用。我们证明了GGDIFF在各种任务中的功效，包括对图主题的限制，公平性和链接预测，在保持多样性和忠诚度的同时，实现了与目标奖励的优越对齐。
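Illustrative sketch (not from the paper): a zero-order approximation estimates a guidance direction from reward evaluations alone, which is what makes it usable when the reward is not differentiable. The generic two-point Gaussian-smoothing estimator below works on a continuous toy vector rather than a graph; the reward, the step size `mu`, and the sample count are assumptions, and this is not GGDiff's actual guidance rule.

```python
import random

def zeroth_order_gradient(reward, x, mu=1e-3, n_samples=5000, seed=0):
    """Two-point zeroth-order gradient estimate of reward at x.

    Uses only forward evaluations of reward: for random Gaussian
    directions u, average the central finite difference along u
    multiplied by u.
    """
    rng = random.Random(seed)
    d = len(x)
    grad = [0.0] * d
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (reward(x_plus) - reward(x_minus)) / (2.0 * mu)
        for i in range(d):
            grad[i] += slope * u[i] / n_samples
    return grad

def toy_reward(x):
    # Smooth toy reward with known gradient (2, -4) at the origin.
    return -(x[0] - 1.0) ** 2 - 2.0 * (x[1] + 1.0) ** 2

estimate = zeroth_order_gradient(toy_reward, [0.0, 0.0])
```

The estimate is noisy but unbiased for this quadratic reward, so averaging more directions tightens it at the cost of more reward evaluations.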

Title: DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving

Authors: Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19692
Pdf URL: https://arxiv.org/pdf/2505.19692
Copy Paste: [[2505.19692]] DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving(https://arxiv.org/abs/2505.19692)
Keywords: generation, generative
Abstract: Camera sensor simulation plays a critical role in autonomous driving (AD), e.g., in evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework, DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, preventing the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates present in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines and propose an information-preserving control mechanism. This control mechanism not only improves conditional controllability but can also be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With the above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability at both the spatial level (camera parameter variations) and the temporal level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be available at this https URL to facilitate future research.
摘要：相机传感器模拟是自动驾驶（AD）的关键作用，例如评估基于视觉的AD算法。尽管现有的方法已利用生成模型进行可控的图像/视频生成，但它们仍然限制为具有固定相机视点和视频频率的多视视频序列，从而大大限制了其下游应用程序。为了解决这个问题，我们提出了一个可推广的相机模拟框架DriveCamsim，其核心创新在于提议的显式摄像机建模（ECM）机制。 ECM不是通过香草的关注来进行隐式相互作用，而是在多视图和多帧维度上建立明确的像素对应关系，从而将模型从过度拟合到特定的摄像机配置（内在/外在参数，视图数量，视图数）和训练数据中显示的时间抽样率。对于可控制的生成，我们确定了现有条件编码和注射管道中固有的信息损失问题，提出了提供信息的控制机制。这种控制机制不仅可以提高条件可控性，而且可以扩展为身份意识，以提高前景对象渲染的时间一致性。借助上述设计，我们的模型在视觉质量和可控性方面都表现出卓越的性能，以及空间级别（相机参数变化）和时间级别（视频帧速率变化）之间的概括能力，从而实现了针对各种应用程序场景量身定制的灵活用户可启动的相机模拟。在此HTTPS URL中，代码将是可用的，以促进未来的研究。

Title: Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality

Authors: Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Nour Aburaed, Alessandro Bruno
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2505.19696
Pdf URL: https://arxiv.org/pdf/2505.19696
Copy Paste: [[2505.19696]] Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality(https://arxiv.org/abs/2505.19696)
Keywords: quality assessment
Abstract: This position paper argues that Mean Opinion Score (MOS), while historically foundational, is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. MOS reduces rich, context-sensitive human judgments to a single scalar, obscuring semantic failures, user intent, and the rationale behind quality decisions. We contend that modern quality assessment models must integrate three interdependent capabilities: (1) context-awareness, to adapt evaluations to task-specific goals and viewing conditions; (2) reasoning, to produce interpretable, evidence-grounded justifications for quality judgments; and (3) multimodality, to align perceptual and semantic cues using vision-language models. We critique the limitations of current MOS-centric benchmarks and propose a roadmap for reform: richer datasets with contextual metadata and expert rationales, and new evaluation metrics that assess semantic alignment, reasoning fidelity, and contextual sensitivity. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
摘要：该立场论文认为，卑鄙的意见评分（MOS）虽然历史上的基础，但不再足以作为多媒体质量评估模型的唯一监督信号。 MOS减少了对单个标量的丰富，上下文敏感的人类判断，掩盖了语义失败，用户意图以及质量决策背后的理由。我们认为，现代质量评估模型必须整合三个相互依存的功能：（1）上下文意识，以使评估适应特定于任务的目标和观看条件；（2）推理，为质量判断产生可解释的，有证据的理由；（3）多模态，以使用视觉模型对齐感知和语义提示。我们批评当前以MOS为中心的基准的局限性，并提出了改革的路线图：具有上下文元数据和专家理由的较丰富的数据集，以及评估语义一致性，推理忠诚度和上下文敏感性的新评估指标。通过将质量评估作为上下文，可解释和多模式建模任务，我们旨在促进向更健壮，人类和值得信赖的评估系统转变。

Title: Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

Authors: Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2505.19699
Pdf URL: https://arxiv.org/pdf/2505.19699
Copy Paste: [[2505.19699]] Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments(https://arxiv.org/abs/2505.19699)
Keywords: generation, generative
Abstract: Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at this https URL.
摘要：Federated Learning（FL）是一个分散的机器学习范式，使客户能够在保留数据隐私的同时协作训练模型。但是，模型和数据异质性的共存会导致客户端的表示不一致，并且跨客户的优化动态不同，最终阻碍了强大的全球性能。为了超越这些挑战，我们提出了Mosaic，这是一种针对异质分布环境量身定制的新型无数据知识蒸馏框架。 Mosaic首先训练当地生成模型近似于每个客户的个性化分布，从而使合成数据生成可以通过严格的分离实际数据来保护隐私。随后，马赛克基于客户模型的专业知识形成了专家模型的混合物（MOE），并使用生成的数据将其提炼成全球模型。为了进一步增强MOE体系结构，Mosaic通过对一些代表性原型训练的轻质元模型整合了专家预测。对标准图像分类基准的广泛实验表明，在模型和数据异质性下，马赛克始终超过最先进的方法。源代码已在此HTTPS URL上发布。
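Illustrative sketch (not from the paper): distilling a Mixture-of-Experts of client models into a global model can be pictured as forming a teacher distribution from weighted expert predictions and penalizing the student's divergence from it. The fixed gate weights, the logits, and the choice of KL divergence below are assumptions for illustration; in particular, the abstract's meta model for combining expert predictions is not reproduced.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def moe_teacher(expert_logits, gate_weights):
    """Mix expert class probabilities with gate weights that sum to 1."""
    probs = [softmax(l) for l in expert_logits]
    n_classes = len(probs[0])
    return [sum(g * p[j] for g, p in zip(gate_weights, probs))
            for j in range(n_classes)]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q strictly positive."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

# Two hypothetical client models scoring one synthetic sample (3 classes).
teacher = moe_teacher([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]], [0.7, 0.3])
student = softmax([1.0, 0.5, -0.5])
distill_loss = kl_divergence(teacher, student)
```

Minimizing this loss over synthetic samples pulls the student's predictions toward the expert mixture without touching any client's real data.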

Title: On the Relation between Rectified Flows and Optimal Transport

Authors: Johannes Hertrich, Antonin Chambolle, Julie Delon
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2505.19712
Pdf URL: https://arxiv.org/pdf/2505.19712
Copy Paste: [[2505.19712]] On the Relation between Rectified Flows and Optimal Transport(https://arxiv.org/abs/2505.19712)
Keywords: generative
Abstract: This paper investigates the connections between rectified flows, flow matching, and optimal transport. Flow matching is a recent approach to learning generative models by estimating velocity fields that guide transformations from a source to a target distribution. Rectified flow matching aims to straighten the learned transport paths, yielding more direct flows between distributions. Our first contribution is a set of invariance properties of rectified flows and explicit velocity fields. In addition, we also provide explicit constructions and analysis in the Gaussian (not necessarily independent) and Gaussian mixture settings and study the relation to optimal transport. Our second contribution addresses recent claims suggesting that rectified flows, when constrained such that the learned velocity field is a gradient, can yield (asymptotically) solutions to optimal transport problems. We study the existence of solutions for this problem and demonstrate that they only relate to optimal transport under assumptions that are significantly stronger than those previously acknowledged. In particular, we present several counter-examples that invalidate earlier equivalence results in the literature, and we argue that enforcing a gradient constraint on rectified flows is, in general, not a reliable method for computing optimal transport maps.
摘要：本文研究了整流的流，流匹配和最佳传输之间的连接。流量匹配是一种通过估计引导从源到目标分布的转换的速度场来进行学习生成模型的方法。整流的流匹配旨在拉直学习的传输路径，在分布之间产生更多直接的流动。我们的第一个贡献是一组整流的流量和显式速度场的不变性属性。此外，我们还提供了高斯（不一定是独立）和高斯混合物设置中的明确构造和分析，并研究与最佳运输的关系。我们的第二个贡献解决了最近的主张，表明在受到限制的情况下，纠正的流量是一种梯度，可以（渐近）解决最佳运输问题（渐近）解决方案。我们研究了该问题的解决方案的存在，并证明它们仅在比以前所承认的明显强的假设下与最佳运输有关。特别是，我们提出了几个反审查，使早期的等效性无效，因此我们认为对整流流的梯度约束通常不是计算最佳传输图的可靠方法。

Title: HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance

Authors: Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19742
Pdf URL: https://arxiv.org/pdf/2505.19742
Copy Paste: [[2505.19742]] HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance(https://arxiv.org/abs/2505.19742)
Keywords: restoration
Abstract: Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: this https URL.
摘要：以人为中心的图像在传播过程中通常会遭受严重的通用降解，并且容易受到人类运动模糊（HMB）的影响，从而使恢复具有挑战性。现有的研究缺乏对这些问题的足够的关注，因为这两个问题在实践中常常共存。为了解决这个问题，我们设计了一种退化管道，该管道模拟了HMB和通用噪声的共存，生成合成的降解数据以训练我们所提出的Haodiff，这是一种人类感知的一步一步扩散。具体而言，我们提出了三支分支双提取指南（DPG），该指南利用高质量的图像，残留噪声（LQ减去HQ）和HMB分割掩码作为训练目标。它在单个扩散步骤中产生一个阳性阴性及时对，用于无分类器指导（CFG）。由此产生的自适应双重提示使Haodiff更有效地利用CFG，从而提高了针对各种降解的稳健性。为了进行公平的评估，我们引入了MPII检验，这是一种富含噪声和HMB病例的基准。广泛的实验表明，我们的Haodiff在合成和真实数据集的定量指标和视觉质量方面都超过了现有的最新方法（SOTA）方法，包括我们引入的MPII测试。代码可用：此HTTPS URL。

Title: Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation

Authors: Jakov Samardžija, Donik Vršnak, Sven Lončarić
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19746
Pdf URL: https://arxiv.org/pdf/2505.19746
Copy Paste: [[2505.19746]] Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation(https://arxiv.org/abs/2505.19746)
Keywords: generation
Abstract: Accurate identification of acute cellular rejection (ACR) in endomyocardial biopsies is essential for effective management of heart transplant patients. However, the rarity of high-grade rejection cases (3R) presents a significant challenge for training robust deep learning models. This work addresses the class imbalance problem by leveraging synthetic data generation using StyleGAN to augment the limited number of real 3R images. Prior to GAN training, histogram equalization was applied to standardize image appearance and improve the consistency of tissue representation. StyleGAN was trained on available 3R biopsy patches and subsequently used to generate 10,000 realistic synthetic images. These were combined with real 0R samples, that is, samples without rejection, in various configurations to train ResNet-18 classifiers for binary rejection classification. Three classifier variants were evaluated: one trained on real 0R and synthetic 3R images, another using both synthetic and additional real samples, and a third trained solely on real data. All models were tested on an independent set of real biopsy images. Results demonstrate that synthetic data improves classification performance, particularly when used in combination with real samples. The highest-performing model, which used both real and synthetic images, achieved strong precision and recall for both classes. These findings underscore the value of hybrid training strategies and highlight the potential of GAN-based data augmentation in biomedical image analysis, especially in domains constrained by limited annotated datasets.
摘要：心内膜活检中急性细胞排斥（ACR）的准确鉴定对于有效治疗心脏移植患者至关重要。但是，高级排斥案例（3R）的稀有性给训练强大的深度学习模型带来了重大挑战。这项工作通过利用stylegan来增强有限数量的真实3R图像来利用合成数据来解决类不平衡问题。在GAN训练之前，将直方图均衡用于标准化图像外观并提高组织表示的一致性。 thilegan接受了可用的3R活检补丁的培训，随后用于生成10,000个逼真的合成图像。这些与真实的0R样本结合使用，即不排斥的样品，以各种配置来训练Resnet-18分类器进行二进制排斥分类。评估了三个分类器变体：一个在实际0R和合成3R图像上训练，另一个使用合成和其他真实样品进行了培训，而第三个则仅对实际数据进行了训练。所有模型均在独立的一组真实活检图像上进行了测试。结果表明，合成数据可改善分类性能，尤其是与实际样品结合使用时。使用真实图像和合成图像的最高表现模型均达到了两种类别的良好精度和回忆。这些发现强调了混合培训策略的价值，并突出了基于GAN的数据增强在生物医学图像分析中的潜力，尤其是在受到注释数据集约束的域中。

Title: Discrete Markov Bridge

Authors: Hengli Li, Yuxuan Wang, Song-Chun Zhu, Ying Nian Wu, Zilong Zheng
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19752
Pdf URL: https://arxiv.org/pdf/2505.19752
Copy Paste: [[2505.19752]] Discrete Markov Bridge(https://arxiv.org/abs/2505.19752)
Keywords: generation
Abstract: Discrete diffusion has recently emerged as a promising paradigm in discrete data modeling. However, existing methods typically rely on a fixed rate transition matrix during training, which not only limits the expressiveness of latent representations, a fundamental strength of variational methods, but also constrains the overall design space. To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning. Our approach is built upon two key components: Matrix Learning and Score Learning. We conduct a rigorous theoretical analysis, establishing formal performance guarantees for Matrix Learning and proving the convergence of the overall framework. Furthermore, we analyze the space complexity of our method, addressing practical constraints identified in prior studies. Extensive empirical evaluations validate the effectiveness of the proposed Discrete Markov Bridge, which achieves an Evidence Lower Bound (ELBO) of 1.38 on the Text8 dataset, outperforming established baselines. Moreover, the proposed model demonstrates competitive performance on the CIFAR-10 dataset, achieving results comparable to those obtained by image-specific generation approaches.
摘要：离散扩散最近已成为离散数据建模中有希望的范式。但是，现有方法通常依赖于训练期间的固定速率过渡矩阵，这不仅限制了潜在表示的表现力，这是变异方法的基本强度，而且还限制了总体设计空间。为了解决这些限制，我们提出了离散的马尔可夫桥，这是一个专门设计用于离散表示学习的新颖框架。我们的方法基于两个关键组成部分：矩阵学习和分数学习。我们进行了严格的理论分析，为矩阵学习建立了正式的绩效保证，并证明了整体框架的融合。此外，我们分析了方法的空间复杂性，并解决了先前研究中确定的实际约束。广泛的经验评估验证了拟议的离散马尔可夫桥的有效性，该桥的有效性是在文本8数据集上获得1.38的证据，表现优于已建立的基准。此外，提出的模型在CIFAR-10数据集上展示了竞争性能，从而实现了与特定于图像特定生成方法获得的结果相当的结果。

Title: The Missing Point in Vision Transformers for Universal Image Segmentation

Authors: Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2505.19795
Pdf URL: https://arxiv.org/pdf/2505.19795
Copy Paste: [[2505.19795]] The Missing Point in Vision Transformers for Universal Image Segmentation(https://arxiv.org/abs/2505.19795)
Keywords: generation
Abstract: Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: this https URL.
摘要：在计算机视觉中，图像分割仍然是一项具有挑战性的任务，要求强大的面罩生成和精确的分类。最近基于面具的方法通过捕获全球环境产生高质量的口罩。但是，准确地对这些面具进行分类，尤其是在含糊不清的边界和班级分布不平衡的情况下，仍然是一个开放的挑战。在这项工作中，我们介绍了VIT-P，这是一种新型的两阶段分割框架，将掩盖的生成与分类相反。第一阶段采用提案生成器来生成类不足的面具建议，而第二阶段则利用基于点的分类模型（VIT变压器（VIT））通过重点关注掩码中心点来完善预测。 VIT-P是无培训的适配器，允许整合各种预训练的视觉变压器，而无需修改其体系结构，从而确保适应着密集的预测任务。此外，我们证明了粗糙和边界框的注释可以有效地增强分类，而无需在精细注释数据集上进行额外的培训，从而降低注释成本，同时保持强劲的性能。可可，ADE20K和CityScapes数据集进行了广泛的实验，验证了VIT-P的有效性，在ADE20K Panoptic分割上以54.0 PQ实现了最先进的结果，在CityScapes语义分段上为87.4 MIOU，在ADEE220K Semantic Semantic Chempecation上进行了63.6 MIOU。代码和预估计的模型可在以下网址提供：此HTTPS URL} {此HTTPS URL。

Title: A Regularization-Guided Equivariant Approach for Image Restoration

Authors: Yulu Bai, Jiahong Fu, Qi Xie, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19799
Pdf URL: https://arxiv.org/pdf/2505.19799
Copy Paste: [[2505.19799]] A Regularization-Guided Equivariant Approach for Image Restoration(https://arxiv.org/abs/2505.19799)
Keywords: restoration
Abstract: Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which demand high accuracy and precise symmetry representation. To address these challenges, we propose a rotation-equivariant regularization strategy that adaptively enforces the appropriate symmetry constraints on the data while preserving the network's representational accuracy. Specifically, we introduce EQ-Reg, a regularizer designed to enhance rotation equivariance, which innovatively extends the insights of data-augmentation-based and equivariant-based methodologies. This is achieved through self-supervised learning and the spatial rotation and cyclic channel shift of feature maps deduced in the equivariant framework. Our approach firstly enables a non-strictly equivariant network suitable for image restoration, providing a simple and adaptive mechanism for adjusting equivariance based on the task. Extensive experiments across three low-level tasks demonstrate the superior accuracy and generalization capability of our method, outperforming state-of-the-art approaches.
摘要：已经开发了模棱两可的深度学习模型，以利用数据中的内在对称性，在某些情况下显示出显着的有效性。但是，这些方法通常会遭受有限的代表精度，并依赖于实践中可能不存在的严格对称假设。这些限制对于图像恢复任务构成了重要的缺点，这需要高精度和精确的对称表示。为了应对这些挑战，我们提出了一种旋转等值的正则化策略，该策略可以适应数据对数据的适当对称约束，同时保留网络的表示准确性。具体而言，我们引入了EQ-REG，这是一种旨在增强旋转模棱两可的正常化程序，它创新地扩展了基于数据启发的基于数据启发和基于等价的方法的见解。这是通过自我监督的学习以及在模糊框架中推断出的特征地图的空间旋转和循环频道移动来实现的。我们的方法首先实现了一个适合图像恢复的非刻板标准网络，提供了一种简单而适应性的机制，用于根据任务调整肩variance。跨三个低级任务进行的广泛实验证明了我们方法的卓越准确性和概括能力，表现优于最先进的方法。

Title: Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation

Authors: Nagito Saito, Shintaro Ito, Koichi Ito, Takafumi Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19846
Pdf URL: https://arxiv.org/pdf/2505.19846
Copy Paste: [[2505.19846]] Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation(https://arxiv.org/abs/2505.19846)
Keywords: generation
Abstract: Semantic segmentation is a fundamental task in medical image analysis and autonomous driving, but it suffers from the high cost of annotating the labels required in training. To address this problem, semantic segmentation methods based on semi-supervised learning with a small amount of labeled data have been proposed. For example, one approach is to train a semantic segmentation model using images with annotated labels and pseudo labels. In this approach, the accuracy of the semantic segmentation model depends on the quality of the pseudo labels, and the quality of the pseudo labels depends on the performance of the model to be trained and the amount of data with annotated labels. In this paper, we generate pseudo labels using zero-shot annotation with the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through experiments on the public datasets PASCAL and MS COCO.
摘要：语义细分是医学图像分析和自动驾驶中的一项基本任务，并且在注释培训所需的标签的高成本方面存在问题。为了解决这个问题，已经提出了基于半监督学习的语义分割方法，并提出了少量标记的数据。例如，一种方法是使用带有注释标签和伪标签的图像训练语义分割模型。在这种方法中，语义分割模型的准确性取决于伪标签的质量，伪标签的质量取决于要训练的模型的性能以及带有带注释标签的数据量。在本文中，我们使用零射门注释与段的任何模型（SAM）和对比性语言图像预处理（剪辑）生成伪标签，提高了使用统一的双流扰动方法（UNIMATCH）的伪标签的准确性，并将其用作增强的标签来训练语义分组模型。通过使用公共数据集的实验证明了该方法的有效性：Pascal和MS Coco。

Title: A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

Authors: Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, Konrad Schindler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19858
Pdf URL: https://arxiv.org/pdf/2505.19858
Copy Paste: [[2505.19858]] A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking(https://arxiv.org/abs/2505.19858)
Keywords: generation
Abstract: The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel framework that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: this https URL.
摘要：现实世界是动态的，但大多数图像融合方法独立地处理静态帧，忽略了视频中的时间相关性，导致闪烁和时间不一致。为了解决这个问题，我们提出了统一的视频融合（UNIVF），这是一个新型的临时视频融合框架，利用多帧学习和基于光流的功能扭曲，以提供信息丰富的时间相干视频融合。为了支持其开发，我们还介绍了视频融合基准（VF Bench），这是涵盖四个视频融合任务的第一个全面的基准：多曝光，多聚焦，可视化和医学融合。 VF板凳提供了通过合成数据生成和从现有数据集进行严格策划获得的高质量，良好的视频对，并具有统一的评估协议，共同评估了视频融合的空间质量和时间一致性。广泛的实验表明，UNIVF在VF板凳上所有任务中都取得了最新的结果。项目页面：此HTTPS URL。

Title: Deep Active Inference Agents for Delayed and Long-Horizon Environments

Authors: Yavar Taheri Yeganeh, Mohsen Jafari, Andrea Matta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19867
Pdf URL: https://arxiv.org/pdf/2505.19867
Copy Paste: [[2505.19867]] Deep Active Inference Agents for Delayed and Long-Horizon Environments(https://arxiv.org/abs/2505.19867)
Keywords: generative
Abstract: With the recent success of world-model agents, which extend the core idea of model-based reinforcement learning by learning a differentiable model for sample-efficient control across diverse tasks, active inference (AIF) offers a complementary, neuroscience-grounded paradigm that unifies perception, learning, and action within a single probabilistic framework powered by a generative model. Despite this promise, practical AIF agents still rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments requiring plans over long horizons of tens to hundreds of steps. Moreover, most existing agents are evaluated on robotic or vision benchmarks which, while natural for biological agents, fall short of real-world industrial complexity. We address these limitations with a generative-policy architecture featuring (i) a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead, (ii) an integrated policy network that enables the transition and receives gradients of the expected free energy, (iii) an alternating optimization scheme that updates model and policy from a replay buffer, and (iv) a single gradient step that plans over long horizons, eliminating exhaustive planning from the control loop. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long-horizon settings. The empirical results confirm the effectiveness of the proposed approach, demonstrating that coupling a world model with the AIF formalism yields an end-to-end probabilistic controller capable of effective decision making in delayed, long-horizon settings without handcrafted rewards or expensive planning.
摘要：随着世界模型代理的最新成功，通过学习一个可区分的模型来扩展基于模型的强化学习的核心思想，用于跨不同任务的样本有效控制，主动推理（AIF）提供了一种互补的，神经科学的范式，可以在单个概率的框架中统一感知，学习和动作，以产生生成的模型。尽管有这样的承诺，实用的AIF代理仍然依靠准确的即时预测和详尽的计划，这一限制会在延迟的环境中加剧，需要长时间的计划，数十至数百个步骤。此外，大多数现有的代理都是根据机器人或视觉基准进行评估的，尽管对生物学剂的天然，但它们却没有现实世界中的工业复杂性。 We address these limitations with a generative-policy architecture featuring (i) a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead, (ii) an integrated policy network that enables the transition and receives gradients of the expected free energy, (iii) an alternating optimization scheme that updates model and policy from a replay buffer, and (iv) a single gradient step that plans over long horizons, eliminating exhaustive从控制循环进行计划。我们在模仿具有延迟和长途环境的现实工业场景的环境中评估我们的代理。经验结果证实了拟议方法的有效性，证明了与AIF形式主义的耦合世界模型产生的端到端概率控制器，能够在没有手工奖励或昂贵计划的情况下有效决策。

Title: Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling

Authors: Junhong Lee, Seungwook Kim, Minsu Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19868
Pdf URL: https://arxiv.org/pdf/2505.19868
Copy Paste: [[2505.19868]] Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling(https://arxiv.org/abs/2505.19868)
Keywords: generation
Abstract: Recent studies show that simple training-free techniques, e.g. Classifier-Free Guidance (CFG) or FreeU, can dramatically improve the quality of text-to-2D generation outputs. However, these training-free techniques have been underexplored through the lens of Score Distillation Sampling (SDS), which is a popular and effective technique to leverage the power of pretrained text-to-2D diffusion models for various tasks. In this paper, we aim to shed light on the effect such training-free techniques have on SDS, via a particular application of text-to-3D generation via 2D lifting. We present our findings, which show that varying the scales of CFG presents a trade-off between object size and surface smoothness, while varying the scales of FreeU presents a trade-off between texture details and geometric errors. Based on these findings, we provide insights into how we can effectively harness training-free techniques for SDS, via a strategic scaling of such techniques in a dynamic manner with respect to the timestep or optimization iteration step. We show that using our proposed scheme strikes a favorable balance between texture details and surface smoothness in text-to-3D generations, while preserving the size of the output and mitigating the occurrence of geometric defects.
摘要：最近的研究表明，简单的无培训技术可以显着提高文本对2D生成产出的质量，例如无分类器指导（CFG）或FreeU。但是，这些无训练技术在得分蒸馏采样（SDS）的镜头中没有充满反应，这是一种流行而有效的技术，可利用预验证的文本到2D扩散模型的能力来完成各种任务。在本文中，我们旨在通过特定的通过2D举重来阐明这种无培训技术对SD的影响。我们提出了我们的发现，这些发现表明，改变CFG的尺度在对象大小和表面平滑度之间进行了权衡，而改变FreeU的尺度则在纹理细节和几何错误之间呈现了权衡。基于这些发现，我们提供了有关如何通过以这种动态方式对TimeStep或优化迭代步骤的动态方式进行此类技术的战略缩放来有效利用SDS的无训练技术的见解。我们表明，使用我们建议的方案在文本到3D代的纹理细节和表面平滑之间取得了良好的平衡，同时保留输出的大小并减轻几何缺陷的发生。

Title: Deep Spectral Prior

Authors: Yanqi Cheng, Tieyong Zeng, Pietro Lio, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2505.19873
Pdf URL: https://arxiv.org/pdf/2505.19873
Copy Paste: [[2505.19873]] Deep Spectral Prior(https://arxiv.org/abs/2505.19873)
Keywords: super-resolution
Abstract: We introduce Deep Spectral Prior (DSP), a new formulation of Deep Image Prior (DIP) that redefines image reconstruction as a frequency-domain alignment problem. Unlike traditional DIP, which relies on pixel-wise loss and early stopping to mitigate overfitting, DSP directly matches Fourier coefficients between the network output and observed measurements. This shift introduces an explicit inductive bias towards spectral coherence, aligning with the known frequency structure of images and the spectral bias of convolutional neural networks. We provide a rigorous theoretical framework demonstrating that DSP acts as an implicit spectral regulariser, suppressing high-frequency noise by design and eliminating the need for early stopping. Our analysis spans four core dimensions establishing smooth convergence dynamics, local stability, and favourable bias-variance tradeoffs. We further show that DSP naturally projects reconstructions onto a frequency-consistent manifold, enhancing interpretability and robustness. These theoretical guarantees are supported by empirical results across denoising, inpainting, and super-resolution tasks, where DSP consistently outperforms classical DIP and other unsupervised baselines.
摘要：我们介绍了深度频谱先验（DSP），这是一种深层图像先验（DIP）的新公式，将图像重建重建为频域对准问题。与传统的倾角不同，这依赖于像素的损失和早期停止以减轻过度拟合，而DSP直接与网络输出和观察到的测量值之间的傅立叶系数直接匹配。这种转变引入了对光谱一致性的明确感应偏置，与已知的图像频率结构和卷积神经网络的光谱偏置对齐。我们提供了一个严格的理论框架，表明DSP充当隐式光谱正规机，通过设计来抑制高频噪声，并消除了早期停止的需求。我们的分析跨越了四个核心维度，从而建立了平稳的收敛动力学，局部稳定性和有利的偏见差异。我们进一步表明，DSP自然地将重构投射到频率一致的歧管上，从而增强了解释性和鲁棒性。这些理论保证得到了在denoising，indpain和超分辨率任务之间的经验结果的支持，在这些任务中，DSP始终超过经典的倾角和其他无监督的基准。

Title: StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Authors: Yi Wu, Lingting Zhu, Shengju Qian, Lei Liu, Wandi Qiao, Lequan Yu, Bin Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.19874
Pdf URL: https://arxiv.org/pdf/2505.19874
Copy Paste: [[2505.19874]] StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation(https://arxiv.org/abs/2505.19874)
Keywords: generation, generative
Abstract: In the current research landscape, multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. In analogy to instruction-following tuning for image editing of AR models, style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. However, acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but only incorporates the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with multimodal tokens in AR models and implement a style-enhanced token technique to prevent content leakage which is a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR's ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate our superior performance.
摘要：在当前的研究格局中，多模式自回旋（AR）模型在包括视觉理解和产生在内的各个领域都表现出非凡的功能。但是，诸如样式一致的文本到图像生成之类的复杂任务提出了重大挑战，尤其是在数据获取中。类似于AR模型的图像编辑的指令跟随调整，样式一致的生成需要参考样式图像和提示，从而产生了文本图像到图像形象，其中输出共享输入的样式和语义。但是，与获取用于培训生成模型的常规文本对图像数据相比，以特定样式获取大量此类三重态数据更具挑战性。为了解决这个问题，我们提出了一种创新的造型方法，该方法将特殊设计的数据策展方法与我们建议的AR模型相结合，以有效利用文本到图像二进制数据，以实现样式与样式的文本对图像生成。我们的方法使用参考样式图像并提示合成目标风格的数据，但仅将目标风格化的图像作为图像模态创建高质量的二进制数据。为了促进二进制数据培训，我们引入了一个使用感知器重采样器的剪贴图像编码器，该剪辑图像将图像输入转换为与AR模型中多模式令牌对齐的样式令牌，并实现一种风格增强的令牌技术，以防止内容泄漏，这在先前的工作中是常见的问题。此外，我们将来自大规模文本图像数据集绘制的原始图像与风格化的图像混合在一起，以增强造型型提取更丰富的风格特征并确保样式一致性的能力。广泛的定性和定量实验证明了我们的出色表现。

Title: Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM

Authors: Peng Liu, Xiaoming Ren, Fengkai Liu, Qingsong Xie, Quanlong Zheng, Yanhao Zhang, Haonan Lu, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19901
Pdf URL: https://arxiv.org/pdf/2505.19901
Copy Paste: [[2505.19901]] Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM(https://arxiv.org/abs/2505.19901)
Keywords: generation
Abstract: Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias towards favoring low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE - a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. In conclusion, extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.
摘要：在传统方案中，图像到视频（I2V）一代的最新进步表现出了令人鼓舞的表现。但是，在处理需要深入了解细微的运动和复杂的对象关系关系的复杂场景时，这些方法仍然遇到重大挑战。为了应对这些挑战，我们提出了Dynamic-I2V，这是一个创新的框架，该框架将多模式大型语言模型（MLLMS）集成到共同编码扩散变压器（DIT）体系结构的视觉和文本条件。通过利用MLLM的先进的多模式理解能力，我们的模型可显着提高合成视频中的运动可控性和时间连贯性。动态-I2V的固有多模态进一步为各种条件输入提供了灵活的支持，从而将其适用性扩展到各种下游生成任务。通过系统分析，我们确定了当前I2V基准测试的关键限制：偏爱低动力视频的重要偏见，这是由于运动复杂性和视觉质量指标之间的平衡不足。为了解决这一评估差距，我们提出了潜水 - 一种新颖的评估基准，专门设计用于I2V生成的全面动态质量测量。总之，广泛的定量和定性实验证实，动态-I2V在图像到视频产生中达到了最先进的性能，尤其是在与现有方法相比评估的，分别揭示了分别在动态范围，可控性和质量的42.5％，7.9％和11.8％的显着改善，分别在动态范围，可控性和质量上。

Title: Attention! You Vision Language Model Could Be Maliciously Manipulated

Authors: Xiaosen Wang, Shaokang Wang, Zhijin Ge, Yuyang Luo, Shudong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19911
Pdf URL: https://arxiv.org/pdf/2505.19911
Copy Paste: [[2505.19911]] Attention! You Vision Language Model Could Be Maliciously Manipulated(https://arxiv.org/abs/2505.19911)
Keywords: generation
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit significant vulnerability to adversarial examples, whether text or image, which can lead to various adversarial outcomes, e.g., jailbreaking, hijacking, and hallucination. In this work, we empirically and theoretically demonstrate that VLMs are particularly susceptible to image-based adversarial examples, where imperceptible perturbations can precisely manipulate each output token. To this end, we propose a novel attack called Vision-language model Manipulation Attack (VMA), which integrates first-order and second-order momentum optimization techniques with a differentiable transformation mechanism to effectively optimize the adversarial perturbation. Notably, VMA can be a double-edged sword: it can be leveraged to implement various attacks, such as jailbreaking, hijacking, privacy breaches, Denial-of-Service, and the generation of sponge examples, while simultaneously enabling the injection of watermarks for copyright protection. Extensive empirical evaluations substantiate the efficacy and generalizability of VMA across diverse scenarios and datasets.
摘要：大型视觉模型（VLM）在理解复杂的现实世界情景和支持数据驱动的决策过程方面取得了巨大的成功。但是，VLM在针对文本或图像的对抗性例子上表现出很大的脆弱性，这可能导致各种对抗性结果，例如，越狱，劫持和幻觉等。在这项工作中，我们从经验上和理论上证明了VLM尤其易于基于图像的对抗性示例，因此可以确切地供您精确地进行替代。为此，我们提出了一种称为视觉模型操纵攻击（VMA）的新型攻击，该攻击将一阶和二阶动量优化技术与可区分的转换机制相结合，以有效地优化对抗性扰动。值得注意的是，VMA可以是一把双刃剑：可以利用它来实施各种攻击，例如越狱，劫持，隐私漏洞，服务式服务和海绵示例等，同时允许向水印注入版权保护。广泛的经验评估证实了VMA在各种情况和数据集中的功效和普遍性。

Title: An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning

Authors: Andrew Zamai, Nathanael Fijalkow, Boris Mansencal, Laurent Simon, Eloi Navet, Pierrick Coupe
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19954
Pdf URL: https://arxiv.org/pdf/2505.19954
Copy Paste: [[2505.19954]] An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning(https://arxiv.org/abs/2505.19954)
Keywords: generative
Abstract: The differential diagnosis of neurodegenerative dementias is a challenging clinical task, mainly because of the overlap in symptom presentation and the similarity of patterns observed in structural neuroimaging. To improve diagnostic efficiency and accuracy, deep learning-based methods such as Convolutional Neural Networks and Vision Transformers have been proposed for the automatic classification of brain MRIs. However, despite their strong predictive performance, these models find limited clinical utility due to their opaque decision making. In this work, we propose a framework that integrates two core components to enhance diagnostic transparency. First, we introduce a modular pipeline for converting 3D T1-weighted brain MRIs into textual radiology reports. Second, we explore the potential of modern Large Language Models (LLMs) to assist clinicians in the differential diagnosis between Frontotemporal dementia subtypes, Alzheimer's disease, and normal aging based on the generated reports. To bridge the gap between predictive accuracy and explainability, we employ reinforcement learning to incentivize diagnostic reasoning in LLMs. Without requiring supervised reasoning traces or distillation from larger models, our approach enables the emergence of structured diagnostic rationales grounded in neuroimaging findings. Unlike post-hoc explainability methods that retrospectively justify model decisions, our framework generates diagnostic rationales as part of the inference process, producing causally grounded explanations that inform and guide the model's decision-making process. In doing so, our framework matches the diagnostic performance of existing deep learning methods while offering rationales that support its diagnostic conclusions.
摘要：神经退行性痴呆的鉴别诊断是一项艰巨的临床任务，这主要是因为症状表现的重叠以及在结构神经影像学中观察到的模式的相似性。为了提高诊断效率和准确性，已经提出了基于深度学习的方法，例如卷积神经网络和视觉变压器，以自动对大脑MRIS进行自动分类。然而，尽管具有强大的预测性能，但由于其不透明的决策，这些模型发现临床实用性有限。在这项工作中，我们提出了一个框架，该框架集成了两个核心组件以提高诊断透明度。首先，我们引入了一个模块化管道，用于将3D T1加权大脑MRIS转换为文本放射学报告。其次，我们探讨了现代大型语言模型（LLM）的潜力，以帮助临床医生根据生成的报告，在额颞痴呆症亚型，阿尔茨海默氏病和正常衰老之间进行鉴别诊断。为了弥合预测准确性和解释性之间的差距，我们采用强化学习来激励LLMS中的诊断推理。我们的方法不需要监督推理轨迹或从较大模型中进行蒸馏，因此可以在神经成像发现中建立结构化诊断理由的出现。与回顾性的可解释性方法不同，我们的框架是推理过程产生的因果关系基础的解释的一部分，我们的框架会产生诊断原理，从而为模型的决策过程提供了信息和指导。在此过程中，我们的框架与现有深度学习方法的诊断性能相匹配，同时提供支持其诊断结论的理由。

Title: MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Authors: Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.19955
Pdf URL: https://arxiv.org/pdf/2505.19955
Copy Paste: [[2505.19955]] MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research(https://arxiv.org/abs/2505.19955)
Keywords: generation
Abstract: Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
摘要：AI代理商的最新进展表明，它们在推动和支持科学发现的潜力越来越大。在这项工作中，我们介绍了MLR Bench，这是一种评估AI代理商的全面基准测试，用于开放式机器学习研究。 MLR板凳包括三个关键组成部分：（1）201个研究任务来自Neurips，ICLR和ICML研讨会，涵盖了不同的ML主题；（2）MLR-Judge，一个自动化评估框架，将基于LLM的审阅者与精心设计的评论专栏相结合以评估研究质量；（3）MLR-Agent，一种模块化代理脚手架，能够通过四个阶段完成研究任务：创意生成，提案制定，实验和纸质写作。我们的框架支持在这些不同的研究阶段进行逐步评估，以及对最终研究论文的端到端评估。然后，我们使用MLR板台评估六个Frontier LLM和一个高级编码剂，发现LLM有效地产生了相干的想法和结构良好的论文，但当前的编码剂经常（例如，在80％的情况下）产生制造的或无效的实验结果，将主要的障碍物置于科学可靠性上。我们通过人类评估来验证MLR法官，与专家审稿人表示高度同意，并支持其作为研究评估的可扩展工具的潜力。我们开源MLR基础，以帮助社区基准，诊断和改善AI研究人员，以实现值得信赖和透明的科学发现。

Title: UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space

Authors: Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19958
Pdf URL: https://arxiv.org/pdf/2505.19958
Copy Paste: [[2505.19958]] UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space(https://arxiv.org/abs/2505.19958)
Keywords: restoration, super-resolution
Abstract: Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Restoration Schedule (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. This design eliminates randomness from diffusion noise and significantly speeds up inference. To ensure temporal consistency, we propose a lightweight yet effective Recurrent Temporal Shift (RTS) module, composed of an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, these two units collaboratively facilitate effective feature propagation, fusion, and alignment across neighboring frames, without relying on explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step.
摘要：扩散模型在生成逼真的图像细节方面显示出很大的潜力。但是，由于其固有的随机性和缺乏时间建模，将这些模型适应视频超分辨率（VSR）仍然具有挑战性。在本文中，我们提出了Ultravsr，这是一个新型框架，可以通过有效的一步扩散空间实现超现实和时间连接的VSR。 Ultravsr的一个核心组成部分是降解感知的恢复时间表（DRS），该计划估计了从低分辨率输入中的降解因子，并将迭代的脱氧过程转化为从低分辨率到高分辨率视频的单步重建。该设计消除了从扩散噪声中消除随机性，并显着加快了推断。为了确保时间一致性，我们提出了一个由RTS - 卷积单元和RTS注意单元组成的轻巧但有效的临时偏移（RTS）模块。通过沿时间维度的部分转移特征组件，这两个单元在不依赖明确的时间层的情况下协作促进了跨相邻帧的有效特征传播，融合和对齐。 RTS模块集成到预验证的文本对图像扩散模型中，并通过时空的关节蒸馏（SJD）进一步增强，从而提高了时间连贯性，同时保留了现实的细节。此外，我们引入了一个时间上异步推理（TAI）策略，以捕获有限的记忆约束下的远程时间依赖性。广泛的实验表明，UltravSR在定性和定量的单个采样步骤中都能实现最先进的性能。
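
Illustration (not from the paper): the UltraVSR abstract above describes "partially shifting feature components along the temporal dimension". The sketch below shows only a generic temporal shift on plain Python lists, to make the idea concrete. The function name `temporal_shift`, the `shift_fraction` parameter, the choice of shifting one group of channels forward and one backward, and the zero-filling at clip boundaries are all assumptions made for illustration; the actual RTS-convolution and RTS-attention units, and their recurrent behavior, are not specified in the abstract and are not modeled here.

```python
from typing import List

Frame = List[float]  # one feature vector (channels) per video frame


def temporal_shift(frames: List[Frame], shift_fraction: float = 0.25) -> List[Frame]:
    """Shift a fraction of channels by one step along the time axis.

    The first `fold` channels of frame t are taken from frame t-1,
    the next `fold` channels from frame t+1, and the rest stay in place.
    Neighbours that fall outside the clip are zero-filled (an assumption).
    """
    if not frames:
        return []
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = int(num_channels * shift_fraction)
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1  # bring information from the previous frame
            elif c < 2 * fold:
                src = t + 1  # bring information from the next frame
            else:
                src = t      # unshifted channels keep their own frame
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Toy clip: 3 frames, 4 channels each (hypothetical values).
clip = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
shifted_clip = temporal_shift(clip, shift_fraction=0.25)
```

After the shift, each frame's feature vector mixes channels from its neighbours, so a per-frame layer applied afterwards sees some temporal context without any explicit temporal layer.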

Title: Learning to Select In-Context Demonstration Preferred by Large Language Model

Authors: Zheng Zhang, Shaocheng Lan, Lei Song, Jiang Bian, Yexin Li, Kan Ren
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.19966
Pdf URL: https://arxiv.org/pdf/2505.19966
Copy Paste: [[2505.19966]] Learning to Select In-Context Demonstration Preferred by Large Language Model(https://arxiv.org/abs/2505.19966)
Keywords: generative
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods in selecting the most effective demonstrations, leading to better ICL performance.
摘要：在文化学习（ICL）中，大型语言模型（LLMS）只能使用几个演示来适应推理期间的新任务。但是，ICL性能高度取决于这些演示的选择。最近的工作探讨了基于检索的方法来选择特定查询的演示，但是这些方法通常依赖于替代目标，例如公制学习，未能直接优化ICL性能。因此，他们难以确定真正有益的示威。此外，当候选人缺乏足够的高质量示范时，他们的判别检索范式就无效。为了应对这些挑战，我们提出了Genicl，这是一种新型的生成偏好学习框架，利用LLM反馈直接优化ICL的演示选择。在11个任务类别的19个数据集上进行的实验表明，在选择最有效的演示方面，Genicl比现有方法取得了更好的性能，从而提高了ICL性能更好。

Title: PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction

Authors: Kanglei Zhou, Hubert P. H. Shum, Frederick W. B. Li, Xingxing Zhang, Xiaohui Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.19972
Pdf URL: https://arxiv.org/pdf/2505.19972
Copy Paste: [[2505.19972]] PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction(https://arxiv.org/abs/2505.19972)
Keywords: quality assessment
Abstract: Long-term Action Quality Assessment (AQA) aims to evaluate the quantitative performance of actions in long videos. However, existing methods face challenges due to domain shifts between the pre-trained large-scale action recognition backbones and the specific AQA task, thereby hindering their performance. This arises since fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, regarding differences in task objectives, and feature-level, regarding differences in important features. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Additionally, a temporally-enhanced attention module captures long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments demonstrate that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, proving its superiority in addressing the domain shift for long-term AQA.
摘要：长期行动质量评估（AQA）旨在评估长期视频中动作的定量性能。但是，由于预先训练的大规模动作识别骨架和特定的AQA任务之间的领域变化，现有方法面临挑战，从而阻碍了他们的性能。这是因为小型AQA数据集上的微调资源密集型主机是不切实际的。我们通过识别两个域移动级别来解决这一问题：任务级别，关于任务目标的差异以及特征级别的重要特征差异。对于更有害的特征级别的转移，我们提出了采用两种策略的进行性分层教学（PHI）。首先，差距最小化流量（GMF）利用流量匹配的流程逐渐学习的快速路径，从而减少了跨浅层到深层之间的初始特征和所需特征之间的域间隙。此外，一个时间增强的注意模块捕获了AQA必不可少的远程依赖性。其次，列表对比度正则化（LCR）通过全面比较批处理对以学习细粒线索，同时减轻域移动，从而有助于粗到最新的对齐。整合这些模块，PHI提供了有效的解决方案。实验表明，PHI在三个代表性的长期AQA数据集上实现了最先进的性能，证明了其在解决长期AQA的域转移方面的优势。

Title: Rethinking Probabilistic Circuit Parameter Learning

Authors: Anji Liu, Guy Van den Broeck
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.19982
Pdf URL: https://arxiv.org/pdf/2505.19982
Copy Paste: [[2505.19982]] Rethinking Probabilistic Circuit Parameter Learning(https://arxiv.org/abs/2505.19982)
Keywords: generative
Abstract: Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. While empirical extensions to the mini-batch setting have been proposed, it remains unclear what objective these algorithms are optimizing, making it difficult to assess their theoretical soundness. This paper bridges the gap by establishing a novel connection between the general EM objective and the standard full-batch EM algorithm. Building on this, we derive a theoretically grounded generalization to the mini-batch setting and demonstrate its effectiveness through preliminary empirical results.
摘要：概率电路（PC）为生成建模提供了一个可扩展的框架，支持了广泛的概率查询的精确推理。尽管最近的进步显着提高了PC的表现力和可扩展性，但有效训练其参数仍然是一个挑战。特别是，一种广泛使用的优化方法，即全批期望最大化（EM），需要在执行单个更新之前对整个数据集进行处理，从而使大型数据集无效。尽管已经提出了对迷你批次设置的经验扩展，但尚不清楚这些算法优化了什么目标，因此很难评估其理论声音。本文通过建立一般EM目标与标准的全批EM算法之间建立新的联系来弥合差距。在此基础上，我们将理论上的概括得分为小批量设置，并通过初步的经验结果证明了其有效性。
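
Illustration (not from the paper): the abstract above contrasts full-batch EM, which processes the whole dataset before one update, with mini-batch extensions. The sketch below shows that contrast on the simplest possible case, a single sum node whose children have fixed likelihoods, so only the mixture weights are learned. The function names, the `step_size` parameter, and the interpolation rule used in `mini_batch_em_step` are assumptions chosen for illustration; they represent one common heuristic, not the theoretically grounded update the paper derives.

```python
from typing import List, Sequence


def responsibilities(weights: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """E-step for one sample: posterior over the sum node's children.

    `likelihoods[k]` is the (assumed precomputed, positive) likelihood of
    the sample under child k.
    """
    joint = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def em_target(weights: Sequence[float], batch: Sequence[Sequence[float]]) -> List[float]:
    """Average responsibilities over a batch (the M-step target for the weights)."""
    sums = [0.0] * len(weights)
    for likelihoods in batch:
        for k, r in enumerate(responsibilities(weights, likelihoods)):
            sums[k] += r
    return [s / len(batch) for s in sums]


def full_batch_em_step(weights: Sequence[float], data: Sequence[Sequence[float]]) -> List[float]:
    """Standard EM: one update after a pass over the entire dataset."""
    return em_target(weights, data)


def mini_batch_em_step(
    weights: Sequence[float], batch: Sequence[Sequence[float]], step_size: float
) -> List[float]:
    """Heuristic mini-batch EM: move part of the way toward the batch target.

    The convex combination keeps the weights normalized. This rule is an
    illustrative assumption, not the paper's derived update.
    """
    target = em_target(weights, batch)
    return [(1.0 - step_size) * w + step_size * t for w, t in zip(weights, target)]


# Hypothetical per-sample child likelihoods for a two-child sum node.
data = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.2, 0.8]]
initial_weights = [0.5, 0.5]

full_update = full_batch_em_step(initial_weights, data)
mini_update = mini_batch_em_step(initial_weights, data[:2], step_size=0.5)
```

With `step_size=1.0` and the whole dataset as the batch, the mini-batch step reduces to the full-batch step, which is a quick sanity check on the sketch.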

Title: NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID

Authors: Shihao Li, Chenglong Li, Aihua Zheng, Andong Lu, Jin Tang, Jixin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20001
Pdf URL: https://arxiv.org/pdf/2505.20001
Copy Paste: [[2505.20001]] NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID(https://arxiv.org/abs/2505.20001)
Keywords: generation
Abstract: Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverages randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mine intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE) that focuses on capturing the holistic object structure across modalities and maintains inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.
摘要：多模式对象重新识别（REID）旨在在异质光谱模态上提取身份特征，以在复杂的现实世界情景中获得准确的识别和检索。但是，大多数现有的方法都依赖于隐式特征融合结构，因此很难在各种挑战性的条件下建模细粒度的识别策略。受益于多模式大语言模型（MLLM）的强大语义理解能力，可以将对象的视觉外观有效地转化为描述性文本。在本文中，我们提出了一种基于属性置信度的可靠的多模式字幕生成方法，该方法大大降低了多模式语义生成中MLLM的未知识别率，并提高了生成的文本质量。此外，我们将提出一个新颖的REID框架，即通过文本调制多模式对象重新识别专家的多层次混合物。具体而言，我们将识别问题分离为语义和结构专家分支，以分别捕获特定于模态的外观和内在结构。为了进行语义识别，我们提出了文本调制的语义采样专家（TMSE），该专家利用随机采样高质量的语义文本来调节多模式特征的专家特定于专家的采样，并采矿内模式内模式性细粒语义线索。然后，为了识别粗粒结构的特征，我们提出了上下文共享的结构感知专家（CSSE），该专家着重于跨模态捕获整体对象结构，并通过软路由机制保持模式间结构一致性。最后，我们提出了多模式特征聚合（MMFA），该特征聚集（MMFA）采用统一的特征融合策略，简单有效地将语义和结构专家输出整合到最终的身份表示中。

Title: TabPFN: One Model to Rule Them All?

Authors: Qiong Zhang, Yan Shuo Tan, Qinglong Tian, Pengfei Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20003
Pdf URL: https://arxiv.org/pdf/2505.20003
Copy Paste: [[2505.20003]] TabPFN: One Model to Rule Them All?(https://arxiv.org/abs/2505.20003)
Keywords: generation
Abstract: Hollmann et al. (Nature 637 (2025) 319-326) recently introduced TabPFN, a transformer-based deep learning model for regression and classification on tabular data, which they claim "outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time." Furthermore, they have called TabPFN a "foundation model" for tabular data, as it can support "data generation, density estimation, learning reusable embeddings and fine-tuning". If these statements are well-supported, TabPFN may have the potential to supersede existing modeling approaches on a wide range of statistical tasks, mirroring a similar revolution in other areas of artificial intelligence that began with the advent of large language models. In this paper, we provide a tailored explanation of how TabPFN works for a statistics audience, by emphasizing its interpretation as approximate Bayesian inference. We also provide more evidence of TabPFN's "foundation model" capabilities: We show that an out-of-the-box application of TabPFN vastly outperforms specialized state-of-the-art methods for semi-supervised parameter estimation, prediction under covariate shift, and heterogeneous treatment effect estimation. We further show that TabPFN can outperform LASSO at sparse regression and can break a robustness-efficiency trade-off in classification. All experiments can be reproduced using the code provided at this https URL.
摘要：Hollmann等。（Nature 637（2025）319-326）最近引入了TABPFN，这是一种基于变压器的深度学习模型，用于对表格数据进行回归和分类，他们声称“在数据集上胜过所有以前的方法，这些方法具有大量较大的差距，使用较少的培训时间，使用了大量的差距。”此外，他们将TABPFN称为表格数据的“基础模型”，因为它可以支持“数据生成，密度估计，可重复使用的嵌入和微调”。如果这些陈述得到很好的支持，TABPFN可能有可能取代广泛的统计任务上的现有建模方法，从而反映了人工智能其他领域的类似革命，这始于大型语言模型的出现。在本文中，我们通过强调其解释为近似贝叶斯推断，对TABPFN如何为统计受众提供了量身定制的解释。我们还提供了TABPFN的“基础模型”功能的更多证据：我们表明，TABPFN的开箱即用应用非常优于半监督参数估计的专业最先进方法，协方差转移的预测和异质治疗效果估计。我们进一步表明，TABPFN可以在稀疏回归时胜过套索，并且可以打破分类的稳健性效率折衷。所有实验都可以使用此HTTPS URL（此HTTPS URL）上提供的代码复制。

Title: Gradient Inversion Transcript: Leveraging Robust Generative Priors to Reconstruct Training Data from Gradient Leakage

Authors: Xinping Chen, Chen Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20026
Pdf URL: https://arxiv.org/pdf/2505.20026
Copy Paste: [[2505.20026]] Gradient Inversion Transcript: Leveraging Robust Generative Priors to Reconstruct Training Data from Gradient Leakage(https://arxiv.org/abs/2505.20026)
Keywords: generative
Abstract: We propose Gradient Inversion Transcript (GIT), a novel generative approach for reconstructing training data from leaked gradients. GIT employs a generative attack model, whose architecture is tailored to align with the structure of the leaked model based on theoretical analysis. Once trained offline, GIT can be deployed efficiently and only relies on the leaked gradients to reconstruct the input data, rendering it applicable under various distributed learning environments. When used as a prior for other iterative optimization-based methods, GIT not only accelerates convergence but also enhances the overall reconstruction quality. GIT consistently outperforms existing methods across multiple datasets and demonstrates strong robustness under challenging conditions, including inaccurate gradients, data distribution shifts and discrepancies in model parameters.
摘要：我们提出了梯度反演转录本（GIT），这是一种新的生成方法，用于重建泄漏梯度的训练数据。 GIT采用了生成攻击模型，其构造是根据理论分析量身定制的，以与泄漏模型的结构保持一致。一旦经过训练，可以有效地部署GIT，只依靠被泄漏的梯度来重建输入数据，并将其适用于各种分布式学习环境。当用作其他基于迭代优化的方法的先验时，Git不仅会加速收敛，还可以提高整体重建质量。 GIT始终优于多个数据集的现有方法，并在具有挑战性的条件下证明了强大的鲁棒性，包括不准确的梯度，数据分布变化和模型参数中的差异。

Title: Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay

Authors: Hongsong Wang, Ao Sun, Jie Gui, Liang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20049
Pdf URL: https://arxiv.org/pdf/2505.20049
Copy Paste: [[2505.20049]] Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay(https://arxiv.org/abs/2505.20049)
Keywords: generation
Abstract: Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on closed-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier's weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8% and 12.8% in terms of mean global accuracy, respectively. The code is available on this https URL.
摘要：手势识别是计算机视觉领域的重要研究领域。大多数手势识别工作都集中在近距离场景上，从而限制了有效处理看不见或新颖的手势的能力。我们旨在解决班级知识识别的识别，这需要随着时间的推移适应新的和以前看不见的手势的能力。具体而言，我们引入了一个原型引导的伪特征重播（PGPFR）框架，以进行无数据的类型识别。该框架包括四个组成部分：具有批处理原型（PFGBP）的伪特征生成，旧类的变异原型重播（VPR），新类的截断跨凝胶（TCE）以及持续的分类器重新训练（CCRT）。为了解决灾难性遗忘的问题，PFGBP以在线方式动态生成了多样的伪功能，利用旧课程的类原型以及新课程的批处理原型。此外，VPR在分类器的权重与旧类的原型之间的一致性，利用类原型和协方差矩阵来增强鲁棒性和概括能力。 TCE减轻了由伪特征引起的分类器域差异的影响。最后，CCRT培训策略旨在防止过度适合新课程，并确保从旧课程中提取的功能的稳定性。在两个广泛使用的手势识别数据集上进行的广泛实验，即SHREC 2017 3D和EgogeSture 3D，表明我们的方法在平均全球准确性方面分别优于现有的最新方法和11.8％和12.8％。该代码可在此HTTPS URL上使用。

Title: Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Authors: Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20053
Pdf URL: https://arxiv.org/pdf/2505.20053
Copy Paste: [[2505.20053]] Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion(https://arxiv.org/abs/2505.20053)
Keywords: generation, generative
Abstract: Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only a very small number of diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.
摘要：扩散模型已成为文本到图像生成的主流体系结构，在视觉质量和迅速的可控性方面取得了显着进步。但是，当前的推理管道通常缺乏在整个denoisis过程中的可解释的语义监督和校正机制。大多数现有的方法仅依赖于最终图像的事后评分，及时过滤或启发式重新采样策略，以便为纠正生成轨迹提供可行的指导。结果，模型通常会遭受对象混乱，空间错误，计数不准确和缺失的语义元素的困扰，严重损害了及时图像对齐和图像质量。为了应对这些挑战，我们提出了MLLM语义校正后的乒乓球扩散（PPAD），这是一个新颖的框架，该框架首次引入了推断期间的多模式大语言模型（MLLM）作为语义观察者。 PPAD对中级世代进行实时分析，确定潜在的语义不一致，并将反馈转化为可控的信号，以积极指导其余的剥离步骤。该框架支持仅推理和训练增强的设置，并仅在极少数的扩散步骤中执行语义校正，提供了强烈的通用性和可扩展性。广泛的实验证明了PPAD的重大改进。

Title: PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation

Authors: Hongsong Wang, Yin Zhu, Qiuxia Lai, Yang Zhang, Guo-Sen Xie, Xin Geng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20056
Pdf URL: https://arxiv.org/pdf/2505.20056
Copy Paste: [[2505.20056]] PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation(https://arxiv.org/abs/2505.20056)
Keywords: generation
Abstract: Computational dance generation is crucial in many areas, such as art, human-computer interaction, virtual reality, and digital entertainment, particularly for generating coherent and expressive long dance sequences. Diffusion-based music-to-dance generation has made significant progress, yet existing methods still struggle to produce physically plausible motions. To address this, we propose Plausibility-Aware Motion Diffusion (PAMD), a framework for generating dances that are both musically aligned and physically realistic. The core of PAMD lies in the Plausible Motion Constraint (PMC), which leverages Neural Distance Fields (NDFs) to model the actual pose manifold and guide generated motions toward a physically valid pose manifold. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG), which uses standing poses as auxiliary conditions alongside music features. To further enhance realism for complex movements, we introduce the Motion Refinement with Foot-ground Contact (MRFC) module, which addresses foot-skating artifacts by bridging the gap between the optimization objective in linear joint position space and the data representation in nonlinear rotation space. Extensive experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions. This project page is available at: this https URL.
摘要：计算舞蹈产生在许多领域至关重要，例如艺术，人类计算机互动，虚拟现实和数字娱乐，尤其是为了产生连贯和表现力的长舞蹈序列。基于扩散的音乐到舞蹈的产生取得了重大进展，但现有的方法仍然难以产生物理上合理的动作。为了解决这个问题，我们提出了合理的感知运动扩散（PAMD），这是一个框架，用于产生既有音乐在音乐上且在物理上逼真的舞蹈的舞蹈。 PAMD的核心在于合理的运动约束（PMC），它利用神经距离场（NDFS）对实际姿势歧管进行建模，并引导引起对物理有效的姿势歧管产生的运动。为了在一代人期间提供更有效的指导，我们合并了先前的运动指导（PMG），该指南将站立姿势作为辅助条件以及音乐功能。为了进一步增强复杂运动的现实主义，我们通过脚地面接触（MRFC）模块引入了运动改进，该模块通过弥合线性关节位置空间中的优化目标与非线性旋转空间中的数据表示之间的差距来解决脚步伪像。广泛的实验表明，PAMD显着改善了音乐对齐方式并提高了产生动作的物理合理性。此项目页面可用：此HTTPS URL。

Title: From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20106
Pdf URL: https://arxiv.org/pdf/2505.20106
Copy Paste: [[2505.20106]] From Data to Modeling: Fully Open-vocabulary Scene Graph Generation(https://arxiv.org/abs/2505.20106)
Keywords: generation
Abstract: We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
摘要：我们提出了OVSGTR，这是一种基于变压器的新型框架，用于完全开放式视频范围的场景图生成，它克服了传统封闭式模型的局限性。常规方法将对象和关系识别限制为固定的词汇，从而阻碍了它们对新颖概念经常出现的现实情况的适用性。相反，我们的方法共同预测对象（节点）及其相互关系（边缘）超出预定义的类别。 OVSGTR利用类似DITR的架构具有冷冻图像骨干和文本编码器来提取高质量的视觉和语义特征，然后通过变压器解码器融合来进行端到端的场景图表预测。为了丰富模型对复杂的视觉关系的理解，我们提出了一种关系感知的预训练策略，该策略以弱监督的方式综合了场景图注释。具体而言，我们研究了三个管道 - 基于Scene Parser，基于LLM和多模式LLM的管道 - 以最少的手动注释生成可转移的监督信号。此外，我们通过结合视觉概念概念保留机制以及知识蒸馏策略，确保模型在微调过程中保留丰富的语义提示，从而解决了在开放式摄影环境中灾难性遗忘的常见问题。对VG150基准测试的广泛实验表明，OVSGTR在多个设置中实现了最先进的性能，包括基于封闭式的，开放式对象检测，基于关系，基于关系和完全开放式摄入量的场景。我们的结果突出了大规模关系感知的预训练和变压器体系结构的希望，以将场景图生成发展到更具概括和可靠的视觉理解。

Title: Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Authors: Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Lefei Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20107
Pdf URL: https://arxiv.org/pdf/2505.20107
Copy Paste: [[2505.20107]] Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning(https://arxiv.org/abs/2505.20107)
Keywords: generation
Abstract: Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.

Title: Proxy-Free GFlowNet

Authors: Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20110
Pdf URL: https://arxiv.org/pdf/2505.20110
Copy Paste: [[2505.20110]] Proxy-Free GFlowNet(https://arxiv.org/abs/2505.20110)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) are a promising class of generative models designed to sample diverse, high-reward structures by modeling distributions over compositional objects. In many real-world applications, obtaining the reward function for such objects is expensive, time-consuming, or requires human input, making it necessary to train GFlowNets from historical datasets. Most existing methods adopt a model-based approach, learning a proxy model from the dataset to approximate the reward function. However, this strategy inherently ties the quality of the learned policy to the accuracy of the proxy, introducing additional complexity and uncertainty into the training process. To overcome these limitations, we propose Trajectory-Distilled GFlowNet (TD-GFN), a proxy-free training framework that eliminates the need for out-of-dataset reward queries. Our method is motivated by the key observation that different edges in the associated directed acyclic graph (DAG) contribute unequally to effective policy learning. TD-GFN leverages inverse reinforcement learning to estimate edge-level rewards from the offline dataset, which are then used to ingeniously prune the DAG and guide backward trajectory sampling during training. This approach directs the policy toward high-reward regions while reducing the complexity of model fitting. Empirical results across multiple tasks show that TD-GFN trains both efficiently and reliably, significantly outperforming existing baselines in convergence speed and sample quality.

Title: MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Authors: Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20122
Pdf URL: https://arxiv.org/pdf/2505.20122
Copy Paste: [[2505.20122]] MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models(https://arxiv.org/abs/2505.20122)
Keywords: generation
Abstract: This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.

Title: Understanding Generalization in Diffusion Models via Probability Flow Distance

Authors: Huijie Zhang, Zijian Huang, Siyi Chen, Jinfan Zhou, Zekai Zhang, Peng Wang, Qing Qu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20123
Pdf URL: https://arxiv.org/pdf/2505.20123
Copy Paste: [[2505.20123]] Understanding Generalization in Diffusion Models via Probability Flow Distance(https://arxiv.org/abs/2505.20123)
Keywords: generative
Abstract: Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality samples that generalize beyond the training data. However, evaluating this generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance (PFD), a theoretically grounded and computationally efficient metric to measure distributional generalization. Specifically, PFD quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Moreover, by using PFD under a teacher-student evaluation protocol, we empirically uncover several key generalization behaviors in diffusion models, including: (1) scaling behavior from memorization to generalization, (2) early learning and double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for future empirical and theoretical studies on generalization in diffusion models.

Title: Agentic 3D Scene Generation with Spatially Contextualized VLMs

Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2505.20129
Pdf URL: https://arxiv.org/pdf/2505.20129
Copy Paste: [[2505.20129]] Agentic 3D Scene Generation with Spatially Contextualized VLMs(https://arxiv.org/abs/2505.20129)
Keywords: restoration, generation
Abstract: Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.

Title: FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Authors: Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20147
Pdf URL: https://arxiv.org/pdf/2505.20147
Copy Paste: [[2505.20147]] FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities(https://arxiv.org/abs/2505.20147)
Keywords: generation
Abstract: The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

Title: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Authors: Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.20152
Pdf URL: https://arxiv.org/pdf/2505.20152
Copy Paste: [[2505.20152]] Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models(https://arxiv.org/abs/2505.20152)
Keywords: generation
Abstract: Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our strong negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further study the impact of different negative sample construction methods and the number of negative samples on the geometric reasoning performance of LMM, yielding fruitful conclusions. The code and dataset are available at this https URL.

Title: Fine-grained List-wise Alignment for Generative Medication Recommendation

Authors: Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20218
Pdf URL: https://arxiv.org/pdf/2505.20218
Copy Paste: [[2505.20218]] Fine-grained List-wise Alignment for Generative Medication Recommendation(https://arxiv.org/abs/2505.20218)
Keywords: generation, generative
Abstract: Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at this https URL.
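The FLAME abstract mentions potential-based reward shaping over a process that adds or removes one drug per step. The abstract does not give the potential function, so the following is only a minimal sketch of the generic shaping rule r + gamma * phi(s') - phi(s). The interaction set `DDI_PAIRS`, the drug names, and the `potential` function are hypothetical placeholders, not the authors' definitions.

```python
from itertools import combinations

# Hypothetical interaction pairs; the abstract does not specify FLAME's potential.
DDI_PAIRS = {frozenset({"drug_a", "drug_b"})}


def potential(drugs):
    """Toy potential: minus the number of interacting pairs in the drug list."""
    pairs = combinations(sorted(drugs), 2)
    return -sum(frozenset(pair) in DDI_PAIRS for pair in pairs)


def shaped_reward(reward, state, next_state, gamma=1.0):
    """Generic potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


# Adding an interacting drug is penalized at the step where it is added,
# and removing it later earns the penalty back.
penalty = shaped_reward(0.0, {"drug_a"}, {"drug_a", "drug_b"})
refund = shaped_reward(0.0, {"drug_a", "drug_b"}, {"drug_a"})
```

With gamma set to 1, the shaping terms telescope along a trajectory, so the total added reward depends only on the first and last drug lists. This is why the shaping can give per-step signals without changing which final prescription is preferred.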

Title: Multimodal Federated Learning With Missing Modalities through Feature Imputation Network

Authors: Pranav Poudel, Aavash Chhetri, Prashnna Gyawali, Georgios Leontidis, Binod Bhattarai
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20232
Pdf URL: https://arxiv.org/pdf/2505.20232
Copy Paste: [[2505.20232]] Multimodal Federated Learning With Missing Modalities through Feature Imputation Network(https://arxiv.org/abs/2505.20232)
Keywords: generative
Abstract: Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. In experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings, our method consistently improves the performance of competitive baselines. The code and implementation details are available at: this https URL

Title: AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models

Authors: Muyao Niu, Mingdeng Cao, Yifan Zhan, Qingtian Zhu, Mingze Ma, Jiancheng Zhao, Yanhong Zeng, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20255
Pdf URL: https://arxiv.org/pdf/2505.20255
Copy Paste: [[2505.20255]] AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models(https://arxiv.org/abs/2505.20255)
Keywords: restoration
Abstract: Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce AniCrafter, a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative "avatar-background" conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Code will be available at this https URL.

Title: In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

Authors: Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2505.20271
Pdf URL: https://arxiv.org/pdf/2505.20271
Copy Paste: [[2505.20271]] In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation(https://arxiv.org/abs/2505.20271)
Keywords: generation
Abstract: Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and to align results with the user's intent as expressed through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the subject into the target image in a way that follows the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head, which dynamically shifts attention outputs to reflect the desired subject semantics, and inter-head "attention reweighting" across different heads, which amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.

Title: ImgEdit: A Unified Image Editing Dataset and Benchmark

Authors: Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20275
Pdf URL: https://arxiv.org/pdf/2505.20275
Copy Paste: [[2505.20275]] ImgEdit: A Unified Image Editing Dataset and Benchmark(https://arxiv.org/abs/2505.20275)
Keywords: generation, generative
Abstract: Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, and a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model that uses a vision-language model to process the reference image and editing prompt, and which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and the model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic test suite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on this https URL.

Title: MotionPro: A Precise Motion Controller for Image-to-Video Generation

Authors: Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20287
Pdf URL: https://arxiv.org/pdf/2505.20287
Copy Paste: [[2505.20287]] MotionPro: A Precise Motion Controller for Image-to-Video Generation(https://arxiv.org/abs/2505.20287)
Keywords: generation
Abstract: Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as the condition without explicitly defining the movement region, leading to coarse motion control and failing to disentangle object motion from camera motion. To alleviate these issues, we present MotionPro, a precise motion controller that leverages region-wise trajectories and a motion mask to regulate fine-grained motion synthesis and identify the target motion category (i.e., object or camera motion), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories to simulate the inference scenario. Instead of extending flow through large Gaussian kernels, our region-wise trajectory approach enables more precise control by directly utilizing trajectories within local regions, thereby effectively characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both the region-wise trajectories and the motion mask through feature modulation. More remarkably, we meticulously construct a benchmark, MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: this https URL.

Title: Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Authors: Guangting Zheng, Yehao Li, Yingwei Pan, Jiajun Deng, Ting Yao, Yanyong Zhang, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20288
Pdf URL: https://arxiv.org/pdf/2505.20288
Copy Paste: [[2505.20288]] Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots(https://arxiv.org/abs/2505.20288)
Keywords: generation, generative
Abstract: Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next-token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context, especially for early token prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few image tokens in low resolution, which function as intermediary pivots that reflect global structure. Such pivots act as additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring lower computational cost. Code is available at this https URL.

Title: Visualized Text-to-Image Retrieval

Authors: Di Wu, Yixin Wan, Kai-Wei Chang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.20291
Pdf URL: https://arxiv.org/pdf/2505.20291
Copy Paste: [[2505.20291]] Visualized Text-to-Image Retrieval(https://arxiv.org/abs/2505.20291)
Keywords: generation
Abstract: We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at this https URL.
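The VisRet abstract reports its gains in NDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@k is commonly computed. It is not the authors' evaluation code: it assumes the linear-gain variant, and it assumes the ideal ordering is obtained by re-sorting the relevance labels of the same ranked list. Both are illustrative choices.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items, so the score is defined as zero here
    return dcg_at_k(relevances, k) / ideal_dcg


# Relevance labels in ranked order: the single relevant image sits at rank 2.
score = ndcg_at_k([0, 1, 0])
```

A ranking that places every relevant item first scores 1.0, and the score drops as relevant items move further down the top-k list.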

Title: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Authors: Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, Li Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20292
Pdf URL: https://arxiv.org/pdf/2505.20292
Copy Paste: [[2505.20292]] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation(https://arxiv.org/abs/2505.20292)
Keywords: generation
Abstract: Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
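Summary: Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing greater flexibility in video production. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench, which focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. To this end, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, covering both real and synthetic test data. Furthermore, to align the S2V benchmark accurately with human preferences, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, which separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. In addition, we create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in the dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.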

Title: DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.20297
Pdf URL: https://arxiv.org/pdf/2505.20297
Copy Paste: [[2505.20297]] DiSA: Diffusion Step Annealing in Autoregressive Image Generation(https://arxiv.org/abs/2505.20297)
Keywords: generation
Abstract: An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
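Summary: An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because diffusion usually takes 50 to 100 steps to sample a single token. This paper explores how to address this issue effectively. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has already generated part of a dog, the remaining tokens must complete the dog and are therefore more constrained. Empirical evidence supports this motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method that gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from a finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models and, although simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining generation quality.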
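The step-annealing idea described above can be illustrated with a small schedule function. This is only a sketch under stated assumptions: the abstract gives the endpoints (50 steps early, 5 steps late) but not the shape of the schedule, so the linear interpolation below, the function name `annealed_steps`, and its parameters are hypothetical and are not the authors' implementation.

```python
def annealed_steps(tokens_done: int, total_tokens: int,
                   start_steps: int = 50, end_steps: int = 5) -> int:
    """Number of diffusion steps to use for the next token.

    Assumption: a linear decay from start_steps to end_steps as the
    fraction of already-generated tokens goes from 0 to 1.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    # Clamp progress to [0, 1] so out-of-range inputs stay within the endpoints.
    progress = min(max(tokens_done / total_tokens, 0.0), 1.0)
    return round(start_steps + (end_steps - start_steps) * progress)


# Steps used at a few points of a hypothetical 256-token generation.
schedule = [annealed_steps(i, 256) for i in (0, 64, 128, 192, 256)]
```

In an autoregressive sampling loop, a function like this would be called once per token (or per group of tokens) to choose how many denoising steps the diffusion head runs, so later tokens cost less than early ones.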