2025-05-20

Title: Tool-Aided Evolutionary LLM for Generative Policy Toward Efficient Resource Management in Wireless Federated Learning

Authors: Chongyang Tan, Ruoqi Wen, Rongpeng Li, Zhifeng Zhao, Ekram Hossain, Honggang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11570
Pdf URL: https://arxiv.org/pdf/2505.11570
Copy Paste: [[2505.11570]] Tool-Aided Evolutionary LLM for Generative Policy Toward Efficient Resource Management in Wireless Federated Learning(https://arxiv.org/abs/2505.11570)
Keywords: generative
Abstract: Federated Learning (FL) enables distributed model training across edge devices in a privacy-friendly manner. However, its efficiency heavily depends on effective device selection and high-dimensional resource allocation in dynamic and heterogeneous wireless environments. Conventional methods demand a confluence of domain-specific expertise, extensive hyperparameter tuning, and/or heavy interaction cost. This paper proposes a Tool-aided Evolutionary Large Language Model (T-ELLM) framework to generate a qualified policy for device selection in a wireless FL environment. Unlike conventional optimization methods, T-ELLM leverages natural language-based scenario prompts to enhance generalization across varying network conditions. The framework decouples the joint optimization problem mathematically, enabling tractable learning of device selection policies while delegating resource allocation to convex optimization tools. To improve adaptability, T-ELLM integrates a sample-efficient, model-based virtual learning environment that captures the relationship between device selection and learning performance, facilitating subsequent group relative policy optimization. This concerted approach reduces reliance on real-world interactions, minimizing communication overhead while maintaining high-fidelity decision-making. Theoretical analysis proves that the discrepancy between virtual and real environments is bounded, ensuring the advantage function learned in the virtual environment maintains a provably small deviation from real-world conditions. Experimental results demonstrate that T-ELLM outperforms benchmark methods in energy efficiency and exhibits robust adaptability to environmental changes.
摘要：联合学习（FL）以隐私友好的方式启用了跨边缘设备的分布式模型培训。但是，它的效率在很大程度上取决于动态和异质无线环境中的有效设备选择和高维资源分配。常规方法需要特定于领域的专业知识，广泛的超参数调整和/或繁重的交互成本的汇合。本文提出了一个工具辅助进化的大语言模型（T-ELLM）框架，以在无线FL环境中生成用于设备选择的合格策略。与常规的优化方法不同，T-ELLM利用基于自然语言的场景提示在不同的网络条件上增强概括。该框架以数学方式将联合优化问题解密，从而使对设备选择策略的学习能够学习，同时将资源分配委派给了凸优化工具。为了提高适应性，T-ELLM集成了一个基于样本的，基于模型的虚拟学习环境，该环境捕获了设备选择与学习性能之间的关系，从而促进了随后的小组相对策略优化。这种一致的方法减少了对现实世界互动的依赖，从而最大程度地减少了沟通的开销，同时保持了高保真的决策。理论分析证明，虚拟环境和真实环境之间的差异是有限的，确保在虚拟环境中学到的优势函数使事实证明与现实世界的条件保持了很小的偏差。实验结果表明，T-ELLM在能源效率方面的表现优于基准方法，并表现出对环境变化的强大适应性。
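
Illustration (not from the paper): the abstract above mentions group relative policy optimization. The core idea of that family of methods is to score each sampled candidate relative to the other candidates drawn for the same prompt, rather than against a learned value baseline. The sketch below shows only that normalization step. The function name, the example rewards, and the choice of population standard deviation are assumptions for illustration; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is an assumption of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three candidate device-selection policies
# sampled for the same scenario prompt (higher is better).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Candidates above the group mean receive positive advantages and those below receive negative ones, so the update favors the better members of each group.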

Title: Concept-Guided Interpretability via Neural Chunking

Authors: Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.11576
Pdf URL: https://arxiv.org/pdf/2505.11576
Copy Paste: [[2505.11576]] Concept-Guided Interpretability via Neural Chunking(https://arxiv.org/abs/2505.11576)
Keywords: generation
Abstract: Neural networks are often black boxes, reflecting the significant challenge of understanding their internal workings. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage cognitively-inspired methods of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Discrete sequence chunking (DSC) creates a dictionary of entities; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting entities across varying model sizes, ranging from inducing compositionality in RNNs to uncovering recurring neural population states in large models with diverse architectures, and illustrate their advantage over other methods. Throughout, we observe a robust correspondence between the extracted entities and concrete or abstract concepts. Artificially inducing the extracted entities in neural populations effectively alters the network's generation of associated concepts. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
摘要：神经网络通常是黑匣子，反映了了解其内部工作的重大挑战。我们提出了一种不同的观点，挑战了普遍的观点：神经网络在其原始人口活动中表现出模式，反映了培训数据中的规律性。我们将其称为反思假设，并在简单的复发神经网络（RNN）和复杂的大语言模型（LLMS）中为这种现象提供了证据。在这种见解的基础上，我们建议利用认知启发的分解方法将高维神经人口动态分为反映基本概念的可解释单元。我们提出了三种方法来提取这些新兴实体，并根据标签可用性和维度相互补充。离散序列块（DSC）创建了实体词典；人口平均（PA）提取与已知标签相对应的反复出现的实体；当缺乏标签时，可以使用无监督的块发现（UCD）。我们证明了这些方法在跨不同模型大小中提取实体的有效性，从诱导RNN的组成性到在具有不同架构的大型大型模型中发现重复的神经种群状态，并说明了它们比其他方法的优势。在整个过程中，我们都观察到提取的实体与具体或抽象概念之间的强大对应关系。人为地诱导神经种群中提取的实体有效地改变了网络的相关概念的产生。我们的工作指向了可解释性的新方向，该方向既利用认知原则，又可以揭示自然主义数据的结构来揭示复杂学习系统的隐藏计算，从而逐渐将它们从黑匣子转变为我们可以开始理解的系统。

Title: Spatiotemporal Field Generation Based on Hybrid Mamba-Transformer with Physics-informed Fine-tuning

Authors: Peimian Du, Jiabin Liu, Xiaowei Jin, Mengwang Zuo, Hui Li
Subjects: cs.LG, cs.AI, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2505.11578
Pdf URL: https://arxiv.org/pdf/2505.11578
Copy Paste: [[2505.11578]] Spatiotemporal Field Generation Based on Hybrid Mamba-Transformer with Physics-informed Fine-tuning(https://arxiv.org/abs/2505.11578)
Keywords: generation
Abstract: This research confronts the challenge of substantial physical equation discrepancies encountered in the generation of spatiotemporal physical fields through data-driven trained models. A spatiotemporal physical field generation model, named HMT-PF, is developed based on the hybrid Mamba-Transformer architecture, incorporating unstructured grid information as input. A fine-tuning block, enhanced with physical information, is introduced to effectively reduce the physical equation discrepancies. The physical equation residuals are computed through a point query mechanism for efficient gradient evaluation, then encoded into latent space for refinement. The fine-tuning process employs a self-supervised learning approach to achieve physical consistency while maintaining essential field characteristics. Results show that the hybrid Mamba-Transformer model achieves good performance in generating spatiotemporal fields, while the physics-informed fine-tuning mechanism further reduces significant physical errors effectively. An MSE-R evaluation method is developed to assess the accuracy and realism of physical field generation.
摘要：这项研究面临着通过数据驱动的训练模型产生的时空物理领域遇到的实质性物理方程差异的挑战。时空物理场生成模型，名为HMT-PF，是基于混合Mamba-Transformer架构开发的，将非结构化的网格信息作为输入。引入了通过物理信息增强的微调块，以有效地减少物理方程式差异。通过点查询机制计算物理方程残差，以进行有效的梯度评估，然后编码为潜在空间进行细化。微调过程采用一种自学的学习方法来实现身体一致性，同时保持基本的现场特征。结果表明，混合MAMBA转化器模型在产生时空场中实现了良好的性能，而物理学的微调机制进一步有效地降低了重大的物理错误。开发了一种MSE-R评估方法来评估物理场产生的准确性和现实性。

Title: Flash Invariant Point Attention

Authors: Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.11580
Pdf URL: https://arxiv.org/pdf/2505.11580
Copy Paste: [[2505.11580]] Flash Invariant Point Attention(https://arxiv.org/abs/2505.11580)
Keywords: generative
Abstract: Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at this https URL.
摘要：不变点注意（IPA）是结构生物学中几何学建模的关键算法，这是许多蛋白质和RNA模型的中心。但是，其二次复杂性限制了输入序列长度。我们介绍了Flashipa，这是对IPA的分解重新印象，利用硬件有效的FlashContention在GPU内存中实现线性缩放，并具有序列长度。 Flashipa匹配或超过标准的IPA性能，同时大大降低了计算成本。 Flashipa将训练扩展到以前无法实现的长度，我们通过重新训练生成模型而无需限制并生成数千个残基的结构来证明这一点。 Flashipa可在此HTTPS URL上找到。

Title: Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search

Authors: Rui Liu, Rui Xie, Zijun Yao, Yanjie Fu, Dongjie Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11601
Pdf URL: https://arxiv.org/pdf/2505.11601
Copy Paste: [[2505.11601]] Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search(https://arxiv.org/abs/2505.11601)
Keywords: generative
Abstract: Feature selection removes redundant features to enhance performance and computational efficiency in downstream tasks. Existing works often struggle to capture complex feature interactions and adapt to diverse scenarios. Recent advances in this domain have incorporated generative intelligence to address these drawbacks by uncovering intricate relationships between features. However, two key limitations remain: 1) embedding feature subsets in a continuous space is challenging due to permutation sensitivity, as changes in feature order can introduce biases and weaken the embedding learning process; 2) gradient-based search in the embedding space assumes convexity, which is rarely guaranteed, leading to reduced search effectiveness and suboptimal subsets. To address these limitations, we propose a new framework that can: 1) preserve feature subset knowledge in a continuous embedding space while ensuring permutation invariance; 2) effectively explore the embedding space without relying on strong convex assumptions. For the first objective, we develop an encoder-decoder paradigm to preserve feature selection knowledge into a continuous embedding space. This paradigm captures feature interactions through pairwise relationships within the subset, removing the influence of feature order on the embedding. Moreover, an inducing point mechanism is introduced to accelerate pairwise relationship computations. For the second objective, we employ a policy-based reinforcement learning (RL) approach to guide the exploration of the embedding space. The RL agent effectively navigates the space by balancing multiple objectives. By prioritizing high-potential regions adaptively and eliminating the reliance on convexity assumptions, the RL agent effectively reduces the risk of converging to local optima. Extensive experiments demonstrate the effectiveness, efficiency, robustness and explicitness of our model.
摘要：功能选择删除了冗余功能，以提高下游任务的性能和计算效率。现有作品通常难以捕获复杂的特征交互并适应各种情况。该领域的最新进展结合了生成智能，以通过发现特征之间的复杂关系来解决这些缺点。但是，仍然存在两个关键的局限性：1）由于置换敏感性，将特征子集嵌入连续空间是具有挑战性的，因为功能顺序的变化会引入偏见并削弱嵌入学习过程； 2）在嵌入空间中基于梯度的搜索假设凸度很少得到保证，从而降低了搜索效果和次优亚集。为了解决这些限制，我们提出了一个可以：1）在连续嵌入空间中保留特征子集知识的新框架，同时确保排列不变性； 2）有效地探索嵌入空间，而无需依靠强凸假设。对于第一个目标，我们开发了一个编码器范式，以将特征选择知识保存到连续的嵌入空间中。该范式通过子集中的成对关系捕获特征相互作用，从而消除了特征顺序对嵌入的影响。此外，引入了诱导点机制来加速成对关系计算。对于第二个目标，我们采用基于政策的增强学习（RL）方法来指导嵌入空间的探索。 RL代理通过平衡多个目标来有效地导航空间。通过适应高潜力区域的优先级，并消除对凸度假设的依赖，RL代理有效地降低了融合到局部Optima的风险。广泛的实验证明了我们模型的有效性，效率，鲁棒性和明确性。

Title: Enhancing Network Anomaly Detection with Quantum GANs and Successive Data Injection for Multivariate Time Series

Authors: Wajdi Hammami, Soumaya Cherkaoui, Shengrui Wang
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2505.11631
Pdf URL: https://arxiv.org/pdf/2505.11631
Copy Paste: [[2505.11631]] Enhancing Network Anomaly Detection with Quantum GANs and Successive Data Injection for Multivariate Time Series(https://arxiv.org/abs/2505.11631)
Keywords: generative
Abstract: Quantum computing may offer new approaches for advancing machine learning, including in complex tasks such as anomaly detection in network traffic. In this paper, we introduce a quantum generative adversarial network (QGAN) architecture for multivariate time-series anomaly detection that leverages variational quantum circuits (VQCs) in combination with a time-window shifting technique, data re-uploading, and successive data injection (SuDaI). The method encodes multivariate time series data as rotation angles. By integrating both data re-uploading and SuDaI, the approach maps classical data into quantum states efficiently, helping to address hardware limitations such as the restricted number of available qubits. In addition, the approach employs an anomaly scoring technique that utilizes both the generator and the discriminator output to enhance the accuracy of anomaly detection. The QGAN was trained using the parameter shift rule and benchmarked against a classical GAN. Experimental results indicate that the quantum model achieves high accuracy along with high recall and F1-scores in anomaly detection, and attains a lower MSE compared to the classical model. Notably, the QGAN accomplishes this performance with only 80 parameters, demonstrating competitive results with a compact architecture. Tests using a noisy simulator suggest that the approach remains effective under realistic noise-prone conditions.
摘要：量子计算可能会提供用于推进机器学习的新方法，包括在复杂的任务中，例如网络流量中的异常检测。在本文中，我们引入了用于多元时间序列异常检测的量子生成对抗网络（QGAN）架构，该检测利用变异量子电路（VQC）以及时间抛窗的变化技术，数据重新计划和连续数据注入（SUDAI）结合使用。该方法将多元时间序列数据编码为旋转角度。通过集成数据重新上传和Sudai，该方法有效地将经典数据映射到量子状态中，有助于解决硬件限制，例如可用数量的限制数量。此外，该方法采用一种异常评分技术，该技术同时利用发电机和鉴别器输出来增强异常检测的准确性。使用参数偏移规则对QGAN进行了训练，并针对经典的gan进行了基准测试。实验结果表明，量子模型在异常检测中达到高度召回和F1得分的精度，与经典模型相比，MSE较低。值得注意的是，QGAN仅使用80个参数实现了这一性能，并通过紧凑的体系结构展示了竞争结果。使用嘈杂的模拟器进行测试表明，在易于逼真的噪声条件下，该方法仍然有效。
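
Illustration (not from the paper): the abstract above says the method uses time-window shifting and encodes time series values as rotation angles. The sketch below shows one plausible reading of those two steps for a single univariate channel: cut the series into sliding windows, min-max scale each value into [0, pi], and interpret it as the angle of a single-qubit RY rotation. The function names, the [0, pi] range, and the RY gate choice are assumptions of this sketch; the paper's actual circuit, re-uploading, and SuDaI steps are not reproduced here.

```python
import math


def sliding_windows(series, width, stride=1):
    """Return all length-`width` windows, shifted by `stride` samples."""
    return [series[i:i + width]
            for i in range(0, len(series) - width + 1, stride)]


def to_rotation_angles(window, lo, hi):
    """Min-max scale values from [lo, hi] to angles in [0, pi] (assumed range)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in window]
    # Clip first so out-of-range samples still map into [0, pi].
    return [math.pi * (min(max(x, lo), hi) - lo) / span for x in window]


def ry_state(theta):
    """Amplitudes of RY(theta) applied to |0>: (cos(theta/2), sin(theta/2))."""
    return (math.cos(theta / 2.0), math.sin(theta / 2.0))


# Hypothetical traffic measurements for one feature.
series = [0.0, 5.0, 10.0, 5.0]
for window in sliding_windows(series, width=3):
    angles = to_rotation_angles(window, lo=0.0, hi=10.0)
    print([ry_state(a) for a in angles])
```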

Title: The Gaussian-Multinoulli Restricted Boltzmann Machine: A Potts Model Extension of the GRBM

Authors: Nikhil Kapasi, William Whitehead, Luke Theogarajan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11635
Pdf URL: https://arxiv.org/pdf/2505.11635
Copy Paste: [[2505.11635]] The Gaussian-Multinoulli Restricted Boltzmann Machine: A Potts Model Extension of the GRBM(https://arxiv.org/abs/2505.11635)
Keywords: generative
Abstract: Many real-world tasks, from associative memory to symbolic reasoning, demand discrete, structured representations that standard continuous latent models struggle to express naturally. We introduce the Gaussian-Multinoulli Restricted Boltzmann Machine (GM-RBM), a generative energy-based model that extends the Gaussian-Bernoulli RBM (GB-RBM) by replacing binary hidden units with $q$-state Potts variables. This modification enables a combinatorially richer latent space and supports learning over multivalued, interpretable latent concepts. We formally derive GM-RBM's energy function, learning dynamics, and conditional distributions, showing that it preserves tractable inference and training through contrastive divergence. Empirically, we demonstrate that GM-RBMs model complex multimodal distributions more effectively than binary RBMs, outperforming them on tasks involving analogical recall and structured memory. Our results highlight GM-RBMs as a scalable framework for discrete latent inference with enhanced expressiveness and interoperability.
摘要：从关联记忆到象征性推理的许多现实世界任务，都要求标准连续模型努力自然表达的离散，结构化表示。我们介绍了高斯 - 穆尔图洛利限制的玻尔兹曼机器（GM-RBM），这是一种基于生成能量的模型，通过用$ q $ $ $ state potts potts变量替换二进制隐藏单元，扩展了高斯 - 伯努利rbm（GB-RBM）。这种修改使组合具有更丰富的潜在空间，并支持对多价，可解释的潜在概念的学习。我们正式得出GM-RBM的能量功能，学习动力学和条件分布，表明它通过对比差异可以保留可拖动的推理和训练。从经验上讲，我们证明GM-RBMS比二进制RBM更有效地模型复杂的多模式分布，在涉及类似召回和结构化记忆的任务上表现出色。我们的结果突出显示了GM-RBM作为可扩展的框架，用于具有增强的表现力和互操作性的离散潜在推断。
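
Illustration (not from the paper): the abstract above replaces binary hidden units with q-state Potts variables. One consequence is that the conditional distribution of a hidden unit given the visible layer becomes a categorical (softmax) distribution over q states instead of a sigmoid over two. The sketch below assumes a simple parametrization in which each state k of hidden unit j has its own weight row and bias, and the visible variances are set to 1. The names `W_j` and `c_j` and this exact energy form are assumptions of the sketch, not the paper's derivation.

```python
import math


def potts_hidden_conditional(v, W_j, c_j):
    """Return p(h_j = k | v) for k = 0..q-1 under the assumed parametrization.

    v    : list of visible values
    W_j  : q rows of weights, one row per Potts state (hypothetical layout)
    c_j  : q biases, one per Potts state
    """
    logits = [c + sum(w * x for w, x in zip(row, v))
              for row, c in zip(W_j, c_j)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


# A hidden unit with q = 3 states and two visible units.
probs = potts_hidden_conditional(
    v=[1.0, -1.0],
    W_j=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    c_j=[0.0, 0.0, 0.0],
)
print(probs)
```

With q = 2 and one state's weights fixed to zero, this expression reduces to the usual sigmoid conditional of a binary hidden unit.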

Title: BandRC: Band Shifted Raised Cosine Activated Implicit Neural Representations

Authors: Pandula Thennakoon, Avishka Ranasinghe, Mario De Silva, Buwaneka Epakanda, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11640
Pdf URL: https://arxiv.org/pdf/2505.11640
Copy Paste: [[2505.11640]] BandRC: Band Shifted Raised Cosine Activated Implicit Neural Representations(https://arxiv.org/abs/2505.11640)
Keywords: super-resolution
Abstract: In recent years, implicit neural representations (INRs) have gained popularity in the computer vision community, mainly due to their strong performance in many computer vision tasks. These networks can extract a continuous signal representation given a discrete signal representation. Previous studies have repeatedly shown that INR performance correlates strongly with the activation functions used in its multilayer perceptrons. Although numerous competitive activation functions have been proposed, they share a common set of challenges: spectral bias (lack of sensitivity to high-frequency content in signals), limited robustness to signal noise, difficulty in simultaneously capturing both local and global features, and the requirement for manual parameter tuning. To address these issues, we introduce a novel activation function, Band Shifted Raised Cosine Activated Implicit Neural Networks (BandRC), tailored to further enhance signal representation capacity. We also incorporate deep prior knowledge extracted from the signal to adjust the activation functions through a task-specific model. Through a mathematical analysis and a series of experiments which include image reconstruction (with a +8.93 dB PSNR improvement over the nearest counterpart), denoising (with a +0.46 dB increase in PSNR), super-resolution (with a +1.03 dB improvement over the nearest state-of-the-art (SOTA) method for 6X super-resolution), inpainting, and 3D shape reconstruction, we demonstrate the dominance of BandRC over existing state-of-the-art activation functions.
摘要：近年来，隐式神经表示（INR）在计算机视觉社区中广受欢迎。这主要是由于许多计算机视觉任务中INR的表现强劲。这些网络可以在给定离散信号表示的情况下提取连续信号表示。在先前的研究中，反复表明，INR性能与其多层感知器中使用的激活函数具有很强的相关性。尽管已经提出了许多彼此竞争的激活功能，但它们具有一些共同的挑战，例如光谱偏见（信号中对高频含量缺乏敏感性），对信号噪声的鲁棒性和同时捕获本地和全球特征的难度有限。此外，手动参数调整的要求。为了解决这些问题，我们引入了一种新颖的激活函数，频带移动弹性的余弦激活的隐式神经网络\ textbf {（bandrc）}量身定制，以进一步增强信号表示能力。我们还合并了从信号中提取的深层知识，以通过特定于任务模型调整激活函数。 Through a mathematical analysis and a series of experiments which include image reconstruction (with a +8.93 dB PSNR improvement over the nearest counterpart), denoising (with a +0.46 dB increase in PSNR), super-resolution (with a +1.03 dB improvement over the nearest State-Of-The-Art (SOTA) method for 6X super-resolution), inpainting, and 3D shape reconstruction we demonstrate BandRC在现有的最新激活函数上的主导地位。
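
Illustration (not from the paper): the abstract above reports its gains in dB of PSNR. For readers less familiar with the metric, the sketch below computes the standard peak signal-to-noise ratio, 10 * log10(peak^2 / MSE), and converts a PSNR gain into the equivalent reduction factor in mean squared error. The function names and the toy signals are assumptions for illustration.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_reduction_factor(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 10.0 ** (gain_db / 10.0)


# Toy signals in [0, 1]; each sample is off by 0.1, so MSE is 0.01.
print(psnr([0.0, 1.0], [0.1, 0.9]))
print(mse_reduction_factor(8.93))
```

Under this definition, the +8.93 dB reconstruction gain quoted in the abstract corresponds to a mean squared error roughly 7.8 times lower than the nearest counterpart.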

Title: Joint Graph Estimation and Signal Restoration for Robust Federated Learning

Authors: Tsutahiro Fukuhara, Junya Hara, Hiroshi Higashi, Yuichi Tanaka
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.11648
Pdf URL: https://arxiv.org/pdf/2505.11648
Copy Paste: [[2505.11648]] Joint Graph Estimation and Signal Restoration for Robust Federated Learning(https://arxiv.org/abs/2505.11648)
Keywords: restoration
Abstract: We propose a robust aggregation method for model parameters in federated learning (FL) under noisy communications. FL is a distributed machine learning paradigm in which a central server aggregates local model parameters from multiple clients. These parameters are often noisy and/or have missing values during data collection, training, and communication between the clients and server. This may cause a considerable drop in model accuracy. To address this issue, we learn a graph that represents pairwise relationships between model parameters of the clients during aggregation. We realize it with a joint problem of graph learning and signal (i.e., model parameters) restoration. The problem is formulated as a difference-of-convex (DC) optimization, which is efficiently solved via a proximal DC algorithm. Experimental results on MNIST and CIFAR-10 datasets show that the proposed method outperforms existing approaches by up to 2-5% in classification accuracy under biased data distributions and noisy conditions.
摘要：我们在嘈杂的通信下提出了一种强大的聚合方法，用于联合学习（FL）中的模型参数。 FL是一种分布式机器学习范式，其中中央服务器从多个客户端汇总了本地模型参数。这些参数通常在数据收集，培训和客户端和服务器之间的通信过程中嘈杂和/或缺少值。这可能会导致模型准确性下降。为了解决这个问题，我们学习了一个图表，该图表示汇总过程中客户端模型参数之间的成对关系。我们通过图形学习和信号（即模型参数）恢复的联合问题意识到这一点。该问题被公式为征收差异（DC）优化，该优化是通过DC近端算法有效解决的。 MNIST和CIFAR-10数据集的实验结果表明，在有偏见的数据分布和嘈杂条件下，所提出的方法的分类精度最高为$ 2 $ - $ 5 \％$ $ 5 \％。

Title: DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

Authors: Ziyu Zhao, Xiaoguang Li, Linjia Shi, Nasrin Imanpour, Song Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11676
Pdf URL: https://arxiv.org/pdf/2505.11676
Copy Paste: [[2505.11676]] DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2505.11676)
Keywords: generation
Abstract: Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
摘要：开放式语义语义分割旨在将图像分为不同的语义区域，以在像素级别的可见类别和看不见的类别中分为不同的语义区域。当前的方法利用文本嵌入剪辑（例如剪辑）等文本嵌入，但即使在训练过程中进行了广泛的一致性，即使在图像和文本嵌入之间的固有域间隙也很难。此外，仅依靠深层文本对准特征限制浅层特征指导，这对于检测小物体和细节至关重要，最终降低了细分精度。为了解决这些限制，我们为此任务提出了一个双重提示框架DPSEG。我们的方法结合了双提取的成本量产生，成本量引导的解码器以及一种语义引导的及时改进策略，该策略利用我们的双重提示方案来减轻视觉及时生成中的一致性问题。通过将视觉提示编码器中的视觉嵌入结合在一起，我们的方法可以减少文本和图像嵌入之间的域间隙，同时通过浅色特征提供多层次的指导。广泛的实验表明，我们的方法在多个公共数据集上的现有最新方法大大优于现有的最新方法。

Title: Mollifier Layers: Enabling Efficient High-Order Derivatives in Inverse PDE Learning

Authors: Ananyae Kumar Bhartari, Vinayak Vinayak, Vivek B Shenoy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.11682
Pdf URL: https://arxiv.org/pdf/2505.11682
Copy Paste: [[2505.11682]] Mollifier Layers: Enabling Efficient High-Order Derivatives in Inverse PDE Learning(https://arxiv.org/abs/2505.11682)
Keywords: super-resolution
Abstract: Parameter estimation in inverse problems involving partial differential equations (PDEs) underpins modeling across scientific disciplines, especially when parameters vary in space or time. Physics-informed Machine Learning (PhiML) integrates PDE constraints into deep learning, but prevailing approaches depend on recursive automatic differentiation (autodiff), which produces inaccurate high-order derivatives, inflates memory usage, and underperforms in noisy settings. We propose Mollifier Layers, a lightweight, architecture-agnostic module that replaces autodiff with convolutional operations using analytically defined mollifiers. This reframing of derivative computation as smoothing integration enables efficient, noise-robust estimation of high-order derivatives directly from network outputs. Mollifier Layers attach at the output layer and require no architectural modifications. We compare them with three distinct architectures and benchmark performance across first-, second-, and fourth-order PDEs -- including Langevin dynamics, heat diffusion, and reaction-diffusion systems -- observing significant improvements in memory efficiency, training time and accuracy for parameter recovery across tasks. To demonstrate practical relevance, we apply Mollifier Layers to infer spatially varying epigenetic reaction rates from super-resolution chromatin imaging data -- a real-world inverse problem with biomedical significance. Our results establish Mollifier Layers as an efficient and scalable tool for physics-constrained learning.
摘要：涉及偏微分方程（PDE）的反问题的参数估计基础跨科学学科建模，尤其是当参数在空间或时间上变化时。物理知识的机器学习（PHIML）将PDE的约束整合到深度学习中，但是流行的方法取决于递归自动分化（AUTODIFF），这种分化（AUTODIFF）会产生不准确的高级导数，膨胀记忆使用情况，并且在嘈杂的环境中表现不佳。我们建议使用分析定义的MolliFiers使用卷积操作代替AutoDiff，这是一种轻巧的，体系结构 - 敏捷的模块。将衍生化计算作为平滑积分的重新标记，可以直接从网络输出中对高级导数的有效，噪声估算。 MolliFier层附着在输出层，不需要架构修改。我们将它们与三种不同的体系结构和基准性能进行比较，包括Langevin动力学，热扩散和反应扩散系统在内，可以观察到记忆效率，训练时间和准确性的显着提高，以跨任务恢复参数恢复。为了证明实际相关性，我们将软体动物层从超分辨率染色质成像数据中推断出空间变化的表观遗传反应速率，这是一个具有生物医学意义的现实世界反相问题。我们的结果将软体动物层建立为物理受限学习的有效且可扩展的工具。
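
Illustration (not from the paper): the abstract above replaces automatic differentiation with convolution against analytically defined mollifiers. The underlying identity is that the derivative of a smoothed function equals the function convolved with the derivative of the smoothing kernel, so no derivative of the function itself is needed. The sketch below applies this to a scalar function using the standard compactly supported bump kernel and a midpoint-rule quadrature. The kernel choice, the width `eps`, the grid size `n`, and all function names are assumptions of this sketch rather than the paper's layer design.

```python
import math


def bump(u):
    """Unnormalized bump kernel: exp(-1 / (1 - u^2)) for |u| < 1, else 0."""
    if abs(u) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - u * u))


def bump_prime(u):
    """Analytic derivative of the bump kernel with respect to u."""
    if abs(u) >= 1.0:
        return 0.0
    return bump(u) * (-2.0 * u) / (1.0 - u * u) ** 2


def mollified_derivative(f, x, eps=0.2, n=400):
    """Approximate d/dx of the smoothed f at x as the integral of
    f(x - y) * phi_eps'(y) over |y| < eps, using the midpoint rule."""
    h = 2.0 * eps / n
    ys = [-eps + (k + 0.5) * h for k in range(n)]
    # Normalize so that phi_eps(y) = bump(y / eps) / norm integrates to 1.
    norm = sum(bump(y / eps) for y in ys) * h
    total = sum(f(x - y) * bump_prime(y / eps) for y in ys) * h
    return total / (eps * norm)


# Only samples of f are used; f is never differentiated directly.
print(mollified_derivative(math.sin, 1.0), math.cos(1.0))
```

The estimate is the derivative of the smoothed function, so it carries a small bias that shrinks with `eps`; the same smoothing is what damps high-frequency noise in the samples.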

Title: Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

Authors: Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2505.11695
Pdf URL: https://arxiv.org/pdf/2505.11695
Copy Paste: [[2505.11695]] Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization(https://arxiv.org/abs/2505.11695)
Keywords: generation
Abstract: We introduce Qronos -- a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.
摘要：我们介绍了Qronos - 一种新的最新训练后量化算法，该算法依次绕开并更新神经网络权重。 Qronos不仅因重量和激活量化而明确纠正错误，而且还纠正了量化以前的层所产生的错误。我们的迭代算法基于一个可解释的纪律优化框架，该框架涵盖并超过了现有的数据驱动方法。在每个步骤中，Qronos通过最佳更新规则在错误校正和扩散之间交替。重要的是，我们证明Qronos承认了一种有效的实施，该实施使用了Cholesky分解来解决最小二乘问题。我们还证明了Qronos与现有的转换技术兼容，例如基于Hadamard的不一致处理和权重激活缩放均衡等级等。我们使用llama3家族中最近自回归语言的产生模型评估Qronos； Qronos在量化权重，激活和/或KV缓存时始终优于先前的最新自适应舍入方法。
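
Illustration (not from the paper): the abstract above notes that the algorithm admits an implementation that uses the Cholesky decomposition to solve least-squares problems. The sketch below shows that generic building block only: it forms the normal equations (A^T A) x = A^T b, factors A^T A = L L^T, and solves by forward and back substitution. It is not the Qronos update rule, and the function names and the tiny example system are assumptions for illustration. It also assumes A has full column rank so that A^T A is positive definite.

```python
import math


def cholesky(M):
    """Return lower-triangular L with M = L L^T (M symmetric positive definite)."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L


def solve_least_squares(A, b):
    """Minimize ||A x - b||^2 via the normal equations and Cholesky."""
    n = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(n)]
    L = cholesky(AtA)
    # Forward substitution: L y = A^T b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (Atb[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Overdetermined but consistent system whose solution is x = [1, 2].
print(solve_least_squares([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0]))
```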

Title: LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance

Authors: Jae Myung Kim, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11703
Pdf URL: https://arxiv.org/pdf/2505.11703
Copy Paste: [[2505.11703]] LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance(https://arxiv.org/abs/2505.11703)
Keywords: generation
Abstract: Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement. The code is available at this https URL.
摘要：尽管最近在文本到图像生成方面取得了进步，但使用合成生成的数据很少带来明显的促进监督学习的绩效。通常，合成数据集并不忠实地重新创建真实数据的数据分布，即缺乏有效的下游模型培训所需的保真度或多样性。尽管以前的工作已经使用了很少的指导来解决此问题，但现有方法仍然无法捕获和生成特定真实图像所特有的功能。在本文中，我们介绍了一个新颖的数据集生成框架，名为Lora-Funed Training Data Generation，带有很少的指导。我们的方法微型洛拉（Lora）在单个真实图像上的权重并在推理时间融合它们，从而产生合成图像，这些图像结合了真实图像的特征，以提高多样性和生成数据的保真度。我们评估了Loft在10个数据集上产生的合成数据，使用每类8至64个真实图像作为指导，并规模每类高达1000张图像。我们的实验表明，对阁楼生成数据的培训始终优于其他合成数据集方法，随着数据集大小的增加而显着提高了准确性。此外，我们的分析表明，Loft生成具有高忠诚度和足够多样性的数据集，这有助于提高性能。该代码可在此HTTPS URL上找到。

Title: Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration

Authors: Haipeng Fang, Sheng Tang, Juan Cao, Enshuo Zhang, Fan Tang, Tong-Yee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11707
Pdf URL: https://arxiv.org/pdf/2505.11707
Copy Paste: [[2505.11707]] Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration(https://arxiv.org/abs/2505.11707)
Keywords: generation
Abstract: Diffusion transformers have shown exceptional performance in visual generation but incur high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Served in a post-training way, the proposed method can be integrated seamlessly into any DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method; for example, it achieves 1.55 times acceleration with negligible impact on image quality. Project page: this https URL.
摘要：扩散变压器在视觉生成中表现出了出色的性能，但会产生高计算成本。引入了通过在类似代币之间共享denoising过程来压缩模型的令牌减少技术。但是，现有方法忽略了扩散模型的降级先生，导致了次优的加速度和图像质量的降低。这项研究提出了一个新颖的概念：参加未通过扩散过程参加的领域的修剪裁员。我们根据结构 - 然后详细降级先验分析了特征冗余的位置和程度。随后，我们介绍了SDTM，这是一种结构 - 然后动态压缩冗余的结构 - 详细令牌合并方法。具体而言，我们设计了动态的视觉令牌合并，调整压缩率并促使在不同阶段重新加权。以训练后的方式服务，可以将所提出的方法无缝集成到任何DIT架构中。例如，各种骨干，调度程序和数据集进行了广泛的实验，例如，它的优势是达到1.55倍的加速度，对图像质量的影响微不足道。项目页面：此HTTPS URL。
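
Illustration (not from the paper): the abstract above builds on token reduction, where similar tokens share computation. The sketch below shows a generic similarity-based merge: each token joins the existing group whose centroid it most resembles by cosine similarity, provided the similarity reaches a threshold, and each group is then represented by its mean vector. This is a plain greedy scheme chosen for brevity, not the SDTM procedure; the function names and the threshold value are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily group tokens by cosine similarity and return group means."""
    groups = []  # each entry is (running sum vector, member count)
    for tok in tokens:
        best, best_sim = None, threshold
        for idx, (vec_sum, count) in enumerate(groups):
            centroid = [s / count for s in vec_sum]
            sim = cosine(tok, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            groups.append((list(tok), 1))  # start a new group
        else:
            vec_sum, count = groups[best]
            groups[best] = ([s + t for s, t in zip(vec_sum, tok)], count + 1)
    return [[s / count for s in vec_sum] for vec_sum, count in groups]


# Two nearly parallel tokens collapse into one; the third stays separate.
print(merge_similar_tokens([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]))
```

Fewer tokens enter the subsequent attention layers, which is where the speedup in token-reduction methods comes from; the trade-off is the detail lost by averaging.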

Title: UGoDIT: Unsupervised Group Deep Image Prior Via Transferable Weights

Authors: Shijun Liang, Ismail R. Alkhouri, Siddhant Gautam, Qing Qu, Saiprasad Ravishankar
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2505.11720
Pdf URL: https://arxiv.org/pdf/2505.11720
Copy Paste: [[2505.11720]] UGoDIT: Unsupervised Group Deep Image Prior Via Transferable Weights(https://arxiv.org/abs/2505.11720)
Keywords: generative
Abstract: Recent advances in data-centric deep generative models have led to significant progress in solving inverse imaging problems. However, these models (e.g., diffusion models (DMs)) typically require large amounts of fully sampled (clean) training data, which is often impractical in medical and scientific settings such as dynamic imaging. On the other hand, training-data-free approaches like the Deep Image Prior (DIP) do not require clean ground-truth images but suffer from noise overfitting and can be computationally expensive as the network parameters need to be optimized for each measurement set independently. Moreover, DIP-based methods often overlook the potential of learning a prior using a small number of sub-sampled measurements (or degraded images) available during training. In this paper, we propose UGoDIT, an Unsupervised Group DIP via Transferable weights, designed for the low-data regime where only a very small number, M, of sub-sampled measurement vectors are available during training. Our method learns a set of transferable weights by optimizing a shared encoder and M disentangled decoders. At test time, we reconstruct the unseen degraded image using a DIP network, where part of the parameters are fixed to the learned weights, while the remaining are optimized to enforce measurement consistency. We evaluate UGoDIT on both medical (multi-coil MRI) and natural (super resolution and non-linear deblurring) image recovery tasks under various settings. Compared to recent standalone DIP methods, UGoDIT provides accelerated convergence and notable improvement in reconstruction quality. Furthermore, our method achieves performance competitive with SOTA DM-based and supervised approaches, despite not requiring large amounts of clean training data.

Title: Semantically-Aware Game Image Quality Assessment

Authors: Kai Zhu, Vignesh Edithal, Le Zhang, Ilia Blank, Imran Junejo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.11724
Pdf URL: https://arxiv.org/pdf/2505.11724
Copy Paste: [[2505.11724]] Semantically-Aware Game Image Quality Assessment(https://arxiv.org/abs/2505.11724)
Keywords: quality assessment
Abstract: Assessing the visual quality of video game graphics presents unique challenges due to the absence of reference images and the distinct types of distortions, such as aliasing, texture blur, and geometry level of detail (LOD) issues, which differ from those in natural images or user-generated content. Existing no-reference image and video quality assessment (NR-IQA/VQA) methods fail to generalize to gaming environments as they are primarily designed for distortions like compression artifacts. This study introduces a semantically-aware NR-IQA model tailored to gaming. The model employs a knowledge-distilled Game distortion feature extractor (GDFE) to detect and quantify game-specific distortions, while integrating semantic gating via CLIP embeddings to dynamically weight feature importance based on scene content. Training on gameplay data recorded across graphical quality presets enables the model to produce quality scores that align with human perception. Our results demonstrate that the GDFE, trained through knowledge distillation from binary classifiers, generalizes effectively to intermediate distortion levels unseen during training. Semantic gating further improves contextual relevance and reduces prediction variance. In the absence of in-domain NR-IQA baselines, our model outperforms out-of-domain methods and exhibits robust, monotonic quality trends across unseen games in the same genre. This work establishes a foundation for automated graphical quality assessment in gaming, advancing NR-IQA methods in this domain.

Title: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Authors: Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.11737
Pdf URL: https://arxiv.org/pdf/2505.11737
Copy Paste: [[2505.11737]] Token-Level Uncertainty Estimation for Large Language Model Reasoning(https://arxiv.org/abs/2505.11737)
Keywords: generation
Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a token-level uncertainty estimation framework to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation to LLM decoding, generating predictive distributions that we use to estimate token-level uncertainties. We then aggregate these uncertainties to reflect semantic uncertainty of the generated sequences. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that our token-level uncertainty metrics strongly correlate with answer correctness and model robustness. Additionally, we explore using uncertainty to directly enhance the model's reasoning performance through multiple generations and the particle filtering algorithm. Our approach consistently outperforms existing uncertainty estimation methods, establishing effective uncertainty estimation as a valuable tool for both evaluating and improving reasoning generation in LLMs.
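
Note: the abstract above says that predictive distributions obtained under random weight perturbation are used to estimate token-level uncertainty, but it does not give the formulas. The sketch below uses the common entropy-based decomposition (total = entropy of the averaged distribution, aleatoric = average entropy, epistemic = the difference), which is an assumption here rather than the paper's confirmed definition. Mean aggregation over tokens is likewise only one plausible choice.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def token_uncertainty(dists):
    """Decompose uncertainty for one token position.

    dists: K next-token distributions over the same vocabulary, one per
    perturbed copy of the model weights (hypothetical input format).
    Returns (total, aleatoric, epistemic).
    """
    k = len(dists)
    vocab = len(dists[0])
    mean = [sum(d[i] for d in dists) / k for i in range(vocab)]
    total = entropy(mean)                          # entropy of the averaged prediction
    aleatoric = sum(entropy(d) for d in dists) / k  # average per-sample entropy
    epistemic = total - aleatoric                  # disagreement between samples
    return total, aleatoric, epistemic


def sequence_uncertainty(per_token_values):
    """Aggregate token-level values into one sequence score by averaging (assumed)."""
    return sum(per_token_values) / len(per_token_values)
```

When every perturbed model returns the same distribution, the epistemic term is zero; when the models confidently disagree, it accounts for all of the total uncertainty.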

Title: Redefining Neural Operators in $d+1$ Dimensions

Authors: Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang
Subjects: cs.LG, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2505.11766
Pdf URL: https://arxiv.org/pdf/2505.11766
Copy Paste: [[2505.11766]] Redefining Neural Operators in $d+1$ Dimensions(https://arxiv.org/abs/2505.11766)
Keywords: super-resolution
Abstract: Neural Operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated for universally approximating various operators. Although recent advancements following this definition have developed effective modules to better approximate the kernel function defined on the original domain (with $d$ dimensions, $d=1, 2, 3, \ldots$), the unclear evolution mechanism in the embedding spaces hinders the design of neural operators that can fully capture the target system's evolution. Drawing on recent breakthroughs in quantum simulation of partial differential equations (PDEs), we elucidate the linear evolution process in neural operators. Based on that, we redefine neural operators on a new $d+1$ dimensional domain. Within this framework, we implement our proposed Schrödingerised Kernel Neural Operator (SKNO), which aligns better with the $d+1$ dimensional evolution. In experiments, our $d+1$ dimensional evolving linear block performs far better than others. We also test SKNO's SOTA performance on various benchmarks and on the zero-shot super-resolution task. In addition, we analyse the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying $d+1$ dimensional evolution.

Title: Generative and Contrastive Graph Representation Learning

Authors: Jiali Chen, Avijit Mukherjee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11776
Pdf URL: https://arxiv.org/pdf/2505.11776
Copy Paste: [[2505.11776]] Generative and Contrastive Graph Representation Learning(https://arxiv.org/abs/2505.11776)
Keywords: generation, generative
Abstract: Self-supervised learning (SSL) on graphs generates node and graph representations (i.e., embeddings) that can be used for downstream tasks such as node classification, node clustering, and link prediction. Graph SSL is particularly useful in scenarios with limited or no labeled data. Existing SSL methods predominantly follow contrastive or generative paradigms, each excelling in different tasks: contrastive methods typically perform well on classification tasks, while generative methods often excel in link prediction. In this paper, we present a novel architecture for graph SSL that integrates the strengths of both approaches. Our framework introduces community-aware node-level contrastive learning, providing more robust and effective positive and negative node pairs generation, alongside graph-level contrastive learning to capture global semantic information. Additionally, we employ a comprehensive augmentation strategy that combines feature masking, node perturbation, and edge perturbation, enabling robust and diverse representation learning. By incorporating these enhancements, our model achieves superior performance across multiple tasks, including node classification, clustering, and link prediction. Evaluations on open benchmark datasets demonstrate that our model outperforms state-of-the-art methods, achieving a performance lift of 0.23%-2.01% depending on the task and dataset.
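
Note: the abstract above names three augmentations (feature masking, node perturbation, edge perturbation) without specifying them. The sketch below shows one simple reading of each, using random dropping. The function names, the drop probabilities, and the choice of dropping rather than adding nodes or edges are assumptions for illustration, not the paper's procedure.

```python
import random


def mask_features(x, p, rng):
    """Zero each feature column with probability p (same mask for every node)."""
    dim = len(x[0])
    keep = [rng.random() >= p for _ in range(dim)]
    return [[v if k else 0.0 for v, k in zip(row, keep)] for row in x]


def drop_edges(edges, p, rng):
    """Remove each edge independently with probability p."""
    return [e for e in edges if rng.random() >= p]


def drop_nodes(num_nodes, edges, p, rng):
    """Remove each node with probability p, along with its incident edges."""
    kept = {n for n in range(num_nodes) if rng.random() >= p}
    return kept, [(u, v) for u, v in edges if u in kept and v in kept]


# Usage: two differently augmented views of the same toy graph,
# as would be fed to a contrastive objective.
rng = random.Random(0)
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
edges = [(0, 1), (1, 2), (0, 2)]
view_a = (mask_features(features, 0.3, rng), drop_edges(edges, 0.2, rng))
view_b = (mask_features(features, 0.3, rng), drop_edges(edges, 0.2, rng))
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible across runs.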

Title: Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations

Authors: Fu-Yun Wang, Keqiang Sun, Yao Teng, Xihui Liu, Jiaming Song, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11777
Pdf URL: https://arxiv.org/pdf/2505.11777
Copy Paste: [[2505.11777]] Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations(https://arxiv.org/abs/2505.11777)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. While existing PO methods primarily concentrate on producing favorable outputs, they often overlook the significance of classifier-free guidance (CFG) in mitigating undesirable results. Diffusion-NPO addresses this gap by introducing negative preference optimization (NPO), training models to generate outputs opposite to human preferences and thereby steering them away from unfavorable outcomes. However, prior NPO approaches, including Diffusion-NPO, rely on costly and fragile procedures for obtaining explicit preference annotations (e.g., manual pairwise labeling or reward model training), limiting their practicality in domains where such data are scarce or difficult to acquire. In this work, we introduce Self-NPO, a Negative Preference Optimization approach that learns exclusively from the model itself, thereby eliminating the need for manual data labeling or reward model training. Moreover, our method is highly efficient and does not require exhaustive data sampling. We demonstrate that Self-NPO integrates seamlessly into widely used diffusion models, including SD1.5, SDXL, and CogVideoX, as well as models already optimized for human preferences, consistently enhancing both their generation quality and alignment with human preferences.

Title: JULI: Jailbreak Large Language Models by Self-Introspection

Authors: Jesson Wang, Zhanhao Hu, David Wagner
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.11790
Pdf URL: https://arxiv.org/pdf/2505.11790
Copy Paste: [[2505.11790]] JULI: Jailbreak Large Language Models by Self-Introspection(https://arxiv.org/abs/2505.11790)
Keywords: generation
Abstract: Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-$5$ token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.

Title: CL-CaGAN: Capsule differential adversarial continuous learning for cross-domain hyperspectral anomaly detection

Authors: Jianing Wang, Siying Guo, Zheng Hua, Runhu Huang, Jinyu Hu, Maoguo Gong
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2505.11793
Pdf URL: https://arxiv.org/pdf/2505.11793
Copy Paste: [[2505.11793]] CL-CaGAN: Capsule differential adversarial continuous learning for cross-domain hyperspectral anomaly detection(https://arxiv.org/abs/2505.11793)
Keywords: generation, generative
Abstract: Anomaly detection (AD) has attracted remarkable attention in hyperspectral image (HSI) processing, and most existing deep learning (DL)-based algorithms show strong potential for detecting anomalous samples through a specific training process under the current scenario. However, limited prior information and the catastrophic forgetting problem pose crucial challenges for existing DL structures in open-scenario, cross-domain detection. To improve detection performance, a novel continual learning-based capsule differential generative adversarial network (CL-CaGAN) is proposed to raise cross-scenario learning performance and facilitate the real-world application of DL-based structures to the hyperspectral AD (HAD) task. First, a modified capsule structure with an adversarial learning network is constructed to estimate the background distribution, compensating for the lack of prior information. To mitigate catastrophic forgetting, a clustering-based sample replay strategy and an additional self-distillation regularization are integrated to merge historical and future knowledge in the continual AD task, while the discriminative learning ability carried over from the previous detection scenario to the current one is retained by the elaborately designed structure with its continual learning (CL) strategy. In addition, differentiable enhancement is applied to augment the generation performance on the training data. This further stabilizes training with better convergence and efficiently consolidates the reconstruction ability for background samples. To verify the effectiveness of the proposed CL-CaGAN, we conduct experiments on several real HSIs, and the results indicate that CL-CaGAN achieves higher detection performance and continual learning capacity, mitigating catastrophic forgetting under cross-domain scenarios.

Title: CL-BioGAN: Biologically-Inspired Cross-Domain Continual Learning for Hyperspectral Anomaly Detection

Authors: Jianing Wang, Zheng Hua, Wan Zhang, Shengjia Hao, Yuqiong Yao, Maoguo Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11796
Pdf URL: https://arxiv.org/pdf/2505.11796
Copy Paste: [[2505.11796]] CL-BioGAN: Biologically-Inspired Cross-Domain Continual Learning for Hyperspectral Anomaly Detection(https://arxiv.org/abs/2505.11796)
Keywords: generative
Abstract: Balancing memory stability and learning flexibility in continual learning (CL) is a core challenge for the cross-scene Hyperspectral Anomaly Detection (HAD) task. Biological neural networks can actively forget historical knowledge that conflicts with the learning of new experiences by regulating learning-triggered synaptic expansion and synaptic convergence. Inspired by this phenomenon, we propose a novel Biologically-Inspired Continual Learning Generative Adversarial Network (CL-BioGAN) to augment continuous distribution-fitting ability for the cross-domain HAD task, where a Continual Learning Bio-inspired Loss (CL-Bio Loss) and a self-attention Generative Adversarial Network (BioGAN) are incorporated to realize forgetting of historical knowledge and to include a replay strategy in the proposed BioGAN. Specifically, a novel Bio-Inspired Loss composed of an Active Forgetting Loss (AF Loss) and a CL loss is designed to realize parameter releasing and enhancing between new and historical tasks from a Bayesian perspective. Meanwhile, the BioGAN loss with L2-norm enhances self-attention (SA) to further balance stability and flexibility, better fitting the background distribution for open-scenario HAD (OHAD) tasks. Experimental results show that the proposed CL-BioGAN achieves more robust and satisfactory accuracy for cross-domain HAD with fewer parameters and lower computation cost. This dual contribution not only elevates CL performance but also offers new insights into neural adaptation mechanisms in the OHAD task.

Title: Diffmv: A Unified Diffusion Framework for Healthcare Predictions with Random Missing Views and View Laziness

Authors: Chuang Zhao, Hui Tang, Hongke Zhao, Xiaomeng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11802
Pdf URL: https://arxiv.org/pdf/2505.11802
Copy Paste: [[2505.11802]] Diffmv: A Unified Diffusion Framework for Healthcare Predictions with Random Missing Views and View Laziness(https://arxiv.org/abs/2505.11802)
Keywords: generative
Abstract: Advanced healthcare predictions offer significant improvements in patient outcomes by leveraging predictive analytics. Existing works primarily utilize various views of Electronic Health Record (EHR) data, such as diagnoses, lab tests, or clinical notes, for model training. These methods typically assume the availability of complete EHR views and that the designed model could fully leverage the potential of each view. However, in practice, random missing views and view laziness present two significant challenges that hinder further improvements in multi-view utilization. To address these challenges, we introduce Diffmv, an innovative diffusion-based generative framework designed to advance the exploitation of multiple views of EHR data. Specifically, to address random missing views, we integrate various views of EHR data into a unified diffusion-denoising framework, enriched with diverse contextual conditions to facilitate progressive alignment and view transformation. To mitigate view laziness, we propose a novel reweighting strategy that assesses the relative advantages of each view, promoting a balanced utilization of various data views within the model. Our proposed strategy achieves superior performance across multiple health prediction tasks derived from three popular datasets, including multi-view and multi-modality scenarios.

Title: SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation

Authors: Yixuan Dong, Fang-Yi Su, Jung-Hsien Chiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11813
Pdf URL: https://arxiv.org/pdf/2505.11813
Copy Paste: [[2505.11813]] SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation(https://arxiv.org/abs/2505.11813)
Keywords: generative
Abstract: Data augmentation for domain-specific image classification tasks often struggles to simultaneously address diversity, faithfulness, and label clarity of generated data, leading to suboptimal performance in downstream tasks. While existing generative diffusion model-based methods aim to enhance augmentation, they fail to cohesively tackle these three critical aspects and often overlook intrinsic challenges of diffusion models, such as sensitivity to model characteristics and stochasticity under strong transformations. In this paper, we propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency, while mitigating diffusion model limitations. Extensive experiments across fine-grained, long-tail, few-shot, and background robustness tasks demonstrate our method's superior performance over state-of-the-art approaches.

Title: RVTBench: A Benchmark for Visual Reasoning Tasks

Authors: Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11838
Pdf URL: https://arxiv.org/pdf/2505.11838
Copy Paste: [[2505.11838]] RVTBench: A Benchmark for Visual Reasoning Tasks(https://arxiv.org/abs/2505.11838)
Keywords: generation
Abstract: Visual reasoning, the capability to interpret visual input in response to implicit text queries through multi-step reasoning, remains a challenge for deep learning models due to the lack of relevant benchmarks. Previous work in visual reasoning has primarily focused on reasoning segmentation, where models aim to segment objects based on implicit text queries. This paper introduces reasoning visual tasks (RVTs), a unified formulation that extends beyond traditional video reasoning segmentation to a diverse family of visual language reasoning problems and can therefore accommodate multiple output formats, including bounding boxes, natural language descriptions, and question-answer pairs. Correspondingly, we identify the limitations of current benchmark construction methods that rely solely on large language models (LLMs), which inadequately capture complex spatial-temporal relationships and multi-step reasoning chains in video due to their reliance on token representation, resulting in benchmarks with artificially limited reasoning complexity. To address this limitation, we propose a novel automated RVT benchmark construction pipeline that leverages digital twin (DT) representations as structured intermediaries between perception and the generation of implicit text queries. Based on this method, we construct RVTBench, an RVT benchmark containing 3,896 queries of over 1.2 million tokens across four types of RVT (segmentation, grounding, VQA, and summary), three reasoning categories (semantic, spatial, and temporal), and four increasing difficulty levels, derived from 200 video sequences. Finally, we propose RVTagent, an agent framework for RVT that allows for zero-shot generalization across various types of RVT without task-specific fine-tuning.

Title: Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning

Authors: Kalyan Cherukuri, Aarav Lala
Subjects: cs.LG, cs.AI, cs.CG
Abstract URL: https://arxiv.org/abs/2505.11864
Pdf URL: https://arxiv.org/pdf/2505.11864
Copy Paste: [[2505.11864]] Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning(https://arxiv.org/abs/2505.11864)
Keywords: generative
Abstract: As generative agents become increasingly capable, alignment of their behavior with complex human values remains a fundamental challenge. Existing approaches often simplify human intent through reduction to a scalar reward, overlooking the multi-faceted nature of human feedback. In this work, we introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. We formalize the problem of recovering a Pareto-optimal reward representation from noisy preference queries and establish conditions for identifying the underlying multi-objective structure. We derive tight sample complexity bounds for recovering $\epsilon$-approximations of the Pareto front and introduce a regret formulation to quantify suboptimality in this multi-objective setting. Furthermore, we propose a provably convergent algorithm for policy optimization using preference-inferred reward cones. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors in a high-dimension and value-pluralistic environment.
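
Note: the abstract above refers to recovering approximations of a Pareto front over vector-valued rewards. As background only, the sketch below shows the standard definition of Pareto dominance and a brute-force front filter for a finite set of reward vectors, assuming every objective is maximized. It does not implement the paper's preference-based recovery algorithm, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (O(n^2) comparison)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Usage: two objectives, both maximized.
candidates = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
front = pareto_front(candidates)
```

Here `(1, 1)` and `(2, 1)` are dropped because `(2, 2)` is at least as good in both objectives, while the remaining three points each represent a different trade-off.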

Title: GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder

Authors: Shiming Chen, Dingjie Fu, Salman Khan, Fahad Shahbaz Khan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.11882
Pdf URL: https://arxiv.org/pdf/2505.11882
Copy Paste: [[2505.11882]] GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder(https://arxiv.org/abs/2505.11882)
Keywords: generation, generative
Abstract: Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than $60\times$ faster training speed on AWA2. Codes are available at this https URL.

Title: Facial Recognition Leveraging Generative Adversarial Networks

Authors: Zhongwen Li, Zongwei Li, Xiaoqi Li
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2505.11884
Pdf URL: https://arxiv.org/pdf/2505.11884
Copy Paste: [[2505.11884]] Facial Recognition Leveraging Generative Adversarial Networks(https://arxiv.org/abs/2505.11884)
Keywords: generation, generative
Abstract: Face recognition performance based on deep learning heavily relies on large-scale training data, which is often difficult to acquire in practical applications. To address this challenge, this paper proposes a GAN-based data augmentation method with three key contributions: (1) a residual-embedded generator to alleviate gradient vanishing/exploding problems, (2) an Inception ResNet-V1 based FaceNet discriminator for improved adversarial training, and (3) an end-to-end framework that jointly optimizes data generation and recognition performance. Experimental results demonstrate that our approach achieves stable training dynamics and significantly improves face recognition accuracy by 12.7% on the LFW benchmark compared to baseline methods, while maintaining good generalization capability with limited training samples.

Title: Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

Authors: Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.11907
Pdf URL: https://arxiv.org/pdf/2505.11907
Copy Paste: [[2505.11907]] Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?(https://arxiv.org/abs/2505.11907)
Keywords: generation
Abstract: The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: this https URL

Title: SafeVid: Toward Safety Aligned Video Large Multimodal Models

Authors: Yixu Wang, Jiaxin Song, Yifeng Gao, Xin Wang, Yang Yao, Yan Teng, Xingjun Ma, Yingchun Wang, Yu-Gang Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11926
Pdf URL: https://arxiv.org/pdf/2505.11926
Copy Paste: [[2505.11926]] SafeVid: Toward Safety Aligned Video Large Multimodal Models(https://arxiv.org/abs/2505.11926)
Keywords: generation
Abstract: As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs. We have made SafeVid-350K dataset (this https URL) publicly available.
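Note on the DPO step named in the SafeVid abstract: the sketch below computes the standard Direct Preference Optimization loss for a single preference pair from sequence log-probabilities. It illustrates the general objective only and is not SafeVid's training code; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy prefers the chosen response over the
    # rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # written in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred (here, safer) response.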

Title: How can Diffusion Models Evolve into Continual Generators?

Authors: Jingren Liu, Zhong Ji, Xiangyu Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11936
Pdf URL: https://arxiv.org/pdf/2505.11936
Copy Paste: [[2505.11936]] How can Diffusion Models Evolve into Continual Generators?(https://arxiv.org/abs/2505.11936)
Keywords: generation, generative
Abstract: While diffusion models have achieved remarkable success in static data generation, their deployment in streaming or continual learning (CL) scenarios faces a major challenge: catastrophic forgetting (CF), where newly acquired generative capabilities overwrite previously learned ones. To systematically address this, we introduce a formal Continual Diffusion Generation (CDG) paradigm that characterizes and redefines CL in the context of generative diffusion models. Prior efforts often adapt heuristic strategies from continual classification tasks but lack alignment with the underlying diffusion process. In this work, we develop the first theoretical framework for CDG by analyzing cross-task dynamics in diffusion-based generative modeling. Our analysis reveals that the retention and stability of generative knowledge across tasks are governed by three key consistency criteria: inter-task knowledge consistency (IKC), unconditional knowledge consistency (UKC), and label knowledge consistency (LKC). Building on these insights, we propose Continual Consistency Diffusion (CCD), a principled framework that integrates these consistency objectives into training via hierarchical loss terms $\mathcal{L}_{IKC}$, $\mathcal{L}_{UKC}$, and $\mathcal{L}_{LKC}$. This promotes effective knowledge retention while enabling the assimilation of new generative capabilities. Extensive experiments on four benchmark datasets demonstrate that CCD achieves state-of-the-art performance under continual settings, with substantial gains in Mean Fidelity (MF) and Incremental Mean Fidelity (IMF), particularly in tasks with rich cross-task knowledge overlap.

Title: AoP-SAM: Automation of Prompts for Efficient Segmentation

Authors: Yi Chen, Mu-Young Son, Chuanbo Hua, Joo-Young Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11980
Pdf URL: https://arxiv.org/pdf/2505.11980
Copy Paste: [[2505.11980]] AoP-SAM: Automation of Prompts for Efficient Segmentation(https://arxiv.org/abs/2505.11980)
Keywords: generation
Abstract: The Segment Anything Model (SAM) is a powerful foundation model for image segmentation, showing robust zero-shot generalization through prompt engineering. However, relying on manual prompts is impractical for real-world applications, particularly in scenarios where rapid prompt provision and resource efficiency are crucial. In this paper, we propose the Automation of Prompts for SAM (AoP-SAM), a novel approach that learns to generate essential prompts in optimal locations automatically. AoP-SAM enhances SAM's efficiency and usability by eliminating manual input, making it better suited for real-world tasks. Our approach employs a lightweight yet efficient Prompt Predictor model that detects key entities across images and identifies the optimal regions for placing prompt candidates. This method leverages SAM's image embeddings, preserving its zero-shot generalization capabilities without requiring fine-tuning. Additionally, we introduce a test-time instance-level Adaptive Sampling and Filtering mechanism that generates prompts in a coarse-to-fine manner. This notably enhances both prompt and mask generation efficiency by reducing computational overhead and minimizing redundant mask refinements. Evaluations on three datasets demonstrate that AoP-SAM substantially improves both prompt generation efficiency and mask generation accuracy, making SAM more effective for automated segmentation tasks.

Title: Online Iterative Self-Alignment for Radiology Report Generation

Authors: Ting Xiao, Lei Shi, Yang Zhang, HaoFeng Yang, Zhe Wang, Chenjia Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.11983
Pdf URL: https://arxiv.org/pdf/2505.11983
Copy Paste: [[2505.11983]] Online Iterative Self-Alignment for Radiology Report Generation(https://arxiv.org/abs/2505.11983)
Keywords: generation
Abstract: Radiology Report Generation (RRG) is an important research topic for relieving radiologists' heavy workload. Existing RRG models mainly rely on supervised fine-tuning (SFT) based on different model architectures using data pairs of radiological images and corresponding radiologist-annotated reports. Recent research has shifted focus to post-training improvements, aligning RRG model outputs with human preferences using reinforcement learning (RL). However, the limited data coverage of high-quality annotated data poses risks of overfitting and generalization. This paper proposes a novel Online Iterative Self-Alignment (OISA) method for RRG that consists of four stages: self-generation of diverse data, self-evaluation for multi-objective preference data, self-alignment for multi-objective optimization, and self-iteration for further improvement. Our approach allows for generating varied reports tailored to specific clinical objectives, enhancing the overall performance of the RRG model iteratively. Unlike existing methods, our framework significantly increases data quality and optimizes performance through iterative multi-objective optimization. Experimental results demonstrate that our method surpasses previous approaches, achieving state-of-the-art performance across multiple evaluation metrics.

Title: Approximation theory for 1-Lipschitz ResNets

Authors: Davide Murari, Takashi Furuya, Carola-Bibiane Schönlieb
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2505.12003
Pdf URL: https://arxiv.org/pdf/2505.12003
Copy Paste: [[2505.12003]] Approximation theory for 1-Lipschitz ResNets(https://arxiv.org/abs/2505.12003)
Keywords: generative
Abstract: 1-Lipschitz neural networks are fundamental for generative modelling, inverse problems, and robust classifiers. In this paper, we focus on 1-Lipschitz residual networks (ResNets) based on explicit Euler steps of negative gradient flows and study their approximation capabilities. Leveraging the Restricted Stone-Weierstrass Theorem, we first show that these 1-Lipschitz ResNets are dense in the set of scalar 1-Lipschitz functions on any compact domain when width and depth are allowed to grow. We also show that these networks can exactly represent scalar piecewise affine 1-Lipschitz functions. We then prove a stronger statement: by inserting norm-constrained linear maps between the residual blocks, the same density holds when the hidden width is fixed. Because every layer obeys simple norm constraints, the resulting models can be trained with off-the-shelf optimisers. This paper provides the first universal approximation guarantees for 1-Lipschitz ResNets, laying a rigorous foundation for their practical use.
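Note on the construction named in the abstract above: one common way to obtain a 1-Lipschitz residual layer from an explicit Euler step of a negative gradient flow is x_next = x - h * W^T ReLU(W x + b), with spectral norm ||W|| <= 1 and step size 0 <= h <= 2. The sketch below assumes this particular parametrization; the abstract does not spell out the paper's exact layer, and the helper names (`normalize_weight`, `residual_step`, `dist`) are hypothetical.

```python
import math


def normalize_weight(W):
    """Rescale W so its spectral norm is at most 1.

    The Frobenius norm upper-bounds the spectral norm, so dividing by
    max(1, ||W||_F) is a simple (conservative) way to enforce the bound.
    """
    fro = math.sqrt(sum(w * w for row in W for w in row))
    scale = max(1.0, fro)
    return [[w / scale for w in row] for row in W]


def residual_step(x, W, b, h=1.0):
    """One explicit Euler step of the flow dx/dt = -W^T ReLU(W x + b).

    With ||W||_2 <= 1 and 0 <= h <= 2 this map is nonexpansive
    (1-Lipschitz) in the Euclidean norm.
    """
    # Pre-activation z = W x + b, then ReLU.
    z = [sum(wij * xj for wij, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    a = [max(0.0, zi) for zi in z]
    # Gradient direction g = W^T a.
    g = [sum(W[i][j] * a[i] for i in range(len(W))) for j in range(len(x))]
    return [xj - h * gj for xj, gj in zip(x, g)]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))
```

The layer is a gradient-descent step on a convex function whose gradient has Lipschitz constant at most ||W||^2 <= 1, which is why distances between any two inputs cannot grow. Stacking such layers keeps the whole network 1-Lipschitz.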

Title: Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation

Authors: Zhiying Li, Guanggang Geng, Yeying Jin, Zhizhi Guo, Bruce Gu, Jidong Huo, Zhaoxin Fan, Wenjun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12009
Pdf URL: https://arxiv.org/pdf/2505.12009
Copy Paste: [[2505.12009]] Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation(https://arxiv.org/abs/2505.12009)
Keywords: generation
Abstract: Expressive human pose and shape (EHPS) estimation is vital for digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention on potential security vulnerabilities. Current adversarial attacks on EHPS models often require white-box access (e.g., model details or gradients) or generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address these limitations, we propose a novel Unnoticeable Black-Box Attack (UBA) against EHPS models. UBA leverages the latent-space representations of natural images to generate an optimal adversarial noise pattern and iteratively refine its attack potency along an optimized direction in digital space. Crucially, this process relies solely on querying the model's output, requiring no internal knowledge of the EHPS architecture, while guiding the noise optimization toward greater stealth and effectiveness. Extensive experiments and visual analyses demonstrate the superiority of UBA. Notably, UBA increases the pose estimation errors of EHPS models by 17.27%-58.21% on average, revealing critical vulnerabilities. These findings underscore the urgent need to address and mitigate security risks associated with digital human generation systems.

Title: Improving regional weather forecasts with neural interpolation

Authors: James Jackaman, Oliver Sutton
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.12040
Pdf URL: https://arxiv.org/pdf/2505.12040
Copy Paste: [[2505.12040]] Improving regional weather forecasts with neural interpolation(https://arxiv.org/abs/2505.12040)
Keywords: super-resolution
Abstract: In this paper we design a neural interpolation operator to improve the boundary data for regional weather models, which is a challenging problem as we are required to map multi-scale dynamics between grid resolutions. In particular, we present a methodology for approaching the problem through the study of a simplified model, with a view to generalising the results in this work to the dynamical core of regional weather models. Our approach will exploit a combination of techniques from image super-resolution with convolutional neural networks (CNNs) and residual networks, in addition to building the flow of atmospheric dynamics into the neural network.

Title: FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition

Authors: Shuai Yuan, Guowen Xu, Hongwei Li, Rui Zhang, Xinyuan Qian, Wenbo Jiang, Hangcheng Cao, Qingchuan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12045
Pdf URL: https://arxiv.org/pdf/2505.12045
Copy Paste: [[2505.12045]] FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition(https://arxiv.org/abs/2505.12045)
Keywords: generation
Abstract: Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers are invisible under normal conditions and activated stealthily by ultraviolet light, providing superior stealthiness, flexibility, and untraceability. Inspired by real-world graffiti, we derive realistic trigger shapes and enhance their robustness via an interpolation-based fluorescence simulation algorithm. Furthermore, we develop an automated backdoor sample generation method to support three attack objectives. Extensive evaluations in the physical world demonstrate FIGhost's effectiveness against state-of-the-art detectors and VLMs, maintaining robustness under environmental variations and effectively evading existing defenses.

Title: Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

Authors: Rui Qin, Qijie Wang, Ming Sun, Haowei Zhu, Chao Zhou, Bin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12048
Pdf URL: https://arxiv.org/pdf/2505.12048
Copy Paste: [[2505.12048]] Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling(https://arxiv.org/abs/2505.12048)
Keywords: super-resolution
Abstract: Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in SR tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like super-resolution (SR). In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of Diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 - 3.0 and outperforming the current acceleration methods with only half the number of steps.
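Note on the time-dynamic sampling idea in the abstract above: the sketch below builds a non-uniform timestep schedule that places steps more densely near both ends of the diffusion trajectory, matching the abstract's observation that early and late iterations matter most for high-frequency detail. The cosine spacing, the function name `end_weighted_timesteps`, and the default `t_max=1000.0` are assumptions for illustration; the abstract does not give the schedule TSS actually uses.

```python
import math


def end_weighted_timesteps(num_steps, t_max=1000.0):
    """Return `num_steps` increasing timesteps in [0, t_max].

    Cosine spacing clusters points near 0 and near t_max and leaves
    larger gaps in the middle of the trajectory. This is a hypothetical
    stand-in for a learned or hand-tuned dynamic schedule.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return [
        0.5 * t_max * (1.0 - math.cos(math.pi * i / (num_steps - 1)))
        for i in range(num_steps)
    ]
```

A sampler would typically traverse these timesteps in reverse order (from `t_max` down to 0), spending more of a fixed step budget at the two ends than a uniform schedule would.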

Title: VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

Authors: Tianxiong Zhong, Xingye Tian, Boyuan Jiang, Xuebo Wang, Xin Tao, Pengfei Wan, Zhiwei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12053
Pdf URL: https://arxiv.org/pdf/2505.12053
Copy Paste: [[2505.12053]] VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption(https://arxiv.org/abs/2505.12053)
Keywords: generation
Abstract: Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers.

Title: LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation

Authors: Jiarui Wang, Huiyu Duan, Ziheng Jia, Yu Zhao, Woo Yi Yang, Zicheng Zhang, Zijian Chen, Juntong Wang, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12098
Pdf URL: https://arxiv.org/pdf/2505.12098
Copy Paste: [[2505.12098]] LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation(https://arxiv.org/abs/2505.12098)
Keywords: generation
Abstract: Recent advancements in large multimodal models (LMMs) have driven substantial progress in both text-to-video (T2V) generation and video-to-text (V2T) interpretation tasks. However, current AI-generated videos (AIGVs) still exhibit limitations in terms of perceptual quality and text-video alignment. Therefore, a reliable and scalable automatic model for AIGV evaluation is desirable, which heavily relies on the scale and quality of human annotations. To this end, we present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation, which features (i) comprehensive tasks, encompassing 3,050 extensive prompts across 20 fine-grained task dimensions, (ii) the largest human annotations, including 120K mean-opinion scores (MOSs) and 60K question-answering (QA) pairs annotated on 58,500 videos generated from 30 T2V models, and (iii) bidirectional benchmarking and evaluating for both T2V generation and V2T interpretation capabilities. Based on AIGVE-60K, we propose LOVE, an LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy in terms of both instance level and model level. Comprehensive experiments demonstrate that LOVE not only achieves state-of-the-art performance on the AIGVE-60K dataset, but also generalizes effectively to a wide range of other AIGV evaluation benchmarks. These findings highlight the significance of the AIGVE-60K dataset. Database and codes are anonymously available at this https URL.

Title: EarthSynth: Generating Informative Earth Observation with Diffusion Models

Authors: Jiancheng Pan, Shiye Lei, Yuqian Fu, Jiahao Li, Yanxing Liu, Yuze Sun, Xiao He, Long Peng, Xiaomeng Huang, Bo Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12108
Pdf URL: https://arxiv.org/pdf/2505.12108
Copy Paste: [[2505.12108]] EarthSynth: Generating Informative Earth Observation with Diffusion Models(https://arxiv.org/abs/2505.12108)
Keywords: generation, generative
Abstract: Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi-task generation for remote sensing. EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual Composition training strategy to improve training data diversity and enhance category control. Furthermore, a rule-based method of R-Filter is proposed to filter more informative synthetic data for downstream tasks. We evaluate our EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios, offering a practical solution for advancing RSI interpretation.

Title: Learning to Highlight Audio by Watching Movies

Authors: Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.12154
Pdf URL: https://arxiv.org/pdf/2505.12154
Copy Paste: [[2505.12154]] Learning to Highlight Audio by Watching Movies(https://arxiv.org/abs/2505.12154)
Keywords: generation
Abstract: Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: this https URL.

Title: Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather

Authors: Kui Jiang, Jing Cao, Zhaocheng Yu, Junjun Jiang, Jingchun Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12199
Pdf URL: https://arxiv.org/pdf/2505.12199
Copy Paste: [[2505.12199]] Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather(https://arxiv.org/abs/2505.12199)
Keywords: generation
Abstract: Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called ACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. To ensure the quality of the generated degradation samples, we employ LoRA adapters to fine-tune the generation weights of the diffusion model. Additionally, we integrate circular consistency loss and adversarial training to guarantee the fidelity and naturalness of the scene contents. Furthermore, we elaborate on a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2. This strategy guides the student model in learning degradation-agnostic scene information from various degradation inputs. In particular, we introduce an ordinal guidance distillation mechanism (OGD) that encourages the network to focus on uncertain regions through differential ranking, leading to a more precise depth estimation. Experimental results demonstrate that our ACDepth surpasses md4all-DD by 2.50% for night scenes and 2.61% for rainy scenes on the nuScenes dataset in terms of the absRel metric.
摘要：单眼深度估计对于自动驾驶和场景重建等应用至关重要。尽管现有方法在正常情况下表现良好，但由于领域的转移和提取场景信息的困难，它们的性能在不利天气下下降。为了解决这个问题，我们提出了一种可靠的单眼深度估计方法，称为\ textbf {acdepth}，从高质量培训数据生成和域适应的角度来看。具体而言，我们引入了一个一步扩散模型，用于生成模拟不利天气条件的样品，在训练过程中构建多核降解数据集。为了确保生成的降解样品的质量，我们采用Lora适配器来微调扩散模型的产生权重。此外，我们整合了循环一致性损失和对抗性训练，以确保场景内容的忠诚度和自然性。此外，我们详细介绍了多个跨性知识蒸馏策略（MKD），该策略鼓励学生网络从教师模型中吸收知识并预估计了任何V2的深度。该策略指导学生模型从各种降解输入中学习降解 - 不足的场景信息。特别是，我们引入了一种序数引导蒸馏机制（OGD），该机制鼓励网络通过差分排名专注于不确定的区域，从而更加精确地估计深度估计。实验结果表明，我们的ACDEPTH在夜景中超过MD4All-DD 2.50 \％，而在Nuscenes数据集上，就absrel指标而言，下雨天的雨天场景为2.61 \％。
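The absRel metric cited in the abstract is the standard absolute relative error used in depth estimation: the mean of |predicted - ground truth| / ground truth over valid pixels. A minimal sketch follows; the function name and the convention of skipping pixels with non-positive ground truth are assumptions, not details taken from the paper.

```python
# Minimal sketch of absolute relative error (absRel) for depth maps.
# Assumption: pixels whose ground-truth depth is <= 0 are treated as
# invalid and skipped, a common but not universal convention.

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

# Two valid pixels: one off by 100%, one exact.
error = abs_rel([2.0, 4.0], [1.0, 4.0])
```

Lower is better, so a reported improvement corresponds to a smaller value of this quantity.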

Title: CompBench: Benchmarking Complex Instruction-guided Image Editing

Authors: Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, Shaohui Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12200
Pdf URL: https://arxiv.org/pdf/2505.12200
Copy Paste: [[2505.12200]] CompBench: Benchmarking Complex Instruction-guided Image Editing(https://arxiv.org/abs/2505.12200)
Keywords: generation
Abstract: While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.
摘要：尽管现实世界的应用程序越来越多地要求复杂的场景操纵，但现有的指导引导的图像编辑基准通常会过多简化任务的复杂性，并且缺乏全面的，细粒度的指令。为了弥合这一差距，我们介绍了一种专门设计用于复杂指导引导的图像编辑的大规模基准。 Compbench具有挑战性的编辑方案，其中包含了以下，空间和上下文推理的精细颗粒说明，从而可以对图像编辑模型的精确操纵功能进行全面评估。为了构建Compbench，我们建议使用量身定制的任务管道提出MLLM-Human协作框架。此外，我们提出了一种指令解耦策略，该策略将编辑意图分解为四个关键维度：位置，外观，动态和对象，以确保指令和复杂的编辑要求之间更紧密的对齐。广泛的评估表明，Compbench揭示了当前图像编辑模型的基本局限性，并为开发下一代指导引导的图像编辑系统提供了关键见解。

Title: NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

Authors: Jia Li, Nan Gao, Huaibo Huang, Ran He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12235
Pdf URL: https://arxiv.org/pdf/2505.12235
Copy Paste: [[2505.12235]] NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation(https://arxiv.org/abs/2505.12235)
Keywords: generation
Abstract: The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it's an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with around only 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.
摘要：扩散模型为实施文本形象（T2I）和图像对图像（I2i）生成提供了强大的工具。最近，拓扑和纹理控制是流行的探索，例如ControlNet，IP-Adapter，Ctrl-X和DSG。这些方法明确考虑基于外部信号或扩散特征操纵的高保真可控编辑。至于多样性，他们直接选择不同的噪音潜伏期。但是，扩散的噪声能够隐式代表相应图像的拓扑和纹理歧管。此外，这是一个有效的工作台，可以在内容保存和可控变化之间进行权衡。以前的T2i和I2i扩散作品不会探索压缩的上下文潜伏期中的信息。在本文中，我们首先提出了一个由稳定扩散使用的即插即用的噪声芬太尼Noft模块，以产生高度相关和多样化的图像。我们通过最佳传输（OT）信息瓶颈（IB）微调种子噪声或逆噪声，只有大约14K可训练的参数和10分钟的训练。考虑到拓扑和纹理比对，我们的测试时间Noft擅长生产高保真图像变化。全面的实验表明，NOFT是一种有力的一般重新构想方法，可以通过文本或图像指导有效地微调2D/3D AIGC资产。

Title: PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement

Authors: ZhanFeng Feng, Long Peng, Xin Di, Yong Guo, Wenbo Li, Yulun Zhang, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12266
Pdf URL: https://arxiv.org/pdf/2505.12266
Copy Paste: [[2505.12266]] PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement(https://arxiv.org/abs/2505.12266)
Keywords: generation
Abstract: Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks. The code will be made publicly available at: this https URL.
摘要：多帧视频增强任务旨在通过利用来自多个帧的时间信息来改善视频序列的空间和时间分辨率以及质量，这些信息广泛用于流视频处理，监视和生成。尽管许多基于变压器的增强方法取得了令人印象深刻的性能，但它们的计算和内存需要阻碍边缘设备上的部署。量化通过减少重量和激活的位宽度以提高效率来提供实用的解决方案。但是，将现有量化方法直接应用于视频增强任务通常会导致大量的性能下降和细节丢失。这源于两个局限性：（a）无法分配跨帧的不同代表能力，这导致了次优的动态范围适应；（b）过度依赖完整的教师，这限制了低位学生模型的学习。为了应对这些挑战，我们提出了一种新颖的视频增强量化方法：视频增强（PMQ-VE）的进行性多框架量化。该框架具有粗到最新的两阶段过程：基于回溯的多帧量化（BMFQ）和进行性多教师蒸馏（PMTD）。 BMFQ利用基于百分比的初始化和迭代搜索，并通过修剪和回溯可靠的剪切范围。 PMTD采用完整精确和多个高位（INT）教师的渐进式蒸馏策略来增强低位模型的容量和质量。广泛的实验表明，我们的方法优于现有方法，在多个任务中实现最先进的性能，并且该HTTP URL代码将在以下位置公开可用：此HTTPS URL。

Title: Context-Aware Autoregressive Models for Multi-Conditional Image Generation

Authors: Yixiao Chen, Zhiyuan Ma, Guoli Jia, Che Jiang, Jianjun Li, Bowen Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12274
Pdf URL: https://arxiv.org/pdf/2505.12274
Copy Paste: [[2505.12274]] Context-Aware Autoregressive Models for Multi-Conditional Image Generation(https://arxiv.org/abs/2505.12274)
Keywords: generation
Abstract: Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence--offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions during inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, and show that it achieves performance competitive with diffusion-based multi-conditional control approaches and the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: this https URL.
摘要：自回归的变压器最近与最先进的扩散模型显示出令人印象深刻的图像产生质量和效率。与扩散体系结构不同，自回归模型可以自然地将任意模式纳入单个统一的令牌序列中 - 为多条件图像生成任务提供简洁的解决方案。在这项工作中，我们建议$ \ textbf {contextar} $，这是一个灵活有效的多条件图像生成框架。 ContextAR将各种条件（例如Canny边缘，深度图，姿势）直接嵌入到令牌序列中，并保留了特定于模态的语义。为了保持空间对齐，同时增强了不同条件类型之间的歧视，我们引入了混合位置编码，将旋转位置嵌入与可学习的位置嵌入融合在一起。我们设计有条件的环境 - 注意的关注以降低计算复杂性，同时保留有效的内部条件感知。如果没有进行任何微调，ContextAR支持推理时间期间条件的任意组合。实验结果证明了我们方法的强大可控性和多功能性，并表明，基于扩散的多条件控制方法的竞争性持久性是在多种多条件驱动的方案中现有的自回归基线。项目页面：$ \ href {this https url} {this https url。} $

Title: Model alignment using inter-modal bridges

Authors: Ali Gholamzadeh, Noor Sajid
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.12322
Pdf URL: https://arxiv.org/pdf/2505.12322
Copy Paste: [[2505.12322]] Model alignment using inter-modal bridges(https://arxiv.org/abs/2505.12322)
Keywords: generation
Abstract: Foundation models have demonstrated remarkable performance across modalities such as language and vision. However, model reuse across distinct modalities (e.g., text and vision) remains limited due to the difficulty of aligning internal representations. Existing methods require extensive paired training data or are constrained to specific domains. We introduce a semi-supervised approach for model alignment via conditional flow matching. The conditional flow between latent spaces of different modalities (e.g., text-to-image or biological-to-artificial neuronal activity) can be learned in two settings: (1) solving a (balanced or unbalanced) optimal transport problem with an inter-space bridge cost, and (2) performing memory-efficient alignment using labelled exemplars. Despite being constrained by the original models' capacity, our method--under both settings--matches downstream task performance of end-to-end trained models on object recognition and image generation tasks across MNIST, ImageNet, and \cite{majaj2015simple} datasets, particularly when labelled training data is scarce (<20%). Our method provides a data-efficient solution for inter-modal model alignment with minimal supervision.
摘要：基础模型表现出跨语言和愿景等方式的出色表现。但是，由于难以对齐内部表示，跨不同模式（例如，文本和视觉）的模型再利用仍然有限。现有方法需要广泛的配对培训数据或受到特定域的约束。我们引入了通过条件流匹配的半监督方法进行模型对齐方式。可以在两种情况下学习不同模态的潜在空间之间的条件流（例如，文本到图像或生物对人工神经元活动）可以学习：（$ 1 $）解决（平衡或不平衡的）最佳运输问题，并使用空间间的桥梁成本，（$ 2 $）使用标记的Exemplars执行记忆效率的校准。尽管受到原始模型的容量的限制，但我们的方法 - 在两个设置下 - 在MNIST，Imagenet和\ cite {majaj2015simple}数据集的端到端训练模型的下游任务性能下，尤其是在标记为训练数据的标记数据（$ <20 \％$）时。我们的方法为模型间模型提供了一个数据效率的解决方案，并提供了最小的监督。
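Conditional flow matching, as mentioned in the abstract, trains a velocity field to transport samples from one latent space to another. A minimal sketch of the regression target is given below, assuming the commonly used straight-line conditional path x_t = (1 - t) x0 + t x1; the paper's actual path, pairing scheme (optimal transport or labelled exemplars), and model are not reproduced here, and all function names are hypothetical.

```python
# Minimal sketch of a conditional flow matching target under the
# assumed linear path x_t = (1 - t) * x0 + t * x1.

def interpolate(x0, x1, t):
    """Point on the straight path from source latent x0 to target latent x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the linear path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def cfm_loss(pred_velocity, x0, x1):
    """Mean squared error between a predicted and the target velocity."""
    tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, tgt)) / len(tgt)

# Assumed toy pair of latents from two modalities.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
midpoint = interpolate(x0, x1, 0.5)
loss_at_optimum = cfm_loss([2.0, 4.0], x0, x1)
```

In practice a network would predict the velocity from `(x_t, t)` and the loss would be averaged over sampled pairs and times; how the pairs `(x0, x1)` are formed is where the two settings described above differ.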

Title: Is Artificial Intelligence Generated Image Detection a Solved Problem?

Authors: Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, Zhangjie Fu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2505.12335
Pdf URL: https://arxiv.org/pdf/2505.12335
Copy Paste: [[2505.12335]] Is Artificial Intelligence Generated Image Detection a Solved Problem?(https://arxiv.org/abs/2505.12335)
Keywords: generation, generative
Abstract: The rapid advancement of generative models, such as GANs and Diffusion models, has enabled the creation of highly realistic synthetic images, raising serious concerns about misinformation, deepfakes, and copyright infringement. Although numerous Artificial Intelligence Generated Image (AIGI) detectors have been proposed, often reporting high accuracy, their effectiveness in real-world scenarios remains questionable. To bridge this gap, we introduce AIGIBench, a comprehensive benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. AIGIBench simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms. Extensive experiments on 11 advanced detectors demonstrate that, despite their high reported accuracy in controlled settings, these detectors suffer significant performance drops on real-world data, limited benefits from common augmentations, and nuanced effects of pre-processing, highlighting the need for more robust detection strategies. By providing a unified and realistic evaluation framework, AIGIBench offers valuable insights to guide future research toward dependable and generalizable AIGI detection.
摘要：生成模型的快速发展，例如gan和扩散模型，使得创造了高度逼真的合成图像，从而引起了人们对错误信息，深击和版权侵权的严重关注。尽管已经提出了许多人工智能产生的图像（AIGI）探测器，但通常报告了高精度，但它们在现实世界中的有效性仍然值得怀疑。为了弥合这一差距，我们引入了Aigibench，这是一种综合基准，旨在严格评估最先进的AIGI探测器的稳健性和概括能力。 Aigibench通过四个核心任务来模拟现实世界中的挑战：多源概括，对图像降解的鲁棒性，对数据增强的敏感性以及测试时间预处理的影响。它包括23种不同的假图像子集，涵盖了高级和广泛采用的图像生成技术，以及从社交媒体和AI艺术平台收集的现实样本。对11个高级探测器进行了广泛的实验表明，尽管它们在受控设置中有很高的报道准确性，但这些检测器在现实世界数据上的性能下降，限制了共同增强的益处以及预处理的细微效果，突显了需要进行更强大的检测策略的需求。通过提供统一和现实的评估框架，Aigibench提供了宝贵的见解，以指导未来的研究，以可靠和可概括的AIGI检测。

Title: Towards Open-world Generalized Deepfake Detection: General Feature Extraction via Unsupervised Domain Adaptation

Authors: Midou Guo, Qilin Yin, Wei Lu, Xiangyang Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12339
Pdf URL: https://arxiv.org/pdf/2505.12339
Copy Paste: [[2505.12339]] Towards Open-world Generalized Deepfake Detection: General Feature Extraction via Unsupervised Domain Adaptation(https://arxiv.org/abs/2505.12339)
Keywords: generative
Abstract: With the development of generative artificial intelligence, new forgery methods are rapidly emerging. Social platforms are flooded with vast amounts of unlabeled synthetic data and authentic data, making it increasingly challenging to distinguish real from fake. Due to the lack of labels, existing supervised detection methods struggle to effectively address the detection of unknown deepfake methods. Moreover, in open-world scenarios, the amount of unlabeled data greatly exceeds that of labeled data. Therefore, we define a new deepfake detection generalization task which focuses on how to achieve efficient detection of large amounts of unlabeled data based on limited labeled data to simulate an open-world scenario. To solve the above-mentioned task, we propose a novel Open-World Deepfake Detection Generalization Enhancement Training Strategy (OWG-DS) to improve the generalization ability of existing methods. Our approach aims to transfer deepfake detection knowledge from a small amount of labeled source domain data to large-scale unlabeled target domain data. Specifically, we introduce the Domain Distance Optimization (DDO) module to align different domain features by optimizing both inter-domain and intra-domain distances. Additionally, the Similarity-based Class Boundary Separation (SCBS) module is used to enhance the aggregation of similar samples to ensure clearer class boundaries, while an adversarial training mechanism is adopted to learn the domain-invariant features. Extensive experiments show that the proposed deepfake detection generalization enhancement training strategy excels in cross-method and cross-dataset scenarios, improving the model's generalization.
摘要：随着生成人工智能的发展，新的伪造方法正在迅速出现。社交平台充斥着大量未标记的合成数据和真实数据，使得将真实区分开来使其越来越具有挑战性。由于缺乏标签，现有的监督检测方法难以有效解决未知深泡方法的检测。此外，在开放世界的情况下，未标记的数据的量大大超过了标记的数据。因此，我们定义了一项新的DeepFake检测概括任务，该任务的重点是如何基于有限的标记数据来实现大量未标记数据的有效检测以模拟开放世界的情况。为了解决上述任务，我们提出了一种新型的开放世界深层检测概括增强训练策略（OWG-DS），以提高现有方法的概括能力。我们的方法旨在将深层检测知识从少量标记的源域数据转移到大型未标记的目标域数据。具体而言，我们通过优化域间和内域距离来介绍域距离优化（DDO）模块以对齐不同的域特征。此外，基于相似性的类边界分离（SCB）模块用于增强相似样品的聚合以确保更清晰的类边界，而采用了对抗性训练机制来学习域不变特征。广泛的实验表明，所提出的深膜检测概括增强训练策略在交叉方法和跨数据库方案中擅长，从而改善了模型的概括。

Title: AbFlowNet: Optimizing Antibody-Antigen Binding Energy via Diffusion-GFlowNet Fusion

Authors: Abrar Rahman Abir, Haz Sameen Shahgir, Md Rownok Zahan Ratul, Md Toki Tahmid, Greg Ver Steeg, Yue Dong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12358
Pdf URL: https://arxiv.org/pdf/2505.12358
Copy Paste: [[2505.12358]] AbFlowNet: Optimizing Antibody-Antigen Binding Energy via Diffusion-GFlowNet Fusion(https://arxiv.org/abs/2505.12358)
Keywords: generative
Abstract: Complementarity Determining Regions (CDRs) are critical segments of an antibody that facilitate binding to specific antigens. Current computational methods for CDR design utilize reconstruction losses and do not jointly optimize binding energy, a crucial metric for antibody efficacy. Rather, binding energy optimization is done through computationally expensive Online Reinforcement Learning (RL) pipelines that rely heavily on unreliable binding energy estimators. In this paper, we propose AbFlowNet, a novel generative framework that integrates GFlowNet with Diffusion models. By framing each diffusion step as a state in the GFlowNet framework, AbFlowNet jointly optimizes standard diffusion losses and binding energy by directly incorporating energy signals into the training process, thereby unifying diffusion and reward optimization in a single procedure. Experimental results show that AbFlowNet outperforms the base diffusion model by 3.06% in amino acid recovery, 20.40% in geometric reconstruction (RMSD), and 3.60% in binding energy improvement ratio. AbFlowNet also decreases Top-1 total energy and binding energy errors by 24.8% and 38.1% without pseudo-labeling the test dataset or using computationally expensive online RL regimes.
摘要：互补性确定区域（CDR）是促进与特定抗原结合的抗体的关键段。 CDR设计的当前计算方法利用重建损失，并且不会共同优化结合能，这是抗体功效的关键度量。相反，结合能量优化是通过计算昂贵的在线增强学习（RL）管道来实现的，这在很大程度上依赖于不可靠的结合能估计器。在本文中，我们提出了Abflownet，这是一种新颖的生成框架，将Gflownet与扩散模型集成在一起。通过将每个扩散步骤作为GFLOWNET框架中的一个状态构架，Abflownet通过将能量信号直接纳入训练过程，从而在单个过程中统一扩散和奖励优化，从而共同优化标准扩散损耗和结合能。实验结果表明，Abflownet的氨基酸回收率优于3.06％，几何重建（RMSD）和3.60％的结合能量改善比率优于基础扩散模型。 Abflownet还将TOP-1总能量和结合能误差降低了24.8％和38.1％，而无需伪标记测试数据集或使用计算昂贵的在线RL Ingimes。

Title: CLIP-aware Domain-Adaptive Super-Resolution

Authors: Zhengyang Lu, Qian Xia, Weifan Wang, Feng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12391
Pdf URL: https://arxiv.org/pdf/2505.12391
Copy Paste: [[2505.12391]] CLIP-aware Domain-Adaptive Super-Resolution(https://arxiv.org/abs/2505.12391)
Keywords: super-resolution, generation
Abstract: This work introduces CLIP-aware Domain-Adaptive Super-Resolution (CDASR), a novel framework that addresses the critical challenge of domain generalization in single image super-resolution. By leveraging the semantic capabilities of CLIP (Contrastive Language-Image Pre-training), CDASR achieves unprecedented performance across diverse domains and extreme scaling factors. The proposed method integrates a CLIP-guided feature alignment mechanism with a meta-learning inspired few-shot adaptation strategy, enabling efficient knowledge transfer and rapid adaptation to target domains. A custom domain-adaptive module processes CLIP features alongside super-resolution features through a multi-stage transformation process, including CLIP feature processing, spatial feature generation, and feature fusion. This intricate process ensures effective incorporation of semantic information into the super-resolution pipeline. Additionally, CDASR employs a multi-component loss function that combines pixel-wise reconstruction, perceptual similarity, and semantic consistency. Extensive experiments on benchmark datasets demonstrate CDASR's superiority, particularly in challenging scenarios. On the Urban100 dataset at ×8 scaling, CDASR achieves a significant PSNR gain of 0.15dB over existing methods, with even larger improvements of up to 0.30dB observed at ×16 scaling.
摘要：这项工作介绍了剪贴感知的域自适应超分辨率（CDASR），这是一个新型框架，该框架解决了单个图像超分辨率中域概括的关键挑战。通过利用剪辑的语义能力（对比语言图像预训练），CDASR在不同领域和极端缩放因素上实现了前所未有的性能。所提出的方法将夹子引导的特征对准机制与元学习的启发启发了几乎没有射击的适应策略，从而使有效的知识转移和对目标域的快速适应。自定义域自适应模块过程通过多阶段转换过程，包括剪辑特征处理，空间特征生成和功能融合，以及超级分辨率的功能以及超分辨率功能。这个复杂的过程确保有效地将语义信息纳入超分辨率管道。此外，CDASR采用多组分损耗函数，结合了像素的重建，感知相似性和语义一致性。基准数据集的广泛实验证明了CDASR的优势，尤其是在具有挑战性的情况下。在$ \ times $ 8缩放的Urban100数据集上，CDASR比现有方法的PSNR增益显着0.15dB，甚至在$ \ times $ \ times $ 16缩放下观察到的更大的改进。
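To put the reported PSNR gains in context, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and converts a gain in dB into the equivalent mean-squared-error ratio. The function names and the 8-bit peak value of 255 are assumptions for illustration; benchmark protocols (for example, evaluation on the luminance channel or border cropping) are not modeled.

```python
import math

# Minimal sketch of PSNR and of what a dB gain means in MSE terms.
# Assumption: 8-bit images, so the peak signal value is 255.

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)

ratio_015 = mse_ratio_for_gain(0.15)  # about 0.966
ratio_030 = mse_ratio_for_gain(0.30)  # about 0.933
```

Under this definition, a 0.15 dB gain corresponds to roughly 3.4% lower MSE and a 0.30 dB gain to roughly 6.7% lower MSE.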

Title: Few-Shot Concept Unlearning with Low Rank Adaptation

Authors: Udaya Shreyas, L.N. Aadarsh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12395
Pdf URL: https://arxiv.org/pdf/2505.12395
Copy Paste: [[2505.12395]] Few-Shot Concept Unlearning with Low Rank Adaptation(https://arxiv.org/abs/2505.12395)
Keywords: generation, generative
Abstract: Image Generation models are a trending topic nowadays, with many people utilizing Artificial Intelligence models in order to generate images. There are many such models which, given a text prompt, will generate an image which depicts said prompt. There are many image generation models, such as Latent Diffusion Models, Denoising Diffusion Probabilistic Models, Generative Adversarial Networks and many more. When generating images, these models can generate sensitive image data, which can be threatening to privacy or may violate copyright laws of private entities. Machine unlearning aims at removing the influence of specific data subsets from the trained models and, in the case of image generation models, removing the influence of a concept such that the model is unable to generate said images of the concept when prompted. Conventional retraining of the model can take up to days, hence fast algorithms are the need of the hour. In this paper we propose an algorithm that aims to remove the influence of concepts in diffusion models through updating the gradients of the final layers of the text encoders. Using a weighted loss function, we utilize backpropagation in order to update the weights of the final layers of the Text Encoder component of the Stable Diffusion Model, removing influence of the concept from the text-image embedding space, such that when prompted, the result is an image not containing the concept. The weighted loss function makes use of Textual Inversion and Low-Rank Adaptation. We perform our experiments on Latent Diffusion Models, namely the Stable Diffusion v2 model, with an average concept unlearning runtime of 50 seconds using 4-5 images.
摘要：如今，图像生成模型是一个流行的主题，许多人利用人工智能模型来生成图像。有许多这样的模型，鉴于文本的提示，它们会产生一个描绘提示的图像。有许多图像生成模型，例如潜在扩散模型，扩散概率模型，生成对抗网络等。当生成图像时，这些模型可以生成敏感的图像数据，这可能威胁到隐私或违反私人实体的版权法。机器的旨在从训练有素的模型中删除特定数据子集的影响，并且在图像生成模型的情况下，删除概念的影响，以使该模型在提示时无法生成该概念的上述图像。该模型的常规重新培训可能需要长达几天，因此快速算法是小时的需要。在本文中，我们提出了一种算法，该算法旨在通过更新文本编码器的最终层的梯度来消除扩散模型中概念的影响。使用加权损耗函数，我们利用反向传播来更新稳定扩散模型的文本编码器组件的最终层的权重，从而从文本图像嵌入空间中删除了概念的影响，因此，当提示时，结果是不包含概念的图像。加权损耗函数利用文本反转和低级别的HTTP URL在潜在扩散模型上执行我们的实验，即稳定的扩散V2模型，使用4-5张图像的平均概念读取时间为50秒。

Title: DPCD: A Quality Assessment Database for Dynamic Point Clouds

Authors: Yating Liu, Yujie Zhang, Qi Yang, Yiling Xu, Zhu Li, Ye-Kui Wang
Subjects: cs.CV, cs.DB
Abstract URL: https://arxiv.org/abs/2505.12431
Pdf URL: https://arxiv.org/pdf/2505.12431
Copy Paste: [[2505.12431]] DPCD: A Quality Assessment Database for Dynamic Point Clouds(https://arxiv.org/abs/2505.12431)
Keywords: quality assessment
Abstract: Recently, the advancements in Virtual/Augmented Reality (VR/AR) have driven the demand for Dynamic Point Clouds (DPC). Unlike static point clouds, DPCs are capable of capturing temporal changes within objects or scenes, offering a more accurate simulation of the real world. While significant progress has been made in the quality assessment research of static point clouds, little study has been done on Dynamic Point Cloud Quality Assessment (DPCQA), which hinders the development of quality-oriented applications, such as interframe compression and transmission in practical scenarios. In this paper, we introduce a large-scale DPCQA database, named DPCD, which includes 15 reference DPCs and 525 distorted DPCs from seven types of lossy compression and noise distortion. By rendering these samples to Processed Video Sequences (PVS), a comprehensive subjective experiment is conducted to obtain Mean Opinion Scores (MOS) from 21 viewers for analysis. The characteristics of contents, impact of various distortions, and accuracy of MOSs are presented to validate the heterogeneity and reliability of the proposed database. Furthermore, we evaluate the performance of several objective metrics on DPCD. The experiment results show that DPCQA is more challenging than static point cloud quality assessment. The DPCD, which serves as a catalyst for new research endeavors on DPCQA, is publicly available at this https URL.
摘要：最近，虚拟/增强现实（VR/AR）的进步推动了对动态点云（DPC）的需求。与静态点云不同，DPC能够捕获对象或场景中的时间变化，从而对现实世界进行更准确的模拟。尽管在静态点云的质量评估研究中取得了重大进展，但对动态云质量评估（DPCQA）的研究很少，这阻碍了以质量为导向的应用的开发，例如在实际情况下框架间压缩和传输。在本文中，我们引入了一个名为DPCD的大规模DPCQA数据库，其中包括15种参考DPC和525个失真的DPC，来自七种类型的有损压缩和噪声失真。通过将这些样本渲染为处理的视频序列（PVS），进行了全面的主观实验，以从21位观众中获得平均意见分数（MOS）进行分析。提出了内容的特征，各种扭曲的影响以及苔藓的准确性，以验证所提出的数据库的异质性和可靠性。此外，我们评估了DPCD上几个客观指标的性能。实验结果表明，DPCQA比静态点云更具挑战性。 DPCD可作为DPCQA新研究的催化剂，可在此HTTPS URL上公开使用。

Title: Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning

Authors: Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, Randall Balestriero
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.12477
Pdf URL: https://arxiv.org/pdf/2505.12477
Copy Paste: [[2505.12477]] Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning(https://arxiv.org/abs/2505.12477)
Keywords: generation
Abstract: Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.
摘要：重建和联合嵌入已成为自我监督学习（SSL）的两个主要范式。重建方法的重点是从输入空间中的不同视图中恢复原始样本。另一方面，关节嵌入方法对齐潜在空间中不同视图的表示。两种方法都具有令人信服的优势，但是从业者缺乏在他们之间选择的明确指南。在这项工作中，我们揭示了区分每个范式的核心机制。通过利用两种方法的封闭形式解决方案，我们精确地表征了视图生成过程，例如数据增强，影响学习的表示。然后，我们证明，与受监督的学习不同，这两个SSL范式都需要在增强和无关的特征之间最小的比对，以随着样本量的增加而实现渐近的优化性。我们的发现表明，在这些无关的特征具有很大范围的情况下，关节嵌入方法是可取的，因为与基于重建的方法相比，它们施加了严格的弱比对条件。这些结果不仅阐明了两个范式之间的交易折扣，而且还证实了在现实世界中挑战数据集上联合嵌入方法的经验成功。

Title: Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation

Authors: Sangmin Jung, Utkarsh Nath, Yezhou Yang, Giulia Pedrielli, Joydeep Biswas, Amy Zhang, Hassan Ghasemzadeh, Pavan Turaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12486
Pdf URL: https://arxiv.org/pdf/2505.12486
Copy Paste: [[2505.12486]] Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation(https://arxiv.org/abs/2505.12486)
Keywords: generation
Abstract: Text-to-image generation models have achieved remarkable capabilities in synthesizing images, but often struggle to provide fine-grained control over the output. Existing guidance approaches, such as segmentation maps and depth maps, introduce spatial rigidity that restricts the inherent diversity of diffusion models. In this work, we introduce Deep Geometric Moments (DGM) as a novel form of guidance that encapsulates the subject's visual features and nuances through a learned geometric prior. DGMs focus specifically on the subject itself compared to DINO or CLIP features, which suffer from overemphasis on global image features or semantics. Unlike ResNets, which are sensitive to pixel-wise perturbations, DGMs rely on robust geometric moments. Our experiments demonstrate that DGM effectively balances control and diversity in diffusion-based image generation, allowing a flexible control mechanism for steering the diffusion process.
摘要：文本到图像生成模型在综合图像中具有显着的功能，但通常很难提供对输出的细粒度控制。现有的指导方法，例如分割图和深度图，引入了空间刚性，从而限制了扩散模型的固有多样性。在这项工作中，我们将深层几何矩（DGM）引入了一种新型的指导形式，该指导形式通过学习的几何形式封装了对象的视觉特征和细微差别。与恐龙或剪辑特征相比，DGMS专门关注该主题本身，这些功能过分强调全球图像特征或语义。与对像素的扰动敏感的重置不同，DGM依赖于强大的几何矩。我们的实验表明，DGM有效地平衡了基于扩散的图像产生中的控制和多样性，从而允许柔性控制机制来探定扩散过程。

Title: Video-GPT via Next Clip Diffusion

Authors: Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12489
Pdf URL: https://arxiv.org/pdf/2505.12489
Copy Paste: [[2505.12489]] Video-GPT via Next Clip Diffusion(https://arxiv.org/abs/2505.12489)
Keywords: generation
Abstract: GPT has shown remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show that our Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it can be well adapted to 6 mainstream video tasks in both video generation and understanding, showing its great generalization capacity on downstream tasks. The project page is at this https URL.

Title: Unsupervised Invariant Risk Minimization

Authors: Yotam Norman, Ron Meir
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12506
Pdf URL: https://arxiv.org/pdf/2505.12506
Copy Paste: [[2505.12506]] Unsupervised Invariant Risk Minimization(https://arxiv.org/abs/2505.12506)
Keywords: generation, generative
Abstract: We propose a novel unsupervised framework for Invariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that disentangles environment-invariant and environment-dependent latent factors. Our approach is based on a novel "unsupervised" structural causal model and supports environment-conditioned sample-generation and intervention. Empirical evaluations on a synthetic dataset and modified versions of MNIST demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.

Title: Towards Budget-Friendly Model-Agnostic Explanation Generation for Large Language Models

Authors: Junhao Liu, Haonan Yu, Xin Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12509
Pdf URL: https://arxiv.org/pdf/2505.12509
Copy Paste: [[2505.12509]] Towards Budget-Friendly Model-Agnostic Explanation Generation for Large Language Models(https://arxiv.org/abs/2505.12509)
Keywords: generation
Abstract: With large language models (LLMs) becoming increasingly prevalent in various applications, the need for interpreting their predictions has become a critical challenge. As LLMs vary in architecture and some are closed-source, model-agnostic techniques show great promise without requiring access to the model's internal parameters. However, existing model-agnostic techniques need to invoke LLMs many times to gain sufficient samples for generating faithful explanations, which leads to high economic costs. In this paper, we show that it is practical to generate faithful explanations for large-scale LLMs by sampling from some budget-friendly models through a series of empirical studies. Moreover, we show that such proxy explanations also perform well on downstream tasks. Our analysis provides a new paradigm of model-agnostic explanation methods for LLMs, by including information from budget-friendly models.

Title: Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets

Authors: Ahmet Bilican, M. Akın Yılmaz, A. Murat Tekalp, R. Gökberk Cinbiş
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2505.12532
Pdf URL: https://arxiv.org/pdf/2505.12532
Copy Paste: [[2505.12532]] Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets(https://arxiv.org/abs/2505.12532)
Keywords: generation
Abstract: Efficiently adapting large foundation models is critical, especially with tight compute and memory budgets. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA offer limited granularity and effectiveness in few-parameter regimes. We propose Wavelet Fine-Tuning (WaveFT), a novel PEFT method that learns highly sparse updates in the wavelet domain of residual matrices. WaveFT allows precise control of trainable parameters, offering fine-grained capacity adjustment and excelling with remarkably low parameter count, potentially far fewer than LoRA's minimum -- ideal for extreme parameter-efficient scenarios. In order to demonstrate the effect of the wavelet transform, we compare WaveFT with a special case, called SHiRA, that entails applying sparse updates directly in the weight domain. Evaluated on personalized text-to-image generation using Stable Diffusion XL as baseline, WaveFT significantly outperforms LoRA and other PEFT methods, especially at low parameter counts, achieving superior subject fidelity, prompt alignment, and image diversity.
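
Illustrative sketch (not the authors' code): the core idea in the abstract, a weight update that is sparse in the wavelet domain, can be shown with a one-level Haar transform written in plain Python. The choice of Haar, the single decomposition level, the random choice of coefficient positions, and the helper names (`sparse_wavelet_update`, `haar_2d`, `ihaar_2d`) are all assumptions made for illustration; the abstract does not specify them. Matrix dimensions are assumed to be even.

```python
import random

SCALE = 2 ** -0.5  # normalisation that makes the Haar transform orthonormal


def haar_1d(v):
    """One-level orthonormal Haar transform of an even-length list."""
    half = len(v) // 2
    avg = [(v[2 * i] + v[2 * i + 1]) * SCALE for i in range(half)]
    diff = [(v[2 * i] - v[2 * i + 1]) * SCALE for i in range(half)]
    return avg + diff


def ihaar_1d(c):
    """Inverse of haar_1d."""
    half = len(c) // 2
    out = []
    for i in range(half):
        out.append((c[i] + c[half + i]) * SCALE)
        out.append((c[i] - c[half + i]) * SCALE)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def haar_2d(m):
    """Separable 2D transform: rows first, then columns."""
    rows = [haar_1d(r) for r in m]
    return transpose([haar_1d(c) for c in transpose(rows)])


def ihaar_2d(coeffs):
    """Inverse of haar_2d."""
    rows = [ihaar_1d(r) for r in coeffs]
    return transpose([ihaar_1d(c) for c in transpose(rows)])


def sparse_wavelet_update(shape, k, seed=0):
    """Return (coeffs, delta_w) where only k wavelet coefficients are nonzero.

    The k values stand in for trainable parameters; here they are random
    placeholders rather than learned values.
    """
    rng = random.Random(seed)
    n_rows, n_cols = shape
    coeffs = [[0.0] * n_cols for _ in range(n_rows)]
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    for i, j in rng.sample(cells, k):
        coeffs[i][j] = rng.gauss(0.0, 0.01)
    return coeffs, ihaar_2d(coeffs)


coeffs, delta_w = sparse_wavelet_update((4, 4), k=3)
# delta_w would be added to a frozen weight matrix: W_adapted = W + delta_w
```

The trainable parameter count is exactly `k`, which can be set to any integer up to the number of matrix entries. This is the fine-grained capacity control the abstract contrasts with LoRA's rank-based granularity.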

Title: ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models

Authors: Adrian Mirza, Nawaf Alampara, Martiño Ríos-García, Mohamed Abdelalim, Jack Butler, Bethany Connolly, Tunca Dogan, Marianna Nezhurina, Bünyamin Şen, Santosh Tirunagari, Mark Worrall, Adamo Young, Philippe Schwaller, Michael Pieler, Kevin Maik Jablonka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.12534
Pdf URL: https://arxiv.org/pdf/2505.12534
Copy Paste: [[2505.12534]] ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models(https://arxiv.org/abs/2505.12534)
Keywords: generation
Abstract: Foundation models have shown remarkable success across scientific domains, yet their impact in chemistry remains limited due to the absence of diverse, large-scale, high-quality datasets that reflect the field's multifaceted nature. We present the ChemPile, an open dataset containing over 75 billion tokens of curated chemical data, specifically built for training and evaluating general-purpose models in the chemical sciences. The dataset mirrors the human learning journey through chemistry -- from educational foundations to specialized expertise -- spanning multiple modalities and content types including structured data in diverse chemical representations (SMILES, SELFIES, IUPAC names, InChI, molecular renderings), scientific and educational text, executable code, and chemical images. ChemPile integrates foundational knowledge (textbooks, lecture notes), specialized expertise (scientific articles and language-interfaced data), visual understanding (molecular structures, diagrams), and advanced reasoning (problem-solving traces and code) -- mirroring how human chemists develop expertise through diverse learning materials and experiences. Constructed through hundreds of hours of expert curation, the ChemPile captures both foundational concepts and domain-specific complexity. We provide standardized training, validation, and test splits, enabling robust benchmarking. ChemPile is openly released via HuggingFace with a consistent API, permissive license, and detailed documentation. We hope the ChemPile will serve as a catalyst for chemical AI, enabling the development of the next generation of chemical foundation models.

Title: Event-based Star Tracking under Spacecraft Jitter: the e-STURT Dataset

Authors: Samya Bagchi, Peter Anastasiou, Matthew Tetlow, Tat-Jun Chin, Yasir Latif
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2505.12588
Pdf URL: https://arxiv.org/pdf/2505.12588
Copy Paste: [[2505.12588]] Event-based Star Tracking under Spacecraft Jitter: the e-STURT Dataset(https://arxiv.org/abs/2505.12588)
Keywords: generation
Abstract: Jitter degrades a spacecraft's fine-pointing ability required for optical communication, earth observation, and space domain awareness. Development of jitter estimation and compensation algorithms requires high-fidelity sensor observations representative of on-board jitter. In this work, we present the Event-based Star Tracking Under Jitter (e-STURT) dataset -- the first event camera based dataset of star observations under controlled jitter conditions. Specialized hardware employed for the dataset emulates an event-camera undergoing on-board jitter. While the event camera provides asynchronous, high temporal resolution star observations, systematic and repeatable jitter is introduced using a micrometer accurate piezoelectric actuator. Various jitter sources are simulated using distinct frequency bands and utilizing both axes of motion. Ground-truth jitter is captured in hardware from the piezoelectric actuator. The resulting dataset consists of 200 sequences and is made publicly available. This work highlights the dataset generation process, technical challenges and the resulting limitations. To serve as a baseline, we propose a high-frequency jitter estimation algorithm that operates directly on the event stream. The e-STURT dataset will enable the development of jitter aware algorithms for mission critical event-based space sensing applications.

Title: SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models

Authors: Bo Liu, Pengfei Qiao, Minhan Ma, Xuange Zhang, Yinan Tang, Peng Xu, Kun Liu, Tongtong Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12589
Pdf URL: https://arxiv.org/pdf/2505.12589
Copy Paste: [[2505.12589]] SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models(https://arxiv.org/abs/2505.12589)
Keywords: generation
Abstract: Understanding surveillance video content remains a critical yet underexplored challenge in vision-language research, particularly due to its real-world complexity, irregular event dynamics, and safety-critical implications. In this work, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types, including temporal reasoning, causal inference, spatial understanding, and anomaly interpretation, across both normal and abnormal video scenarios. To construct the benchmark at scale, we design a hybrid annotation pipeline that combines temporally aligned human-written captions with Large Vision-Language Model-assisted QA generation using prompt-based techniques. We also propose a multi-dimensional evaluation protocol to assess contextual, temporal, and causal comprehension. We evaluate eight LVLMs under this framework, revealing significant performance gaps, especially in causal and anomaly-related tasks, underscoring the limitations of current models in real-world surveillance contexts. Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications such as intelligent monitoring, incident analysis, and autonomous decision-making.

Title: Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking

Authors: Shiyu Xuan, Zechao Li, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12606
Pdf URL: https://arxiv.org/pdf/2505.12606
Copy Paste: [[2505.12606]] Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking(https://arxiv.org/abs/2505.12606)
Keywords: generation
Abstract: Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language to provide additional information beyond RGB images, showing great potential in improving tracking stabilization in complex scenarios. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. Constrained by the limited multi-modal training data, the performance of these methods is unsatisfactory. To alleviate this limitation, this work proposes a unified multi-modal tracker Diff-MM by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model. Diff-MM leverages the UNet of pre-trained Stable Diffusion as a tracking feature extractor through the proposed parallel feature extraction pipeline, which enables pairwise image inputs for object tracking. We further introduce a multi-modal sub-module tuning method that learns to gain complementary information between different modalities. By harnessing the extensive prior knowledge in the generation model, we achieve a unified tracker with uniform parameters for RGB-N/D/T/E tracking. Experimental results demonstrate the promising performance of our method compared with recently proposed trackers, e.g., its AUC outperforms OneTracker by 8.3% on TNL2K.

Title: BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Authors: Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan YU, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12620
Pdf URL: https://arxiv.org/pdf/2505.12620
Copy Paste: [[2505.12620]] BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation(https://arxiv.org/abs/2505.12620)
Keywords: generation, generative
Abstract: Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the fast development of video generation models, such as Sora and WanX, there is currently a lack of large-scale, high-quality AI-generated video datasets for forgery detection. In addition, existing detection approaches predominantly treat the task as binary classification, lacking explainability in model decision-making and failing to provide actionable insights or guidance for the public. To address these challenges, we propose GenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, diverse latest generative techniques, and real-world scenes. We further introduce BusterX, a novel AI-generated video detection and explanation framework leveraging multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationale. To our knowledge, GenBuster-200K is the first large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios. BusterX is the first framework to integrate MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.

Title: Dual-Agent Reinforcement Learning for Automated Feature Generation

Authors: Wanfu Gao, Zengyao Man, Hanlin Pan, Kunpeng Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.12628
Pdf URL: https://arxiv.org/pdf/2505.12628
Copy Paste: [[2505.12628]] Dual-Agent Reinforcement Learning for Automated Feature Generation(https://arxiv.org/abs/2505.12628)
Keywords: generation
Abstract: Feature generation involves creating new features from raw data to capture complex relationships among the original features, improving model robustness and machine learning performance. Current methods using reinforcement learning for feature generation have made feature exploration more flexible and efficient. However, several challenges remain: first, during feature expansion, a large number of redundant features are generated. When removing them, current methods only retain the best features each round, neglecting those that perform poorly initially but could improve later. Second, the state representation used by current methods fails to fully capture complex feature relationships. Third, there are significant differences between discrete and continuous features in tabular data, requiring different operations for each type. To address these challenges, we propose a novel dual-agent reinforcement learning method for feature generation. Two agents are designed: the first generates new features, and the second determines whether they should be preserved. A self-attention mechanism enhances state representation, and diverse operations distinguish interactions between discrete and continuous features. The experimental results on multiple datasets demonstrate that the proposed method is effective. The code is available at this https URL.

Title: Degradation-Aware Feature Perturbation for All-in-One Image Restoration

Authors: Xiangpeng Tian, Xiangyu Liao, Xiao Liu, Meng Li, Chao Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12630
Pdf URL: https://arxiv.org/pdf/2505.12630
Copy Paste: [[2505.12630]] Degradation-Aware Feature Perturbation for All-in-One Image Restoration(https://arxiv.org/abs/2505.12630)
Keywords: restoration
Abstract: All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our codes are available at this https URL.

Title: Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Authors: Yunseok Jang, Yeda Song, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Dong-Ki Kim, Kyunghoon Bae, Honglak Lee
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.12632
Pdf URL: https://arxiv.org/pdf/2505.12632
Copy Paste: [[2505.12632]] Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents(https://arxiv.org/abs/2505.12632)
Keywords: generation
Abstract: Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single OS datasets while achieving an average performance gain of 18.11 percentage points on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.

Title: MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control

Authors: Mingqi Shao, Feng Xiong, Zhaoxu Sun, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12635
Pdf URL: https://arxiv.org/pdf/2505.12635
Copy Paste: [[2505.12635]] MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control(https://arxiv.org/abs/2505.12635)
Keywords: generation
Abstract: Recently, significant advances have been made in 3D object generation. Building upon the generated geometry, current pipelines typically employ image diffusion models to generate multi-view RGB images, followed by UV texture reconstruction through texture baking. While 3D geometry generation has improved significantly, supported by multiple open-source frameworks, 3D texture generation remains underexplored. In this work, we systematically investigate 3D texture generation through the lens of three core dimensions: reference-texture alignment, geometry-texture consistency, and local texture quality. To tackle these issues, we propose MVPainter, which employs data filtering and augmentation strategies to enhance texture fidelity and detail, and introduces ControlNet-based geometric conditioning to improve texture-geometry alignment. Furthermore, we extract physically-based rendering (PBR) attributes from the generated views to produce PBR meshes suitable for real-world rendering applications. MVPainter achieves state-of-the-art results across all three dimensions, as demonstrated by human-aligned evaluations. To facilitate further research and reproducibility, we also release our full pipeline as an open-source system, including data construction, model architecture, and evaluation tools.

Title: Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Authors: Zihan Su, Xuerui Qiu, Hongbin Xu, Tangyu Jiang, Junhao Zhuang, Chun Yuan, Ming Li, Shengfeng He, Fei Richard Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12667
Pdf URL: https://arxiv.org/pdf/2505.12667
Copy Paste: [[2505.12667]] Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking(https://arxiv.org/abs/2505.12667)
Keywords: generation, generative
Abstract: The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. We will release our code upon publication.
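
Illustrative sketch (not the authors' code): the coarse stage of the matching described above, where each watermark patch is assigned to the most visually similar frame, might look like the following. Cosine similarity on raw flattened values, the non-overlapping tiling, and the helper names (`split_into_patches`, `assign_patches`) are assumptions for illustration; the abstract does not state which similarity measure or features are used, and the fine stage (localizing each patch to a spatial region) is omitted. Image height and width are assumed to be multiples of the patch size.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_patches(image, patch):
    """Split a 2D list into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    tiles = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            tiles.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return tiles


def assign_patches(patches, frame_features):
    """For each patch, return the index of the most similar frame feature."""
    return [
        max(range(len(frame_features)), key=lambda f: cosine(p, frame_features[f]))
        for p in patches
    ]


# Toy watermark (2 x 4) split into two 2 x 2 patches.
watermark = [
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
]
# Hypothetical per-frame feature vectors with the same length as a flattened patch.
frames = [
    [0.0, 0.9, 0.1, 1.0],
    [0.8, 0.1, 1.0, 0.0],
]

patches = split_into_patches(watermark, patch=2)
assignment = assign_patches(patches, frames)
print(assignment)  # [1, 0]
```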

Title: Few-Step Diffusion via Score identity Distillation

Authors: Mingyuan Zhou, Yi Gu, Zhendong Wang
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.12674
Pdf URL: https://arxiv.org/pdf/2505.12674
Copy Paste: [[2505.12674]] Few-Step Diffusion via Score identity Distillation(https://arxiv.org/abs/2505.12674)
Keywords: generation
Abstract: Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) -- a data-free, one-step distillation framework -- for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images. Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at this https URL.
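
For context on the guidance terms used above, one common way of writing classifier-free guidance is given below. This is the standard formulation, not one quoted from the paper; conventions for the scale differ between papers, and the symbols ($\epsilon_\theta$ for the noise predictor, $c$ for the text condition, $\varnothing$ for the null condition, $w$ for the guidance scale) are chosen here for illustration.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

Under this convention, $w = 1$ recovers the plain conditional prediction (no guidance), $w = 0$ recovers the unconditional prediction, $w > 1$ strengthens text alignment at the cost of diversity, and $w < 0$ pushes the prediction away from the condition. The abstract does not give the scale values or the exact parameterization behind Zero-CFG and Anti-CFG, so any mapping onto specific values of $w$ should be taken from the paper itself.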

Title: CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models

Authors: Shristi Das Biswas, Arani Roy, Kaushik Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12677
Pdf URL: https://arxiv.org/pdf/2505.12677
Copy Paste: [[2505.12677]] CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models(https://arxiv.org/abs/2505.12677)
Keywords: generation
Abstract: As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single-step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed form, guaranteeing extremely efficient erasure in only 2 seconds. Benchmarked against prior approaches, CURE achieves more efficient and thorough removal of targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability, and demonstrates enhanced robustness against red-teaming.
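
Illustration (not from the paper): the abstract describes a closed-form orthogonal projection built from an SVD over "forget" and "retain" token embeddings and applied to the weights in one step. The sketch below shows that general idea in NumPy under loud assumptions: the function names, the rank parameters, and the particular way the retain subspace is excluded are all hypothetical, and this is not the Spectral Eraser or its Expansion Mechanism.

```python
import numpy as np


def subspace_projector(embeddings, rank):
    """Orthogonal projector onto the top-`rank` right singular subspace.

    `embeddings` has shape (num_tokens, dim); rows are token embeddings.
    """
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    basis = vt[:rank].T  # (dim, rank), orthonormal columns
    return basis @ basis.T


def edit_weights(weight, forget_emb, retain_emb, rank_forget=1, rank_retain=1):
    """Return an edited copy of `weight` (shape (out_dim, dim)).

    Assumption: inputs that lie in the retain subspace must pass through
    unchanged, while the forget directions outside it are suppressed.
    """
    dim = weight.shape[1]
    p_forget = subspace_projector(forget_emb, rank_forget)
    p_retain = subspace_projector(retain_emb, rank_retain)
    # Strip the retain component first, then project onto the forget subspace.
    discriminative = p_forget @ (np.eye(dim) - p_retain)
    # Single closed-form update: no training loop, no gradient steps.
    return weight @ (np.eye(dim) - discriminative)
```

In this toy construction, any input lying inside the retain subspace is mapped exactly as before, while a forget direction orthogonal to the retain subspace is mapped to zero.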

Title: PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

Authors: Yingchen He, Christian D. Weilbach, Martyna E. Wojciechowska, Yuxuan Zhang, Frank Wood
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.12707
Pdf URL: https://arxiv.org/pdf/2505.12707
Copy Paste: [[2505.12707]] PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI(https://arxiv.org/abs/2505.12707)
Keywords: generative
Abstract: Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. (We have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time.) Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.

Title: Confidence-Regulated Generative Diffusion Models for Reliable AI Agent Migration in Vehicular Metaverses

Authors: Yingkai Kang, Jiawen Kang, Jinbo Wen, Tao Zhang, Zhaohui Yang, Dusit Niyato, Yan Zhang
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2505.12710
Pdf URL: https://arxiv.org/pdf/2505.12710
Copy Paste: [[2505.12710]] Confidence-Regulated Generative Diffusion Models for Reliable AI Agent Migration in Vehicular Metaverses(https://arxiv.org/abs/2505.12710)
Keywords: generative
Abstract: Vehicular metaverses are an emerging paradigm that merges intelligent transportation systems with virtual spaces, leveraging advanced digital twin and Artificial Intelligence (AI) technologies to seamlessly integrate vehicles, users, and digital environments. In this paradigm, vehicular AI agents are endowed with environment perception, decision-making, and action execution capabilities, enabling real-time processing and analysis of multi-modal data to provide users with customized interactive services. Since vehicular AI agents require substantial resources for real-time decision-making under vehicle mobility and changing network conditions, the AI agents are deployed in RoadSide Units (RSUs) with sufficient resources and dynamically migrated among them. However, AI agent migration requires frequent data exchanges, which may expose vehicular metaverses to potential cyber attacks. To this end, we propose a reliable vehicular AI agent migration framework, achieving reliable dynamic migration and efficient resource scheduling through cooperation between vehicles and RSUs. Additionally, we design a trust evaluation model based on the theory of planned behavior to dynamically quantify the reputation of RSUs, thereby better accommodating the personalized trust preferences of users. We then model the vehicular AI agent migration process as a partially observable Markov decision process and develop a Confidence-regulated Generative Diffusion Model (CGDM) to efficiently generate AI agent migration decisions. Numerical results demonstrate that the CGDM algorithm significantly outperforms baseline methods in reducing system latency and enhancing robustness against cyber attacks.

Title: Any-to-Any Learning in Computational Pathology via Triplet Multimodal Pretraining

Authors: Qichen Sun, Zhengrui Guo, Rui Peng, Hao Chen, Jinzhuo Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12711
Pdf URL: https://arxiv.org/pdf/2505.12711
Copy Paste: [[2505.12711]] Any-to-Any Learning in Computational Pathology via Triplet Multimodal Pretraining(https://arxiv.org/abs/2505.12711)
Keywords: generation
Abstract: Recent advances in computational pathology and artificial intelligence have significantly enhanced the utilization of gigapixel whole-slide images and additional modalities (e.g., genomics) for pathological diagnosis. Although deep learning has demonstrated strong potential in pathology, several key challenges persist: (1) fusing heterogeneous data types requires sophisticated strategies beyond simple concatenation due to high computational costs; (2) common scenarios of missing modalities necessitate flexible strategies that allow the model to learn robustly in the absence of certain modalities; (3) the downstream tasks in CPath are diverse, ranging from unimodal to multimodal, necessitating a unified model capable of handling all modalities. To address these challenges, we propose ALTER, an any-to-any tri-modal pretraining framework that integrates WSIs, genomics, and pathology reports. The term "any" emphasizes ALTER's modality-adaptive design, enabling flexible pretraining with any subset of modalities, and its capacity to learn robust, cross-modal representations beyond WSI-centric approaches. We evaluate ALTER across extensive clinical tasks including survival prediction, cancer subtyping, gene mutation prediction, and report generation, achieving superior or comparable performance to state-of-the-art baselines.

Title: MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Authors: Jinhua Zhang, Wei Long, Minghao Han, Weiyi You, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12742
Pdf URL: https://arxiv.org/pdf/2505.12742
Copy Paste: [[2505.12742]] MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning(https://arxiv.org/abs/2505.12742)
Keywords: generation
Abstract: Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes as input only the features of the adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, in pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small models trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
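
Illustration (not from the paper): the O(N^2) to O(Nk) claim can be checked by counting query-key pairs under a neighborhood mask. The sketch below is a simplified 1D stand-in, assuming an odd window size k centered on each token; the paper's attention works over corresponding positions on adjacent scales, and the function name `local_attention_mask` is hypothetical.

```python
import numpy as np


def local_attention_mask(num_tokens, window):
    """Boolean (num_tokens, num_tokens) mask; True where attention is allowed.

    Each query may attend only to keys within `window // 2` positions of
    itself, so every row has at most `window` True entries (for odd `window`).
    """
    positions = np.arange(num_tokens)
    distance = np.abs(positions[:, None] - positions[None, :])
    return distance <= window // 2


num_tokens, window = 1024, 9
mask = local_attention_mask(num_tokens, window)
full_pairs = num_tokens * num_tokens   # dense attention: N^2 score entries
local_pairs = int(mask.sum())          # neighborhood attention: at most N * k
print(full_pairs, local_pairs, num_tokens * window)
```

The number of allowed pairs grows linearly with the number of tokens for a fixed window, which is the source of the reduced attention cost.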

Title: ProDS: Preference-oriented Data Selection for Instruction Tuning

Authors: Wenya Guo, Zhengkun Zhang, Xumeng Liu, Ying Zhang, Ziyu Lu, Haoze Zhu, Xubo Liu, Ruxue Yan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.12754
Pdf URL: https://arxiv.org/pdf/2505.12754
Copy Paste: [[2505.12754]] ProDS: Preference-oriented Data Selection for Instruction Tuning(https://arxiv.org/abs/2505.12754)
Keywords: generation
Abstract: Instruction data selection aims to identify a high-quality subset from the training set that matches or exceeds the performance of the full dataset on target tasks. Existing methods focus on the instruction-to-response mapping, but neglect the human preference for diverse responses. In this paper, we propose a Preference-oriented Data Selection method (ProDS) that scores training samples based on their alignment with preferences observed in the target set. Our key innovation lies in shifting the data selection criteria from merely estimating features for accurate response generation to explicitly aligning training samples with human preferences in target tasks. Specifically, direct preference optimization (DPO) is employed to estimate human preferences across diverse responses. In addition, a bidirectional preference synthesis strategy is designed to score training samples according to both positive and negative preferences. Extensive experimental results demonstrate the superiority of our method over existing task-agnostic and targeted methods.

Title: A Study on the Refining Handwritten Font by Mixing Font Styles

Authors: Avinash Kumar, Kyeolhee Kang, Ammar ul Hassan, Jaeyoung Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12834
Pdf URL: https://arxiv.org/pdf/2505.12834
Copy Paste: [[2505.12834]] A Study on the Refining Handwritten Font by Mixing Font Styles(https://arxiv.org/abs/2505.12834)
Keywords: generative
Abstract: Handwritten fonts have a distinct expressive character, but they are often difficult to read due to unclear or inconsistent handwriting. FontFusionGAN (FFGAN) is a novel method for improving handwritten fonts by combining them with printed fonts. Our method uses a generative adversarial network (GAN) to generate fonts that mix the desirable features of handwritten and printed fonts. By training the GAN on a dataset of handwritten and printed fonts, it can generate legible and visually appealing font images. We apply our method to a dataset of handwritten fonts and demonstrate that it significantly enhances the readability of the original fonts while preserving their unique aesthetic. Our method has the potential to improve the readability of handwritten fonts, which would be helpful for a variety of applications including document creation, letter writing, and assisting individuals with reading and writing difficulties. In addition to addressing the difficulties of font creation for languages with complex character sets, our method is applicable to other text-image-related tasks, such as font attribute control and multilingual font style transfer.

Title: Accelerate TarFlow Sampling with GS-Jacobi Iteration

Authors: Ben Liu, Zhen Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12849
Pdf URL: https://arxiv.org/pdf/2505.12849
Copy Paste: [[2505.12849]] Accelerate TarFlow Sampling with GS-Jacobi Iteration(https://arxiv.org/abs/2505.12849)
Keywords: generation
Abstract: Image generation models have achieved widespread applications. For instance, the TarFlow model combines the transformer architecture with Normalizing Flow models, achieving state-of-the-art results on multiple benchmarks. However, due to the causal form of attention requiring sequential computation, TarFlow's sampling process is extremely slow. In this paper, we demonstrate that through a series of optimization strategies, TarFlow sampling can be greatly accelerated by using the Gauss-Seidel-Jacobi (abbreviated as GS-Jacobi) iteration method. Specifically, we find that blocks in the TarFlow model have varying importance: a small number of blocks play a major role in image generation tasks, while other blocks contribute relatively little; some blocks are sensitive to initial values and prone to numerical overflow, while others are relatively robust. Based on these two characteristics, we propose the Convergence Ranking Metric (CRM) and the Initial Guessing Metric (IGM): CRM is used to identify whether a TarFlow block is "simple" (converges in few iterations) or "tough" (requires more iterations); IGM is used to evaluate whether the initial value of the iteration is good. Experiments on four TarFlow models demonstrate that GS-Jacobi sampling can significantly enhance sampling efficiency while maintaining the quality of generated images (measured by FID), achieving speed-ups of 4.53x in Img128cond, 5.32x in AFHQ, 2.96x in Img64uncond, and 2.51x in Img64cond without degrading FID scores or sample quality. Code and checkpoints are accessible on this https URL
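
Illustration (not from the paper): the sequential bottleneck and the Jacobi-style remedy can be seen on a toy causal map. The sketch below assumes a hypothetical one-step dependence, x[t] = z[t] + 0.5 * tanh(x[t-1]), in place of a real TarFlow block, and uses plain Jacobi iteration; a Gauss-Seidel-Jacobi variant would additionally split the sequence into segments that are processed in order, with Jacobi sweeps inside each segment.

```python
import numpy as np


def causal_shift(x):
    """Toy causal term: position t depends only on position t - 1."""
    shift = np.zeros_like(x)
    shift[1:] = 0.5 * np.tanh(x[:-1])
    return shift


def sample_sequential(z):
    """Reference sampler: one position at a time, len(z) dependent steps."""
    x = np.zeros_like(z)
    for t in range(len(z)):
        prev = 0.5 * np.tanh(x[t - 1]) if t > 0 else 0.0
        x[t] = z[t] + prev
    return x


def sample_jacobi(z, tol=1e-10, max_iters=None):
    """Jacobi sampler: update every position in parallel until a fixed point.

    Because the dependence is causal, position t is exact after at most t
    sweeps, so len(z) sweeps always suffice to reach and confirm convergence.
    """
    max_iters = len(z) if max_iters is None else max_iters
    x = z.copy()
    for iteration in range(1, max_iters + 1):
        x_new = z + causal_shift(x)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, iteration
        x = x_new
    return x, max_iters


rng = np.random.default_rng(0)
z = rng.standard_normal(64)
x_jacobi, sweeps = sample_jacobi(z)
print(sweeps, np.max(np.abs(x_jacobi - sample_sequential(z))))
```

Each sweep is fully parallel across positions, so the iteration pays off whenever the number of sweeps needed is well below the sequence length.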

Title: Towards a Universal Image Degradation Model via Content-Degradation Disentanglement

Authors: Wenbo Yang, Zhongling Wang, Zhou Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.12860
Pdf URL: https://arxiv.org/pdf/2505.12860
Copy Paste: [[2505.12860]] Towards a Universal Image Degradation Model via Content-Degradation Disentanglement(https://arxiv.org/abs/2505.12860)
Keywords: restoration
Abstract: Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific or a narrow set of degradations, which often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the first universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model's accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video, code, and dataset of this project will be released upon publication at this http URL.

Title: PhyDA: Physics-Guided Diffusion Models for Data Assimilation in Atmospheric Systems

Authors: Hao Wang, Jindong Han, Wei Fan, Weijia Zhang, Hao Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.12882
Pdf URL: https://arxiv.org/pdf/2505.12882
Copy Paste: [[2505.12882]] PhyDA: Physics-Guided Diffusion Models for Data Assimilation in Atmospheric Systems(https://arxiv.org/abs/2505.12882)
Keywords: generative
Abstract: Data Assimilation (DA) plays a critical role in atmospheric science by reconstructing spatially continuous estimates of the system state, which serve as initial conditions for scientific analysis. While recent advances in diffusion models have shown great potential for DA tasks, most existing approaches remain purely data-driven and often overlook the physical laws that govern complex atmospheric dynamics. As a result, they may yield physically inconsistent reconstructions that impair downstream applications. To overcome this limitation, we propose PhyDA, a physics-guided diffusion framework designed to ensure physical coherence in atmospheric data assimilation. PhyDA introduces two key components: (1) a Physically Regularized Diffusion Objective that integrates physical constraints into the training process by penalizing deviations from known physical laws expressed as partial differential equations, and (2) a Virtual Reconstruction Encoder that bridges observational sparsity for structured latent representations, further enhancing the model's ability to infer complete and physically coherent states. Experiments on the ERA5 reanalysis dataset demonstrate that PhyDA achieves superior accuracy and better physical plausibility compared to state-of-the-art baselines. Our results emphasize the importance of combining generative modeling with domain-specific physical knowledge and show that PhyDA offers a promising direction for improving real-world data assimilation systems.

Title: TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

Authors: Yuanze Hu, Zhaoxin Fan, Xinyu Wang, Gen Li, Ye Qiu, Zhichao Yang, Wenjun Wu, Kejian Wu, Yifan Sun, Xiaotie Deng, Jin Dong
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.12884
Pdf URL: https://arxiv.org/pdf/2505.12884
Copy Paste: [[2505.12884]] TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks(https://arxiv.org/abs/2505.12884)
Keywords: generation
Abstract: Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, demonstrating that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment. Extensive empirical evaluations reveal that TinyAlign significantly reduces training loss, accelerates convergence, and enhances task performance. Remarkably, it allows models to achieve baseline-level performance with only 40% of the fine-tuning data, highlighting exceptional data efficiency. Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.

Title: ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling

Authors: Ege Özsoy, Chantal Pellegrini, David Bani-Harouni, Kun Yuan, Matthias Keicher, Nassir Navab
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12890
Pdf URL: https://arxiv.org/pdf/2505.12890
Copy Paste: [[2505.12890]] ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling(https://arxiv.org/abs/2505.12890)
Keywords: generation
Abstract: The real-world complexity of surgeries requires surgeons to have deep and holistic comprehension to ensure precision, safety, and effective interventions. Computational systems are required to have a similar level of comprehension within the operating room. Prior works, limited to single-task efforts like phase recognition or scene graph generation, lack scope and generalizability. In this work, we introduce ORQA, a novel OR question answering benchmark and foundational multimodal model to advance OR intelligence. By unifying all four public OR datasets into a comprehensive benchmark, we enable our approach to concurrently address a diverse range of OR challenges. The proposed multimodal large language model fuses diverse OR signals, such as visual, auditory, and structured data, for holistic modeling of the OR. Finally, we propose a novel, progressive knowledge distillation paradigm to generate a family of models optimized for different speed and memory requirements. We show the strong performance of ORQA on our proposed benchmark, and its zero-shot generalization, paving the way for scalable, unified OR modeling and significantly advancing multimodal surgical intelligence. We will release our code and data upon acceptance.

Title: Active Learning on Synthons for Molecular Design

Authors: Tom George Grigg, Mason Burlage, Oliver Brook Scott, Adam Taouil, Dominique Sydow, Liam Wilbraham
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.12913
Pdf URL: https://arxiv.org/pdf/2505.12913
Copy Paste: [[2505.12913]] Active Learning on Synthons for Molecular Design(https://arxiv.org/abs/2505.12913)
Keywords: generative
Abstract: Exhaustive virtual screening is highly informative but often intractable against the expensive objective functions involved in modern drug discovery. This problem is exacerbated in combinatorial contexts such as multi-vector expansion, where molecular spaces can quickly become ultra-large. Here, we introduce Scalable Active Learning via Synthon Acquisition (SALSA): a simple algorithm applicable to multi-vector expansion which extends pool-based active learning to non-enumerable spaces by factoring modeling and acquisition over synthon or fragment choices. Through experiments on ligand- and structure-based objectives, we highlight SALSA's sample efficiency, and its ability to scale to spaces of trillions of compounds. Further, we demonstrate application toward multi-parameter objective design tasks on three protein targets - finding SALSA-generated molecules have comparable chemical property profiles to known bioactives, and exhibit greater diversity and higher scores over an industry-leading generative approach.

Title: LatentINDIGO: An INN-Guided Latent Diffusion Algorithm for Image Restoration

Authors: Di You, Daniel Siromani, Pier Luigi Dragotti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.12935
Pdf URL: https://arxiv.org/pdf/2505.12935
Copy Paste: [[2505.12935]] LatentINDIGO: An INN-Guided Latent Diffusion Algorithm for Image Restoration(https://arxiv.org/abs/2505.12935)
Keywords: restoration
Abstract: There is a growing interest in the use of latent diffusion models (LDMs) for image restoration (IR) tasks due to their ability to effectively model the distribution of natural images. While significant progress has been made, there are still key challenges that need to be addressed. First, many approaches depend on a predefined degradation operator, making them ill-suited for complex or unknown degradations that deviate from standard analytical models. Second, many methods struggle to provide stable guidance in the latent space. Finally, most methods convert latent representations back to the pixel domain for guidance at every sampling iteration, which significantly increases computational and memory overhead. To overcome these limitations, we introduce a wavelet-inspired invertible neural network (INN) that simulates degradations through a forward transform and reconstructs lost details via the inverse transform. We further integrate this design into a latent diffusion pipeline through two proposed approaches: LatentINDIGO-PixelINN, which operates in the pixel domain, and LatentINDIGO-LatentINN, which stays fully in the latent space to reduce complexity. Both approaches alternate between updating intermediate latent variables under the guidance of our INN and refining the INN forward model to handle unknown degradations. In addition, a regularization step preserves the proximity of latent variables to the natural image manifold. Experiments demonstrate that our algorithm achieves state-of-the-art performance on synthetic and real-world low-quality images, and can be readily adapted to arbitrary output sizes.

Title: Leveraging LLM Inconsistency to Boost Pass@k Performance

Authors: Uri Dalal, Meirav Segal, Zvika Ben-Haim, Dan Lahav, Omer Nevo
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.12938
Pdf URL: https://arxiv.org/pdf/2505.12938
Copy Paste: [[2505.12938]] Leveraging LLM Inconsistency to Boost Pass@k Performance(https://arxiv.org/abs/2505.12938)
Keywords: generation
Abstract: Large language models (LLMs) achieve impressive abilities in numerous domains, but exhibit inconsistent performance in response to minor input changes. Rather than view this as a drawback, in this paper we introduce a novel method for leveraging models' inconsistency to boost Pass@k performance. Specifically, we present a "Variator" agent that generates k variants of a given task and submits one candidate solution for each one. Our variant generation approach is applicable to a wide range of domains as it is task agnostic and compatible with free-form inputs. We demonstrate the efficacy of our agent theoretically using a probabilistic model of the inconsistency effect, and show empirically that it outperforms the baseline on the APPS dataset. Furthermore, we establish that inconsistency persists even in frontier reasoning models across coding and cybersecurity domains, suggesting our method is likely to remain relevant for future model generations.
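
The intuition behind submitting one attempt per task variant can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the probabilistic model from the paper: attempts are treated as independent, the per-variant success rates in `variant_probs` are hypothetical, and the original task is assumed to have a success rate equal to their mean.

```python
import math


def pass_at_k_same_task(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds,
    when every attempt has the same success probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_variants(probs: list[float]) -> float:
    """Chance that at least one attempt succeeds when attempt i
    has its own success probability probs[i]."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# Hypothetical success rates on three variants of one task (assumption).
variant_probs = [0.05, 0.10, 0.45]
mean_p = sum(variant_probs) / len(variant_probs)

baseline = pass_at_k_same_task(mean_p, len(variant_probs))
varied = pass_at_k_variants(variant_probs)
print(f"same task, k=3: {baseline:.5f}")
print(f"one attempt per variant: {varied:.5f}")
```

Under these assumptions the spread in success rates can only help: by the AM-GM inequality, the product of the failure probabilities is at most the k-th power of their mean, so the variant strategy is never worse than repeating the mean-rate task.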

Title: Generative Modeling of Random Fields from Limited Data via Constrained Latent Flow Matching

Authors: James E. Warner, Tristan A. Shah, Patrick E. Leser, Geoffrey F. Bomarito, Joshua D. Pribe, Michael C. Stanley
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2505.13007
Pdf URL: https://arxiv.org/pdf/2505.13007
Copy Paste: [[2505.13007]] Generative Modeling of Random Fields from Limited Data via Constrained Latent Flow Matching(https://arxiv.org/abs/2505.13007)
Keywords: generative
Abstract: Deep generative models are promising tools for science and engineering, but their reliance on abundant, high-quality data limits applicability. We present a novel framework for generative modeling of random fields (probability distributions over continuous functions) that incorporates domain knowledge to supplement limited, sparse, and indirect data. The foundation of the approach is latent flow matching, where generative modeling occurs on compressed function representations in the latent space of a pre-trained variational autoencoder (VAE). Innovations include the adoption of a function decoder within the VAE and integration of physical/statistical constraints into the VAE training process. In this way, a latent function representation is learned that yields continuous random field samples satisfying domain-specific constraints when decoded, even in data-limited regimes. Efficacy is demonstrated on two challenging applications: wind velocity field reconstruction from sparse sensors and material property inference from a limited number of indirect measurements. Results show that the proposed framework achieves significant improvements in reconstruction accuracy compared to unconstrained methods and enables effective inference with relatively small training datasets that is intractable without constraints.

Title: LiBOG: Lifelong Learning for Black-Box Optimizer Generation

Authors: Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13025
Pdf URL: https://arxiv.org/pdf/2505.13025
Copy Paste: [[2505.13025]] LiBOG: Lifelong Learning for Black-Box Optimizer Generation(https://arxiv.org/abs/2505.13025)
Keywords: generation
Abstract: Meta-Black-Box Optimization (MetaBBO) garners attention due to its success in automating the configuration and generation of black-box optimizers, significantly reducing the human effort required for optimizer design and discovering optimizers with higher performance than classic human-designed optimizers. However, existing MetaBBO methods conduct one-off training under the assumption that a stationary problem distribution with extensive and representative training problem samples is available in advance. This assumption is often impractical in real-world scenarios, where diverse problems following shifting distributions continually arise. Consequently, there is a pressing need for methods that can continuously learn from new problems encountered on the fly and progressively enhance their capabilities. In this work, we explore a novel paradigm of lifelong learning in MetaBBO and introduce LiBOG, a novel approach designed to learn from sequentially encountered problems and generate high-performance optimizers for Black-Box Optimization (BBO). LiBOG consolidates knowledge both across tasks and within tasks to mitigate catastrophic forgetting. Extensive experiments demonstrate LiBOG's effectiveness in learning to generate high-performance optimizers in a lifelong learning manner, addressing catastrophic forgetting while maintaining plasticity to learn new tasks.

Title: RGB-to-Polarization Estimation: A New Task and Benchmark Study

Authors: Beibei Lin, Zifeng Yuan, Tingting Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13050
Pdf URL: https://arxiv.org/pdf/2505.13050
Copy Paste: [[2505.13050]] RGB-to-Polarization Estimation: A New Task and Benchmark Study(https://arxiv.org/abs/2505.13050)
Keywords: restoration, generative
Abstract: Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families -- such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.

Title: Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction

Authors: Yuanbo Wang, Zhaoxuan Zhang, Jiajin Qiu, Dilong Sun, Zhengyu Meng, Xiaopeng Wei, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13091
Pdf URL: https://arxiv.org/pdf/2505.13091
Copy Paste: [[2505.13091]] Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction(https://arxiv.org/abs/2505.13091)
Keywords: generation
Abstract: Diffusion models have made breakthroughs in 3D generation tasks. Current 3D diffusion models focus on reconstructing the target shape from images or a set of partial observations. While excelling in global context understanding, they struggle to capture the local details of complex shapes and are limited by occlusion and lighting conditions. To overcome these limitations, we utilize tactile images to capture local 3D information and propose a Touch2Shape model, which leverages a touch-conditioned diffusion model to explore and reconstruct the target shape from touch. For shape reconstruction, we have developed a touch embedding module to condition the diffusion model in creating a compact representation and a touch shape fusion module to refine the reconstructed shape. For shape exploration, we combine the diffusion model with reinforcement learning to train a policy. This involves using the generated latent vector from the diffusion model to guide the touch exploration policy training through a novel reward design. Experiments validate the reconstruction quality through both qualitative and quantitative analysis, and our touch exploration policy further boosts reconstruction performance.

Title: Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation

Authors: Sungmin Cha, Kyunghyun Cho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.13111
Pdf URL: https://arxiv.org/pdf/2505.13111
Copy Paste: [[2505.13111]] Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation(https://arxiv.org/abs/2505.13111)
Keywords: generation, generative
Abstract: Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented--enabling smaller student models to emulate the performance of much larger teachers--the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage--a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off proves especially beneficial in scenarios where sample quality outweighs diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
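
The precision-recall trade-off described above can be illustrated on the mixture weights of a toy teacher. This is a hedged sketch, not the paper's simulation: the parameter `beta` and the power-sharpening rule (raise each weight to `beta`, then renormalize) are assumptions chosen to stand in for a single entropy-controlling knob, and the three-component weights are hypothetical.

```python
import math


def sharpen(weights: list[float], beta: float) -> list[float]:
    """Raise each mixture weight to the power beta and renormalize.
    beta = 1 leaves the weights unchanged; beta > 1 makes them more selective."""
    powered = [w ** beta for w in weights]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(weights: list[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Hypothetical teacher with three mixture components (assumption).
teacher = [0.5, 0.3, 0.2]

for beta in (1.0, 2.0, 4.0):
    selective = sharpen(teacher, beta)
    print(f"beta={beta}: top weight={max(selective):.3f}, "
          f"entropy={entropy(selective):.3f}")
```

As `beta` grows, more mass moves to the dominant component (a proxy for precision) while entropy falls and the minor components fade (a proxy for lost recall).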

Title: Just Dance with $π$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection

Authors: Snehashis Majhi, Giacomo D'Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev, Francois Bremond
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13123
Pdf URL: https://arxiv.org/pdf/2505.13123
Copy Paste: [[2505.13123]] Just Dance with $π$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection(https://arxiv.org/abs/2505.13123)
Keywords: generation
Abstract: Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is due to the fact that RGB-features are not sufficiently distinctive in setting apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features by additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: "PI-VAD", a novel approach that augments RGB representations by five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, namely Pseudo-modality Generation module and Cross Modal Induction module, which generate modality-specific prototypical representation and, thereby, induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and necessitate five modality backbones -- only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.

Title: Adaptive Image Restoration for Video Surveillance: A Real-Time Approach

Authors: Muhammad Awais Amin, Adama Ilboudo, Abdul Samad bin Shahid, Amjad Ali, Waqas Haider Khan Bangyal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13130
Pdf URL: https://arxiv.org/pdf/2505.13130
Copy Paste: [[2505.13130]] Adaptive Image Restoration for Video Surveillance: A Real-Time Approach(https://arxiv.org/abs/2505.13130)
Keywords: restoration
Abstract: One of the major challenges in the field of computer vision, especially for detection, segmentation, recognition, monitoring, and automated solutions, is the quality of images. Image degradation, often caused by factors such as rain, fog, lighting, etc., has a negative impact on automated this http URL, several image restoration solutions exist, including restoration models for single degradation and restoration models for multiple degradations. However, these solutions are not suitable for real-time processing. In this study, the aim was to develop a real-time image restoration solution for video surveillance. To achieve this, using transfer learning with ResNet_50, we developed a model that automatically identifies the types of degradation present in an image in order to select the necessary treatment(s) for image restoration. Our solution has the advantage of being flexible and scalable.

Title: CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

Authors: Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13140
Pdf URL: https://arxiv.org/pdf/2505.13140
Copy Paste: [[2505.13140]] CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow(https://arxiv.org/abs/2505.13140)
Keywords: generative
Abstract: Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. With this two-stage design, and by caching results from the slow flow model computation, we build CacheFlow without loss of prediction accuracy or model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as the Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and prediction accuracy comparable to a SOTA method on Human3.6M. Our code and models will be publicly available.

Title: Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

Authors: Pengcheng Pan, Yonekura Shogo, Yasuo Kuniyoshi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13191
Pdf URL: https://arxiv.org/pdf/2505.13191
Copy Paste: [[2505.13191]] Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision(https://arxiv.org/abs/2505.13191)
Keywords: generation
Abstract: Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models such as the Recurrent Model of Visual Attention (RAM) and the Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention patterns that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose the Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling the functions of glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movements. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.

Title: True Zero-Shot Inference of Dynamical Systems Preserving Long-Term Statistics

Authors: Christoph Jürgen Hemmer, Daniel Durstewitz
Subjects: cs.LG, cs.AI, math.DS, nlin.CD
Abstract URL: https://arxiv.org/abs/2505.13192
Pdf URL: https://arxiv.org/pdf/2505.13192
Copy Paste: [[2505.13192]] True Zero-Shot Inference of Dynamical Systems Preserving Long-Term Statistics(https://arxiv.org/abs/2505.13192)
Keywords: generative
Abstract: Complex, temporally evolving phenomena, from climate to brain activity, are governed by dynamical systems (DS). DS reconstruction (DSR) seeks to infer generative surrogate models of these from observed data, reproducing their long-term behavior. Existing DSR approaches require purpose-training for any new system observed, lacking the zero-shot and in-context inference capabilities known from LLMs. Here we introduce DynaMix, a novel multivariate ALRNN-based mixture-of-experts architecture pre-trained for DSR, the first DSR model able to generalize zero-shot to out-of-domain DS. Just from a provided context signal, without any re-training, DynaMix faithfully forecasts the long-term evolution of novel DS where existing time series (TS) foundation models, like Chronos, fail -- at a fraction of the number of parameters and orders of magnitude faster inference times. DynaMix outperforms TS foundation models in terms of long-term statistics, and often also short-term forecasts, even on real-world time series, like traffic or weather data, typically used for training and evaluating TS models, but not at all part of DynaMix' training corpus. We illustrate some of the failure modes of TS models for DSR problems, and conclude that models built on DS principles may bear a huge potential also for advancing the TS prediction field.

Title: A Physics-Inspired Optimizer: Velocity Regularized Adam

Authors: Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Michael A. Osborne
Subjects: cs.LG, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2505.13196
Pdf URL: https://arxiv.org/pdf/2505.13196
Copy Paste: [[2505.13196]] A Physics-Inspired Optimizer: Velocity Regularized Adam(https://arxiv.org/abs/2505.13196)
Keywords: generation, generative
Abstract: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on quartic kinetic-energy terms and their stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate in the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of the loss. In contrast, VRAdam adds a higher-order penalty on the learning rate based on the velocity, such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations and allowing for a more aggressive base step size when necessary without divergence. By combining this velocity-based regularizer for global damping with the per-parameter scaling of Adam to create a hybrid optimizer, we demonstrate that VRAdam consistently outperforms standard optimizers including AdamW. We benchmark on various tasks such as image classification, language modeling, image generation and generative modeling, using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
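
The idea of a learning rate that shrinks as the velocity grows can be sketched as follows. This is an illustration only: the damping form `base_lr / (1 + beta * ||v||^2)`, the function name `effective_lr`, and the parameter `beta` are assumptions made for this example and are not taken from the paper, whose exact update rule is not given in the abstract.

```python
def effective_lr(base_lr: float, velocity: list[float], beta: float) -> float:
    """Hypothetical velocity-damped step size (assumed form, not the paper's):
    the base rate is divided by 1 + beta * ||velocity||^2, so large updates
    automatically receive a smaller learning rate."""
    speed_sq = sum(v * v for v in velocity)
    return base_lr / (1.0 + beta * speed_sq)


base_lr = 1e-3
beta = 0.5  # hypothetical damping strength

for velocity in ([0.0, 0.0], [0.1, 0.2], [3.0, 4.0]):
    lr = effective_lr(base_lr, velocity, beta)
    print(f"velocity={velocity} -> effective lr={lr:.6f}")
```

With zero velocity the base rate is recovered, while high-velocity steps are damped, which is the qualitative behavior the abstract describes.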

Title: MAGI-1: Autoregressive Video Generation at Scale

Authors: Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W.Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng Yin, Siran Zhang, Tingting Liu, Xianping Yin, Xiaoyu Yang, Xin Song, Xuan Hu, Yankai Zhang, Yuqiao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13211
Pdf URL: https://arxiv.org/pdf/2505.13211
Copy Paste: [[2505.13211]] MAGI-1: Autoregressive Video Generation at Scale(https://arxiv.org/abs/2505.13211)
Keywords: generation
Abstract: We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at this https URL and this https URL. The product can be accessed at this https URL.

Title: Swin DiT: Diffusion Transformer using Pseudo Shifted Windows

Authors: Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13219
Pdf URL: https://arxiv.org/pdf/2505.13219
Copy Paste: [[2505.13219]] Swin DiT: Diffusion Transformer using Pseudo Shifted Windows(https://arxiv.org/abs/2505.13219)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) achieve remarkable performance within the domain of image generation through the incorporation of the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global information modeling transformers, which face significant computational cost when processing high-resolution images. We empirically find that latent space image generation does not exhibit a strong dependence on global information as traditionally assumed. Most of the layers in the model demonstrate redundancy in global computation. In addition, conventional attention mechanisms exhibit low-frequency inertia issues. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global model redundancy. PSWA achieves intermediate global-local information interaction through window attention, while employing a high-frequency bridging branch to simulate shifted window operations, supplementing appropriate global and high-frequency information. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy that captures high-order attention similarity without additional computational cost. Building upon all of them, we propose a series of Pseudo Shifted Window DiTs (Swin DiT), accompanied by extensive experiments demonstrating their superior performance. For example, our proposed Swin-DiT-L achieves a 54% FID improvement over DiT-XL/2 while requiring less computation. this https URL

Title: WriteViT: Handwritten Text Generation with Vision Transformer

Authors: Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13235
Pdf URL: https://arxiv.org/pdf/2505.13235
Copy Paste: [[2505.13235]] WriteViT: Handwritten Text Generation with Vision Transformer(https://arxiv.org/abs/2505.13235)
Keywords: generation
Abstract: Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese -- a language rich in diacritics and complex typography -- remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.

Title: RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models

Authors: Le Vu Anh, Dinh Duc Nha Nguyen, Phi Long Nguyen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.13249
Pdf URL: https://arxiv.org/pdf/2505.13249
Copy Paste: [[2505.13249]] RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models(https://arxiv.org/abs/2505.13249)
Keywords: generation
Abstract: Large Language Models (LLMs) have become foundational in modern artificial intelligence, powering a wide range of applications from code generation and virtual assistants to scientific research and enterprise automation. However, concerns about data contamination--where test data overlaps with training data--have raised serious questions about the reliability of these applications. Despite awareness of this issue, existing methods fall short in effectively identifying or mitigating contamination. In this paper, we propose Residual-Noise Fingerprinting (RN-F), a novel framework for detecting contaminated data in LLMs. RN-F is a single-pass, gradient-free detection method that leverages residual signal patterns without introducing additional floating-point operations. Our approach is lightweight, model-agnostic, and efficient. We evaluate RN-F on multiple LLMs across various contaminated datasets and show that it consistently outperforms existing state-of-the-art methods, achieving performance improvements of up to 10.5% in contamination detection metrics.

Title: eStonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks

Authors: Jad Mansour, Sebastian Realpe, Hayat Rajani, Michele Grimaldi, Rafael Garcia, Nuno Gracias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13309
Pdf URL: https://arxiv.org/pdf/2505.13309
Copy Paste: [[2505.13309]] eStonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks(https://arxiv.org/abs/2505.13309)
Keywords: generation
Abstract: The combined use of event-based vision and Spiking Neural Networks (SNNs) is expected to significantly impact robotics, particularly in tasks like visual odometry and obstacle avoidance. While existing real-world event-based datasets for optical flow prediction, typically captured with Unmanned Aerial Vehicles (UAVs), offer valuable insights, they are limited in diversity and scalability and are challenging to collect. Moreover, there is a notable lack of labelled datasets for underwater applications, which hinders the integration of event-based vision with Autonomous Underwater Vehicles (AUVs). To address this, synthetic datasets could provide a scalable solution while bridging the gap between simulation and reality. In this work, we introduce eStonefish-scenes, a synthetic event-based optical flow dataset based on the Stonefish simulator. Along with the dataset, we present a data generation pipeline that enables the creation of customizable underwater environments. This pipeline allows for simulating dynamic scenarios, such as biologically inspired schools of fish exhibiting realistic motion patterns, including obstacle avoidance and reactive navigation around corals. Additionally, we introduce a scene generator that can build realistic reef seabeds by randomly distributing coral across the terrain. To streamline data accessibility, we present eWiz, a comprehensive library designed for processing event-based data, offering tools for data loading, augmentation, visualization, encoding, and training data generation, along with loss functions and performance metrics.

Title: Denoising Diffusion Probabilistic Model for Point Cloud Compression at Low Bit-Rates

Authors: Gabriele Spadaro, Alberto Presta, Jhony H. Giraldo, Marco Grangetto, Wei Hu, Giuseppe Valenzise, Attilio Fiandrotti, Enzo Tartaglione
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13316
Pdf URL: https://arxiv.org/pdf/2505.13316
Copy Paste: [[2505.13316]] Denoising Diffusion Probabilistic Model for Point Cloud Compression at Low Bit-Rates(https://arxiv.org/abs/2505.13316)
Keywords: generation
Abstract: Efficient compression of low-bit-rate point clouds is critical for bandwidth-constrained applications. However, existing techniques mainly focus on high-fidelity reconstruction, requiring many bits for compression. This paper proposes a "Denoising Diffusion Probabilistic Model" (DDPM) architecture for point cloud compression (DDPM-PCC) at low bit-rates. A PointNet encoder produces the condition vector for the generation, which is then quantized via a learnable vector quantizer. This configuration achieves low bitrates while preserving quality. Experiments on ShapeNet and ModelNet40 show improved rate-distortion at low rates compared to standardized and state-of-the-art approaches. We publicly released the code at this https URL.
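The quantization step mentioned in the abstract can be illustrated with a minimal sketch of plain nearest-neighbour vector quantization. This is not the paper's learnable quantizer: the codebook values, the `quantize` helper, and the input vectors are all hypothetical, chosen only to show why a small codebook yields a low bit cost per vector.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every vector and every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

# Hypothetical 3-entry codebook and three condition vectors (assumed values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = np.array([[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]])

idx, z_q = quantize(z, codebook)

# Only the index is transmitted, so the cost is ceil(log2(codebook size)) bits.
bits_per_vector = int(np.ceil(np.log2(len(codebook))))
print(idx, bits_per_vector)
```

In a learnable quantizer the codebook entries would be trained jointly with the encoder; here they are fixed so that the lookup itself is easy to follow.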

Title: VesselGPT: Autoregressive Modeling of Vascular Geometry

Authors: Paula Feldman, Martin Sinnona, Viviana Siless, Claudio Delrieux, Emmanuel Iarussi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2505.13318
Pdf URL: https://arxiv.org/pdf/2505.13318
Copy Paste: [[2505.13318]] VesselGPT: Autoregressive Modeling of Vascular Geometry(https://arxiv.org/abs/2505.13318)
Keywords: generation
Abstract: Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry makes accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous methods' parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code, data, and trained models will be made available.

Title: RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

Authors: Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen, Aysegul Dundar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13344
Pdf URL: https://arxiv.org/pdf/2505.13344
Copy Paste: [[2505.13344]] RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers(https://arxiv.org/abs/2505.13344)
Keywords: generation
Abstract: We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
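To make the "complex-exponential tensors of RoPE" concrete, the sketch below builds standard rotary phases and then shifts token positions by an offset before computing them. Adding offsets to positions is only one simple way to picture "warping"; it is an assumption for illustration, not the paper's procedure, and the `offsets` values are hypothetical stand-ins for flow-derived displacements.

```python
import numpy as np

def rope_phases(positions, dim, base=10000.0):
    """Complex exponentials exp(i * position * frequency), shape (n, dim // 2)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    pos = np.asarray(positions, dtype=float)
    return np.exp(1j * pos[:, None] * freqs[None, :])

def apply_rope(x, phases):
    """Rotate consecutive feature pairs of x (n, dim) by the given phases."""
    pairs = x[:, 0::2] + 1j * x[:, 1::2]
    rotated = pairs * phases
    out = np.empty_like(x)
    out[:, 0::2] = rotated.real
    out[:, 1::2] = rotated.imag
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, 8 features
positions = np.arange(4)
offsets = np.array([0.0, 0.5, -0.25, 1.0])   # hypothetical motion offsets

plain = apply_rope(x, rope_phases(positions, dim=8))
warped = apply_rope(x, rope_phases(positions + offsets, dim=8))
```

Because each feature pair is only rotated, the warp changes phases but leaves every token's norm untouched, which is why positional edits of this kind can be made without retraining the rest of the network.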

Title: One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling

Authors: Nimrod Berman, Ilan Naiman, Moshe Eliasof, Hedi Zisling, Omri Azencot
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13358
Pdf URL: https://arxiv.org/pdf/2505.13358
Copy Paste: [[2505.13358]] One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling(https://arxiv.org/abs/2505.13358)
Keywords: generation, generative
Abstract: Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory, a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. Empirically, KDM achieves state-of-the-art performance across standard offline distillation benchmarks, improving FID scores by up to 40% in a single generation step. All implementation details and code for the experimental setups are provided in our GitHub - this https URL, or in our project page - this https URL.
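The encode, propagate linearly, decode pattern described above can be sketched on a textbook toy system whose nonlinear dynamics become exactly linear after lifting. Everything here is an illustrative assumption: the map `step`, the hand-picked `lift` (standing in for a learned encoder), and the least-squares fit (standing in for training) are not taken from the paper.

```python
import numpy as np

a, b, c = 0.9, 0.5, 0.3  # hypothetical system constants

def step(x):
    """Nonlinear map: x1 -> a*x1, x2 -> b*x2 + c*x1**2."""
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

def lift(x):
    """Hand-picked 'encoder': observables in which the dynamics are linear."""
    return np.array([x[0], x[1], x[0] ** 2])

def decode(phi):
    """'Decoder': the first two observables are the state itself."""
    return phi[:2]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Phi = np.stack([lift(x) for x in X])
Phi_next = np.stack([lift(step(x)) for x in X])

# Least-squares fit of Phi @ K.T ~= Phi_next gives the linear operator K.
K_T, *_ = np.linalg.lstsq(Phi, Phi_next, rcond=None)
K = K_T.T

x0 = np.array([0.7, -1.2])
one_step = decode(K @ lift(x0))  # a single linear step in the lifted space
```

For this toy system the fitted `K` matches the exact operator `[[a, 0, 0], [0, b, c], [0, 0, a**2]]`, so one matrix multiply reproduces the nonlinear step. Real diffusion trajectories need a learned, higher-dimensional embedding.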

Title: Restoration Score Distillation: From Corrupted Diffusion Pretraining to One-Step High-Quality Generation

Authors: Yasi Zhang, Tianyu Chen, Zhendong Wang, Ying Nian Wu, Mingyuan Zhou, Oscar Leong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.13377
Pdf URL: https://arxiv.org/pdf/2505.13377
Copy Paste: [[2505.13377]] Restoration Score Distillation: From Corrupted Diffusion Pretraining to One-Step High-Quality Generation(https://arxiv.org/abs/2505.13377)
Keywords: restoration, generation, generative
Abstract: Learning generative models from corrupted data is a fundamental yet persistently challenging task across scientific disciplines, particularly when access to clean data is limited or expensive. Denoising Score Distillation (DSD) [chen2025denoising] recently introduced a novel and surprisingly effective strategy that leverages score distillation to train high-fidelity generative models directly from noisy observations. Building upon this foundation, we propose Restoration Score Distillation (RSD), a principled generalization of DSD that accommodates a broader range of corruption types, such as blurred, incomplete, or low-resolution images. RSD operates by first pretraining a teacher diffusion model solely on corrupted data and subsequently distilling it into a single-step generator that produces high-quality reconstructions. Empirically, RSD consistently surpasses its teacher model across diverse restoration tasks on both natural and scientific datasets. Moreover, beyond standard diffusion objectives, the RSD framework is compatible with several corruption-aware training techniques such as Ambient Tweedie, Ambient Diffusion, and its Fourier-space variant, enabling flexible integration with recent advances in diffusion modeling. Theoretically, we demonstrate that in a linear regime, RSD recovers the eigenspace of the clean data covariance matrix from linear measurements, thereby serving as an implicit regularizer. This interpretation recasts score distillation not only as a sampling acceleration technique but as a principled approach to enhancing generative performance in severely degraded data regimes.

Title: Faster Video Diffusion with Trainable Sparse Attention

Authors: Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13389
Pdf URL: https://arxiv.org/pdf/2505.13389
Copy Paste: [[2505.13389]] Faster Video Diffusion with Trainable Sparse Attention(https://arxiv.org/abs/2505.13389)
Keywords: generation
Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.

Title: Understanding Complexity in VideoQA via Visual Program Generation

Authors: Cristobal Eyzaguirre, Igor Vasiljevic, Achal Dave, Jiajun Wu, Rares Andrei Ambrus, Thomas Kollar, Juan Carlos Niebles, Pavel Tokmakov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13429
Pdf URL: https://arxiv.org/pdf/2505.13429
Copy Paste: [[2505.13429]] Understanding Complexity in VideoQA via Visual Program Generation(https://arxiv.org/abs/2505.13429)
Keywords: generation
Abstract: We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.

Title: Synthetic-Powered Predictive Inference

Authors: Meshi Bashari, Roy Maor Lotan, Yonghoon Lee, Edgar Dobriban, Yaniv Romano
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.13432
Pdf URL: https://arxiv.org/pdf/2505.13432
Copy Paste: [[2505.13432]] Synthetic-Powered Predictive Inference(https://arxiv.org/abs/2505.13432)
Keywords: generative
Abstract: Conformal prediction is a framework for predictive inference with a distribution-free, finite-sample guarantee. However, it tends to provide uninformative prediction sets when calibration data are scarce. This paper introduces Synthetic-powered predictive inference (SPPI), a novel framework that incorporates synthetic data -- e.g., from a generative model -- to improve sample efficiency. At the core of our method is a score transporter: an empirical quantile mapping that aligns nonconformity scores from trusted, real data with those from synthetic data. By carefully integrating the score transporter into the calibration process, SPPI provably achieves finite-sample coverage guarantees without making any assumptions about the real and synthetic data distributions. When the score distributions are well aligned, SPPI yields substantially tighter and more informative prediction sets than standard conformal prediction. Experiments on image classification and tabular regression demonstrate notable improvements in predictive efficiency in data-scarce settings.
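The scarcity problem the abstract refers to is visible in standard split conformal prediction, sketched below. This is the textbook baseline, not SPPI's score transporter; the function name and the calibration scores are hypothetical.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Threshold for split conformal prediction at miscoverage level alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    When that rank exceeds n, no finite threshold gives the guarantee.
    """
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # prediction set must contain everything
    return np.sort(cal_scores)[k - 1]

# Four calibration scores (assumed values), 50% target coverage.
q_ok = conformal_quantile(np.array([0.3, 0.1, 0.4, 0.2]), alpha=0.5)

# One calibration score, 75% target coverage: too little data.
q_scarce = conformal_quantile(np.array([0.3]), alpha=0.25)
```

With a single calibration point the threshold is infinite, so the prediction set covers the whole label space. This is the uninformative behaviour that motivates borrowing strength from synthetic scores.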

Title: FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance

Authors: Dian Shao, Mingfei Shi, Shengda Xu, Haodong Chen, Yongle Huang, Binglu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13437
Pdf URL: https://arxiv.org/pdf/2505.13437
Copy Paste: [[2505.13437]] FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance(https://arxiv.org/abs/2505.13437)
Keywords: generation
Abstract: Despite significant advances in video generation, synthesizing physically plausible human actions remains a persistent challenge, particularly in modeling fine-grained semantics and complex temporal dynamics. For instance, generating gymnastics routines such as "switch leap with 0.5 turn" poses substantial difficulties for current methods, often yielding unsatisfactory results. To bridge this gap, we propose FinePhys, a Fine-grained human action generation framework that incorporates Physics to obtain effective skeletal guidance. Specifically, FinePhys first estimates 2D poses in an online manner and then performs 2D-to-3D dimension lifting via in-context learning. To mitigate the instability and limited interpretability of purely data-driven 3D poses, we further introduce a physics-based motion re-estimation module governed by Euler-Lagrange equations, calculating joint accelerations via bidirectional temporal updating. The physically predicted 3D poses are then fused with data-driven ones, offering multi-scale 2D heatmap guidance for the diffusion process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys's ability to generate more natural and plausible fine-grained human actions.

Title: VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation

Authors: Huawei Lin, Tong Geng, Zhaozhuo Xu, Weijie Zhao
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.13439
Pdf URL: https://arxiv.org/pdf/2505.13439
Copy Paste: [[2505.13439]] VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation(https://arxiv.org/abs/2505.13439)
Keywords: generation
Abstract: Autoregressive (AR) models have recently shown strong performance in image generation, where a critical component is the visual tokenizer (VT) that maps continuous pixel inputs to discrete token sequences. The quality of the VT largely defines the upper bound of AR model performance. However, current discrete VTs fall significantly behind continuous variational autoencoders (VAEs), leading to degraded image reconstructions and poor preservation of details and text. Existing benchmarks focus on end-to-end generation quality, without isolating VT performance. To address this gap, we introduce VTBench, a comprehensive benchmark that systematically evaluates VTs across three core tasks: Image Reconstruction, Detail Preservation, and Text Preservation, and covers a diverse range of evaluation scenarios. We systematically assess state-of-the-art VTs using a set of metrics to evaluate the quality of reconstructed images. Our findings reveal that continuous VAEs produce superior visual representations compared to discrete VTs, particularly in retaining spatial structure and semantic detail. In contrast, the degraded representations produced by discrete VTs often lead to distorted reconstructions, loss of fine-grained textures, and failures in preserving text and object integrity. Furthermore, we conduct experiments on GPT-4o image generation and discuss its potential AR nature, offering new insights into the role of visual tokenization. We release our benchmark and codebase publicly to support further research and call on the community to develop strong, general-purpose open-source VTs.

Title: Mean Flows for One-step Generative Modeling

Authors: Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, Kaiming He
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.13447
Pdf URL: https://arxiv.org/pdf/2505.13447
Copy Paste: [[2505.13447]] Mean Flows for One-step Generative Modeling(https://arxiv.org/abs/2505.13447)
Keywords: generative
Abstract: We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
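The relation between average and instantaneous velocity mentioned in the abstract can be reconstructed in a few lines, assuming average velocity means displacement divided by elapsed time. The symbols $u$, $v$, $z_t$, $r$, and $t$ are notation chosen for this sketch; the abstract itself does not define them.

```latex
% Assumed definition: average velocity over the interval [r, t]
(t - r)\, u(z_t, r, t) = \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiate both sides with respect to t, holding r fixed
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) = v(z_t, t)

% Rearranged: average velocity in terms of instantaneous velocity
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t),
\qquad
\frac{d}{dt} u = v(z_t, t)\, \partial_z u + \partial_t u
```

The last expression uses $dz_t/dt = v(z_t, t)$ to expand the total derivative. An identity of this form involves no integral, which is what would allow it to serve as a regression target for a network.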