2025-04-22

Title: Generative System Dynamics in Recurrent Neural Networks

Authors: Michele Casoni, Tommaso Guidi, Alessandro Betti, Stefano Melacci, Marco Gori
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13951
Pdf URL: https://arxiv.org/pdf/2504.13951
Copy Paste: [[2504.13951]] Generative System Dynamics in Recurrent Neural Networks(https://arxiv.org/abs/2504.13951)
Keywords: generative
Abstract: In this study, we investigate the continuous time dynamics of Recurrent Neural Networks (RNNs), focusing on systems with nonlinear activation functions. The objective of this work is to identify conditions under which RNNs exhibit perpetual oscillatory behavior, without converging to static fixed points. We establish that skew-symmetric weight matrices are fundamental to enable stable limit cycles in both linear and nonlinear configurations. We further demonstrate that hyperbolic tangent-like activation functions (odd, bounded, and continuous) preserve these oscillatory dynamics by ensuring motion invariants in state space. Numerical simulations showcase how nonlinear activation functions not only maintain limit cycles, but also enhance the numerical stability of the system integration process, mitigating those instabilities that are commonly associated with the forward Euler method. The experimental results of this analysis highlight practical considerations for designing neural architectures capable of capturing complex temporal dependencies, i.e., strategies for enhancing memorization skills in recurrent models.
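The role of skew-symmetry and of the forward Euler method can be illustrated with a small sketch. It assumes the continuous-time model dx/dt = W·σ(x), which the abstract does not state explicitly; the 2×2 matrix, step size, and helper names are hypothetical choices made for illustration only. For σ = identity the invariant is ½‖x‖², and for σ = tanh it is Σ log cosh(xᵢ), since tanh(x)ᵀ W tanh(x) = 0 whenever Wᵀ = −W.

```python
import math

# Hypothetical 2x2 skew-symmetric weight matrix (W^T = -W).
W = [[0.0, 1.0], [-1.0, 0.0]]

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def euler_step(x, h, act):
    """One forward Euler step of the assumed model dx/dt = W @ act(x)."""
    drive = matvec(W, [act(xi) for xi in x])
    return [xi + h * di for xi, di in zip(x, drive)]

def energy_linear(x):
    """Invariant of the exact linear flow: 0.5 * ||x||^2."""
    return 0.5 * sum(xi * xi for xi in x)

def energy_tanh(x):
    """Invariant of the exact tanh flow: sum_i log(cosh(x_i))."""
    return sum(math.log(math.cosh(xi)) for xi in x)

def simulate(act, energy, x0, h, steps):
    """Integrate with forward Euler and record the invariant after each step."""
    x = list(x0)
    history = [energy(x)]
    for _ in range(steps):
        x = euler_step(x, h, act)
        history.append(energy(x))
    return history

lin = simulate(lambda v: v, energy_linear, [1.0, 0.0], 0.1, 200)
non = simulate(math.tanh, energy_tanh, [1.0, 0.0], 0.1, 200)

# Relative drift of each invariant caused by the discretization.
print("linear drift factor:", lin[-1] / lin[0])
print("tanh drift factor:  ", non[-1] / non[0])
```

For this particular W the linear invariant grows by exactly a factor (1 + h²) per Euler step, which is the instability the abstract refers to; the printed factors let the two cases be compared directly.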

Title: Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing

Authors: Mehmet Yamaç, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.13975
Pdf URL: https://arxiv.org/pdf/2504.13975
Copy Paste: [[2504.13975]] Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing(https://arxiv.org/abs/2504.13975)
Keywords: restoration
Abstract: Multilayer perceptrons (MLP), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications. The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at this https URL.
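A minimal sketch of a sum of Tucker-like mode products on a 2D input may help make the parameter savings concrete. It is a simplification, not the authors' MTS layer: the multiscale part is omitted, and the function names and toy factor matrices are hypothetical. Each term applies a mode-1 factor A and a mode-2 factor B to the input X as A·X·Bᵀ, and the terms are summed.

```python
def matmul(A, B):
    """Matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def mode_products(X, A, B):
    """Mode-1 and mode-2 products of a matrix X: A @ X @ B^T."""
    return matmul(matmul(A, X), transpose(B))

def tensor_summation(X, factors):
    """Sum of mode-product terms; `factors` is a list of (A, B) pairs."""
    terms = [mode_products(X, A, B) for A, B in factors]
    rows, cols = len(terms[0]), len(terms[0][0])
    return [[sum(t[i][j] for t in terms) for j in range(cols)] for i in range(rows)]

def dense_params(m, n, p, q):
    """Weights of an unfactorized dense map from an m x n input to a p x q output."""
    return (m * n) * (p * q)

def factorized_params(m, n, p, q, num_terms):
    """Weights of `num_terms` pairs of factors A (p x m) and B (q x n)."""
    return num_terms * (p * m + q * n)

X = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0], [0.0, 2.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
Y = tensor_summation(X, [(A, B), (A, B)])
print(Y)
print(dense_params(32, 32, 32, 32), factorized_params(32, 32, 32, 32, 4))
```

The two parameter counts printed at the end show why factorized operators of this kind scale better than a dense layer on the flattened input.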

Title: Entropy Rectifying Guidance for Diffusion and Flow Models

Authors: Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, Karteek Alahari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.13987
Pdf URL: https://arxiv.org/pdf/2504.13987
Copy Paste: [[2504.13987]] Entropy Rectifying Guidance for Diffusion and Flow Models(https://arxiv.org/abs/2504.13987)
Keywords: generation, generative
Abstract: Guidance techniques are commonly used in diffusion and flow models to improve image quality and consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) -- the most widely adopted guidance technique -- contrasts conditional and unconditional predictions to improve the generated images. This results, however, in trade-offs across quality, diversity and consistency, improving some at the expense of others. While recent work has shown that it is possible to disentangle these factors to some extent, such methods come with an overhead of requiring an additional (weaker) model, or require more forward passes per sampling step. In this paper, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance mechanism based on inference-time changes in the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements over image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it extends to unconditional sampling. ERG results in significant improvements in various generation tasks such as text-to-image, class-conditional and unconditional image generation. We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further boosting generation performance.
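For reference, the contrast between conditional and unconditional predictions that standard CFG performs can be sketched as below. This shows only the widely used CFG combination rule; it does not implement ERG, whose attention-level changes are not specified in the abstract. The function name and toy vectors are hypothetical.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Standard CFG: move from the unconditional prediction toward the
    conditional one, extrapolating when scale > 1."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Toy per-element predictions standing in for model outputs.
uncond_pred = [0.0, 0.0, 1.0]
cond_pred = [1.0, 2.0, 1.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, classifier_free_guidance(uncond_pred, cond_pred, scale))
```

A scale of 0 recovers the unconditional prediction and a scale of 1 the conditional one; larger scales push further along the difference, which is where the quality, diversity, and consistency trade-offs arise.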

Title: Deep Learning on Graphs for Mobile Network Topology Generation

Authors: Felix Nannesson Meli, Johan Tell, Shirwan Piroti, Tahar Zanouda, Elias Jarlebring
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.13991
Pdf URL: https://arxiv.org/pdf/2504.13991
Copy Paste: [[2504.13991]] Deep Learning on Graphs for Mobile Network Topology Generation(https://arxiv.org/abs/2504.13991)
Keywords: generation
Abstract: Mobile networks consist of interconnected radio nodes strategically positioned across various geographical regions to provide connectivity services. The set of relations between these radio nodes, referred to as the mobile network topology, is vital in the construction of the networking infrastructure. Typically, the connections between radio nodes and their associated cells are defined by software features that establish mobility relations (referred to as edges in this paper) within the mobile network graph through heuristic methods. Although these approaches are efficient, they encounter significant limitations, particularly since edges can only be established prior to the installation of physical hardware. In this work, we use graph-based deep learning methods to determine mobility relations (edges), trained on radio node configuration data and reliable mobility relations set by Automatic Neighbor Relations (ANR) in stable networks. This paper focuses on measuring the accuracy and precision of different graph-based deep learning approaches applied to real-world mobile networks. We evaluated two deep learning models. Our comprehensive experiments on Telecom datasets obtained from operational Telecom Networks demonstrate the effectiveness of the graph neural network (GNN) model and multilayer perceptron. Our evaluation showed that considering graph structure improves results, which motivates the use of GNNs. Additionally, we investigated the use of heuristics to reduce the training time based on the distance between radio nodes to eliminate irrelevant cases. Our investigation showed that the use of these heuristics improved precision and accuracy considerably.

Title: Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

Authors: Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.14011
Pdf URL: https://arxiv.org/pdf/2504.14011
Copy Paste: [[2504.14011]] Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation(https://arxiv.org/abs/2504.14011)
Keywords: generation, generative
Abstract: In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.

Title: A synthetic dataset of French electric load curves with temperature conditioning

Authors: Tahar Nabil, Ghislain Agoua, Pierre Cauchois, Anne De Moliner, Benoît Grossin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14046
Pdf URL: https://arxiv.org/pdf/2504.14046
Copy Paste: [[2504.14046]] A synthetic dataset of French electric load curves with temperature conditioning(https://arxiv.org/abs/2504.14046)
Keywords: generation
Abstract: The ongoing energy transition is causing behavioral changes in electricity use, e.g. with self-consumption of local generation, or flexibility services for demand control. To better understand these changes and the challenges they induce, accessing individual smart meter data is crucial. Yet this is personal data under the European GDPR. Widespread use of such data thus requires creating realistic, privacy-preserving synthetic samples. This paper introduces a new synthetic load curve dataset generated by conditional latent diffusion. We also provide the contracted power, time-of-use plan and local temperature used for generation. Fidelity, utility and privacy of the dataset are thoroughly evaluated, demonstrating its good quality and thereby supporting its interest for energy modeling applications.

Title: Personalizing Exposure Therapy via Reinforcement Learning

Authors: Athar Mahmoudi-Nejad, Matthew Guzdial, Pierre Boulanger
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14095
Pdf URL: https://arxiv.org/pdf/2504.14095
Copy Paste: [[2504.14095]] Personalizing Exposure Therapy via Reinforcement Learning(https://arxiv.org/abs/2504.14095)
Keywords: generation
Abstract: Personalized therapy, in which a therapeutic practice is adapted to an individual patient, can lead to improved health outcomes. Typically, this is accomplished by relying on a therapist's training and intuition along with feedback from a patient. However, this requires the therapist to become an expert on any technological components, such as in the case of Virtual Reality Exposure Therapy (VRET). While there exist approaches to automatically adapt therapeutic content to a patient, they generally rely on hand-authored, pre-defined rules, which may not generalize to all individuals. In this paper, we propose an approach to automatically adapt therapeutic content to patients based on physiological measures. We implement our approach in the context of virtual reality arachnophobia exposure therapy, and rely on experience-driven procedural content generation via reinforcement learning (EDPCGRL) to generate virtual spiders to match an individual patient. Through a human subject study, we demonstrate that our system significantly outperforms a more common rules-based method, highlighting its potential for enhancing personalized therapeutic interventions.

Title: Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models

Authors: Zhenyu Yu, Mohd Yamani Idna Idris, Pei Wang, Yuelong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14108
Pdf URL: https://arxiv.org/pdf/2504.14108
Copy Paste: [[2504.14108]] Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models(https://arxiv.org/abs/2504.14108)
Keywords: generative
Abstract: We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios.

Title: BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution

Authors: Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14129
Pdf URL: https://arxiv.org/pdf/2504.14129
Copy Paste: [[2504.14129]] BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution(https://arxiv.org/abs/2504.14129)
Keywords: generative
Abstract: The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in the vision modality, and other modalities such as text and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen generators in a fine-grained manner. In this paper, we propose a novel bi-modal guided multi-perspective representation learning (BMRL) framework for zero-shot deepfake attribution (ZS-DFA), which facilitates effective traceability to unseen generators. Specifically, we design a multi-perspective visual encoder (MPVE) to explore general deepfake attribution visual characteristics across three views (i.e., image, noise, and edge). We devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via vision-parsing matching. A language encoder is proposed to capture fine-grained language embeddings, facilitating language-guided general visual forgery representation learning through vision-language alignment. Additionally, we present a novel deepfake attribution contrastive center (DFACC) loss, to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results demonstrate that our method outperforms the state of the art on the ZS-DFA task across various evaluation protocols.

Title: Transforming hyperspectral images into chemical maps: A new deep learning based approach to hyperspectral image processing

Authors: Ole-Christian Galbo Engstrøm, Michela Albano-Gaglio, Erik Schou Dreier, Yamine Bouzembrak, Maria Font-i-Furnols, Puneet Mishra, Kim Steenstrup Pedersen
Subjects: cs.CV, cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.14131
Pdf URL: https://arxiv.org/pdf/2504.14131
Copy Paste: [[2504.14131]] Transforming hyperspectral images into chemical maps: A new deep learning based approach to hyperspectral image processing(https://arxiv.org/abs/2504.14131)
Keywords: generation
Abstract: Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. We compare the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error 9% to 13% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates finely detailed chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.53% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
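The comparison metric quoted in the abstract, a relative reduction in root mean squared error, can be computed as in the sketch below. The fat percentages and predictions are made-up toy values, not data from the paper, and the helper names are hypothetical; the range check mirrors the physically possible 0-100% interval mentioned above.

```python
import math

def rmse(targets, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error

def fraction_out_of_range(values, low=0.0, high=100.0):
    """Share of predictions outside the physically possible range."""
    return sum(1 for v in values if v < low or v > high) / len(values)

# Toy mean-fat reference values (percent) and two sets of hypothetical predictions.
reference = [30.0, 42.0, 55.0, 38.0]
baseline_pred = [34.0, 40.0, 61.0, 33.0]
model_pred = [32.0, 41.0, 58.0, 36.0]

baseline_rmse = rmse(reference, baseline_pred)
model_rmse = rmse(reference, model_pred)
print("relative RMSE reduction:", relative_reduction(baseline_rmse, model_rmse))
print("out-of-range share:", fraction_out_of_range([-5.0, 20.0, 104.0, 60.0]))
```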

Title: Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

Authors: Hangyu Liu, Bo Peng, Pengxiang Ding, Donglin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14137
Pdf URL: https://arxiv.org/pdf/2504.14137
Copy Paste: [[2504.14137]] Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach(https://arxiv.org/abs/2504.14137)
Keywords: generation, generative
Abstract: Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. Existing generative approaches for multi-target attacks mainly analyze the effect of the use of target labels on noise generation from a theoretical perspective, lacking practical validation and comprehensive summarization. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (2D-TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments on the standard ImageNet dataset demonstrate that 2D-TGAF consistently surpasses state-of-the-art methods in attack success rates, both on normally trained models and across various defense mechanisms.

Title: Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

Authors: Zichuan Liu, Liming Jiang, Qing Yan, Yumin Jia, Hao Kang, Xin Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14202
Pdf URL: https://arxiv.org/pdf/2504.14202
Copy Paste: [[2504.14202]] Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis(https://arxiv.org/abs/2504.14202)
Keywords: generation
Abstract: We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.

Title: Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Authors: Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14245
Pdf URL: https://arxiv.org/pdf/2504.14245
Copy Paste: [[2504.14245]] Towards Explainable Fake Image Detection with Multi-Modal Large Language Models(https://arxiv.org/abs/2504.14245)
Keywords: generation
Abstract: Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at this https URL.

Title: Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

Authors: Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14249
Pdf URL: https://arxiv.org/pdf/2504.14249
Copy Paste: [[2504.14249]] Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation(https://arxiv.org/abs/2504.14249)
Keywords: restoration
Abstract: Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. We examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82% in parameters and 85% in FLOPs. Our code will be available at our Project page (this https URL).

Title: ColorVein: Colorful Cancelable Vein Biometrics

Authors: Yifan Wang, Jie Gui, Xinli Shi, Linqing Gui, Yuan Yan Tang, James Tin-Yau Kwok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14253
Pdf URL: https://arxiv.org/pdf/2504.14253
Copy Paste: [[2504.14253]] ColorVein: Colorful Cancelable Vein Biometrics(https://arxiv.org/abs/2504.14253)
Keywords: generation
Abstract: Vein recognition technologies have become one of the primary solutions for high-security identification systems. However, the issue of biometric information leakage can still pose a serious threat to user privacy and anonymity. Currently, there is no cancelable biometric template generation scheme specifically designed for vein biometrics. Therefore, this paper proposes an innovative cancelable vein biometric generation scheme: ColorVein. Unlike previous cancelable template generation schemes, ColorVein does not destroy the original biometric features and introduces additional color information to grayscale vein images. This method significantly enhances the information density of vein images by transforming static grayscale information into dynamically controllable color representations through interactive colorization. ColorVein allows users/administrators to define a controllable pseudo-random color space for grayscale vein images by editing the position, number, and color of hint points, thereby generating protected cancelable templates. Additionally, we propose a new secure center loss to optimize the training process of the protected feature extraction model, effectively increasing the feature distance between enrolled users and any potential impostors. Finally, we evaluate ColorVein's performance on all types of vein biometrics, including recognition performance, unlinkability, irreversibility, and revocability, and conduct security and privacy analyses. ColorVein achieves competitive performance compared with state-of-the-art methods.
摘要：静脉识别技术已成为高安全性识别系统的主要解决方案之一。但是，生物识别信息泄漏的问题仍然可能对用户隐私和匿名性构成严重威胁。当前，没有专门为静脉生物识别技术设计的可取消生物识别模板生成方案。因此，本文提出了一种创新的可取消静脉生物特征生成方案：Colorvein。与以前的可取消模板生成方案不同，Colorvein不会破坏原始的生物特征特征，并将其他颜色信息引入灰度静脉图像。通过将静态灰度信息通过交互式着色将静态灰度信息转换为动态控制的色彩表示，该方法可显着增强静脉图像的信息密度。 Colorvein允许用户/管理员通过编辑提示点的位置，数字和颜色，从而为灰度静脉图像定义可控的伪随机颜色空间，从而生成可保护的可取消模板。此外，我们提出了一种新的安全中心损失，以优化受保护特征提取模型的训练过程，从而有效地增加了注册用户和任何潜在冒名顶替者之间的特征距离。最后，我们评估了Colorvein在所有类型的静脉生物识别技术上的性能，包括识别性能，不链接性，不可逆性和可竞争性，以及进行安全性和隐私分析。与最先进的方法相比，Colorvein实现了竞争性能。

Title: Cross-attention for State-based model RWKV-7

Authors: Liu Xiao, Li Zhiyuan, Lin Yueyu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14260
Pdf URL: https://arxiv.org/pdf/2504.14260
Copy Paste: [[2504.14260]] Cross-attention for State-based model RWKV-7(https://arxiv.org/abs/2504.14260)
Keywords: generation
Abstract: We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Frechet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state this http URL at this https URL
摘要：我们介绍了CrossWKV，这是一种基于状态的RWKV-7模型的新型跨注意机制，旨在增强文本对图像生成的表现力。 CrossWKV利用RWKV-7的线性复杂性加权键值（WKV）架构，将文本和图像模式集成在单个通道中，利用具有广义的Delta规则与矢量价值的门控和低级适应（LORA），以实现出色的跨模量值。与基于变压器的模型不同，CrossWKV的非对角线的输入依赖性过渡矩阵使其能够代表$ \ Mathrm {Tc}^0 $复杂性类别以外的复杂功能，包括所有常规语言，包括所有常规语言，其能力通过执行$ S_5 $ SONSTONT OPTRANT OPTRANT OPTRANT OPTRAIN MONDON MODERING诸如$ s_5 $ permoty模型。在LAION-5B和IMAGENET等数据集对RWKV-7（DIR-7）的扩散中进行了评估，CrossWKV的Frechet Inception Inception距离（FID）为2.88，剪辑得分为0.33，Imagenet 256x256的剪辑得分为0.33，在提供互补的跨性别的一般性提示的同时，在匹配的是互补的。该模型的增强表达性，结合恒定的内存使用和线性缩放，将其定位为高级跨模式任务的强大解决方案，并在此HTTPS URL处具有高分辨率生成和动态状态的潜在应用和动态状态。
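
The abstract above mentions a generalized delta rule for the recurrent state. As a rough illustration only, the sketch below implements the classic (non-generalized) delta rule on a small key-value state matrix in pure Python. The function names `delta_rule_step` and `read`, and the scalar `beta`, are hypothetical. The vector-valued gating and LoRA components named in the abstract are not modeled, so this is not CrossWKV's actual update.

```python
def delta_rule_step(S, k, v, beta=1.0):
    """One classic delta-rule update: S <- S - beta * (S k - v) k^T.

    S is a list of rows (d_v x d_k), k is a key vector, v is a value vector.
    """
    # Current prediction of the value stored under key k.
    pred = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]
    # Move the prediction toward v along the direction of k.
    return [
        [s_ij - beta * (p_i - v_i) * k_j for s_ij, k_j in zip(row, k)]
        for row, p_i, v_i in zip(S, pred, v)
    ]


def read(S, q):
    """Read out the state with a query vector: o = S q."""
    return [sum(s_ij * q_j for s_ij, q_j in zip(row, q)) for row in S]


# Write a value under a unit-norm key, then overwrite it with a new value.
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_rule_step(S, [1.0, 0.0], [3.0, 4.0])
first = read(S, [1.0, 0.0])
S = delta_rule_step(S, [1.0, 0.0], [5.0, 6.0])
second = read(S, [1.0, 0.0])
```

With a unit-norm key and `beta=1.0`, the second write replaces the stored value rather than adding to it, which is the property that distinguishes delta-rule states from purely additive linear-attention states.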

Title: Generative emulation of chaotic dynamics with coherent prior

Authors: Juan Nathaniel, Pierre Gentine
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.14264
Pdf URL: https://arxiv.org/pdf/2504.14264
Copy Paste: [[2504.14264]] Generative emulation of chaotic dynamics with coherent prior(https://arxiv.org/abs/2504.14264)
Keywords: generation, generative
Abstract: Data-driven emulation of nonlinear dynamics is challenging due to long-range skill decay that often produces physically unrealistic outputs. Recent advances in generative modeling aim to address these issues by providing uncertainty quantification and correction. However, the quality of generated simulation remains heavily dependent on the choice of conditioning priors. In this work, we present an efficient generative framework for dynamics emulation, unifying principles of turbulence with diffusion-based modeling: Cohesion. Specifically, our method estimates large-scale coherent structure of the underlying dynamics as guidance during the denoising process, where small-scale fluctuation in the flow is then resolved. These coherent priors are efficiently approximated using reduced-order models, such as deep Koopman operators, that allow for rapid generation of long prior sequences while maintaining stability over extended forecasting horizons. With this gain, we can reframe forecasting as trajectory planning, a common task in reinforcement learning, where conditional denoising is performed once over entire sequences, minimizing the computational cost of autoregressive-based generative methods. Empirical evaluations on chaotic systems of increasing complexity, including Kolmogorov flow, shallow water equations, and subseasonal-to-seasonal climate dynamics, demonstrate Cohesion's superior long-range forecasting skill and show that it can efficiently generate physically consistent simulations, even in the presence of partially observed guidance.
摘要：由于远程技能衰减通常会产生身体上不切实际的输出，因此数据驱动的非线性动力学仿真是具有挑战性的。生成建模的最新进展旨在通过提供不确定性量化和纠正来解决这些问题。但是，生成的模拟的质量仍然在很大程度上取决于条件先验的选择。在这项工作中，我们提出了一个有效的动力学仿真生成框架，以基于扩散的建模来统一湍流原理：内聚力。具体而言，我们的方法估计了基础动力学的大规模相干结构，作为在转化过程中的指导，然后解决流动中的小规模波动。这些连贯的先验使用降低的模型（例如深koopman操作员）有效地近似，从而可以快速生成长期的先前序列，同时维持稳定性在扩展的预测范围内。有了这一收益，我们可以将预测作为轨迹计划进行重新构架，这是强化学习中的一项常见任务，在整个序列中，进行有条件的降解，从而最大程度地减少了基于自动回归的生成方法的计算成本。对增加复杂性的混乱系统的经验评估，包括Kolmogorov流动，浅水方程以及季节至季节的气候动态，也证明了凝聚力上的较高的远程预测技能，即使在有部分观察到的部分观测指导的情况下，也可以有效地产生物理固定的模拟。

Title: Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction

Authors: Li Yu, Xuanzhe Sun, Wei Zhou, Moncef Gabbouj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14267
Pdf URL: https://arxiv.org/pdf/2504.14267
Copy Paste: [[2504.14267]] Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction(https://arxiv.org/abs/2504.14267)
Keywords: generation
Abstract: Video saliency prediction is crucial for downstream applications, such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide the gaze of viewers to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, we attempt to simultaneously analyze visual, auditory, and textual modalities in this paper, and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To effectively utilize text, a large multimodal model is used to generate textual descriptions for video frames, and a saliency-oriented image-text response (SITR) mechanism is introduced to generate image-text response maps. These maps are used as conditional information to guide the model to localize the visual regions that are semantically related to the textual description. The auditory modality is used as further conditional information, directing the model to focus on salient regions indicated by sounds. At the same time, the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level. To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving by 1.03%, 2.35%, 2.71% and 0.33% on the SIM, CC, NSS and AUC-J metrics, respectively.
摘要：视频显着性预测对于下游应用至关重要，例如视频压缩和人类计算机的相互作用。随着多模式学习的蓬勃发展，研究人员开始探索多模式的视频显着性预测，包括视听和文本视频方法。听觉提示指导观众的视线来源，而文本提示为理解视频内容提供了语义指导。整合这些互补线索可以提高显着性预测的准确性。因此，我们尝试在本文中同时分析视觉，听觉和文本方式，并提出Tavdiff，这是视频显着性预测的文本审计 - 视觉条件的扩散模型。 Tavdiff将视频显着性预测视为以文本，音频和视觉输入为条件的图像生成任务，并通过逐步降级来预测显着性图。为了有效地利用文本，使用大型多模式模型来生成视频帧的文本描述，并引入面向显着的图像文本响应（SITR）机制来生成图像 - 文本响应映射。它被用作有条件信息，以指导模型定位与文本描述相关的视觉区域。关于听觉方式，它被用作另一种条件信息，以指导模型专注于声音指示的显着区域。同时，由于扩散变压器（DIT）直接将条件信息与时间段相结合，这可能会影响噪声水平的估计。为了获得有效的有条件指导，我们提出了显着性，这将有条件的信息与时间段相关。实验结果表明，Tavdiff在SIM，CC，NSS和AUC-J指标上分别超过了现有方法，提高了1.03 \％，2.35 \％，2.71 \％和0.33 \％。
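
The SIM, CC, and NSS metrics reported in this entry have standard definitions in the saliency literature. The following is a minimal pure-Python sketch of those definitions for small 2D maps given as nested lists. The function names are hypothetical, and the paper's own evaluation code may differ in preprocessing such as resizing or smoothing. Constant maps (zero variance or zero sum) are not handled.

```python
import math


def _flatten(m):
    """Turn a 2D map (list of rows) into a flat list of floats."""
    return [float(x) for row in m for x in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sg = math.sqrt(sum((b - mg) ** 2 for b in g))
    return cov / (sp * sg)


def sim(pred, gt):
    """Histogram intersection after normalising each map to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))


def nss(pred, fixations):
    """Mean z-scored prediction at fixated pixels (fixations is a binary map)."""
    p, f = _flatten(pred), _flatten(fixations)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((a - mean) ** 2 for a in p) / len(p))
    hits = [(a - mean) / std for a, fi in zip(p, f) if fi > 0]
    return sum(hits) / len(hits)


pred = [[0.1, 0.2], [0.3, 0.4]]
fix = [[0, 0], [0, 1]]  # a single fixation on the most salient pixel
scores = (cc(pred, pred), sim(pred, pred), nss(pred, fix))
```

CC and SIM compare the prediction against a continuous ground-truth map, whereas NSS compares it against discrete fixation locations, which is why `nss` takes a binary map.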

Title: Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

Authors: Shouwei Ruan, Zhenyu Wu, Yao Huang, Ruochen Zhang, Yitong Sun, Caixin Kang, Xingxing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14290
Pdf URL: https://arxiv.org/pdf/2504.14290
Copy Paste: [[2504.14290]] Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization(https://arxiv.org/abs/2504.14290)
Keywords: generation
Abstract: Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I models. SC-DPO integrates safety constraints into the general human preference calibration, aiming to maximize the likelihood of generating human-preferred samples while minimizing the safety cost of the generated outputs. In SC-DPO, we introduce a safety cost model to accurately quantify harmful levels for images, and train it effectively using the proposed contrastive learning and cost anchoring objectives. To apply SC-DPO for effective T2I safety alignment, we constructed SCP-10K, a safety-constrained preference dataset containing rich harmful concepts, which blends safety-constrained preference pairs under both harmful and clean instructions, further mitigating the trade-off between safety and sample quality. Additionally, we propose a Dynamic Focusing Mechanism (DFM) for SC-DPO, promoting the model's learning of difficult preference pair samples. Extensive experiments demonstrate that SC-DPO outperforms existing methods, effectively defending against various NSFW content while maintaining optimal sample quality and human preference alignment. Additionally, SC-DPO exhibits resilience against adversarial prompts designed to generate harmful content.
摘要：确保生成内容的安全仍然是文本到图像（T2i）一代的基本挑战。现有的研究要么无法确保在潜在有害概念下完全安全，要么难以在安全质量与发电质量之间取得平衡。为了解决这些问题，我们提出了安全约束的直接偏好优化（SC-DPO），这是T2I模型中安全对齐的新型框架。 SC-DPO将安全限制整合到一般的人类偏好校准中，旨在最大程度地提高产生人类优先样本的可能性，同时最大程度地减少产生的产出的安全成本。在SC-DPO中，我们引入了一个安全成本模型，以准确量化图像的有害水平，并使用所提出的对比度学习和成本锚定目标有效地训练它。为了应用SC-DPO以进行有效的T2I安全对准，我们构建了SCP-10K，这是一个包含丰富有害概念的安全性偏好数据集，在有害和干净的说明下融合了安全限制的偏好对，进一步缓解了安全质量和样品质量之间的权衡。此外，我们提出了用于SC-DPO的动态聚焦机制（DFM），从而促进了模型对困难偏好对样品的学习。广泛的实验表明，SC-DPO的表现优于现有方法，有效地防御各种NSFW含量，同时保持最佳样本质量和人类偏好一致性。此外，SC-DPO还针对旨在产生有害内容的对抗提示表现出弹性。
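
For context, SC-DPO builds on Direct Preference Optimization. The sketch below shows only the standard per-pair DPO loss in pure Python. The abstract does not specify how the safety cost enters the objective, so no safety term is included. The function name `dpo_loss` and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-probability of the sample under the policy being trained
    ref_logp_* : log-probability of the same sample under the frozen reference
    """
    # Implicit reward margin between the preferred (w) and rejected (l) sample.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)        # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favours the preferred sample
```

The loss falls below its neutral value of log 2 once the policy raises the preferred sample's likelihood relative to the reference more than it raises the rejected sample's.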

Title: Learning and Generating Diverse Residential Load Patterns Using GAN with Weakly-Supervised Training and Weight Selection

Authors: Xinyu Liang, Hao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14300
Pdf URL: https://arxiv.org/pdf/2504.14300
Copy Paste: [[2504.14300]] Learning and Generating Diverse Residential Load Patterns Using GAN with Weakly-Supervised Training and Weight Selection(https://arxiv.org/abs/2504.14300)
Keywords: generation, generative
Abstract: The scarcity of high-quality residential load data can pose obstacles for decarbonizing the residential sector as well as for effective grid planning and operation. These challenges have motivated research into generating synthetic load data, but existing methods face limitations in terms of scalability, diversity, and similarity. This paper proposes a Generative Adversarial Network-based Synthetic Residential Load Pattern (RLP-GAN) generation model, a novel weakly-supervised GAN framework, leveraging an over-complete autoencoder to capture dependencies within complex and diverse load patterns and learn household-level data distribution at scale. We incorporate a model weight selection method to address the mode collapse problem and generate load patterns with high diversity. We develop a holistic evaluation method to validate the effectiveness of RLP-GAN using real-world data from 417 households. The results demonstrate that RLP-GAN outperforms state-of-the-art models in capturing temporal dependencies and generating load patterns with higher similarity to real data. Furthermore, we have publicly released the RLP-GAN generated synthetic dataset, which comprises one million synthetic residential load pattern profiles.
摘要：高质量的住宅负载数据的稀缺性可能会构成脱碳，以及有效的网格计划和操作。上述挑战促使研究促进了生成合成负载数据，但是现有的方法在可伸缩性，多样性和相似性方面面临限制。本文提出了一种基于生成的对抗网络的合成住宅负载模式（RLP-GAN）生成模型，这是一种新型的弱监督的GAN框架，利用过度完整的自动编码器来捕获复杂和多样化的负载模式中的依赖性，并以规模的规模学习家庭级别数据分布。我们合并了一种模型重量选择方法，以解决模式崩溃问题并产生高度多样性的负载模式。我们开发了一种整体评估方法，以使用417个家庭的实际数据来验证RLP-GAN的有效性。结果表明，RLP-GAN在捕获时间依赖性和生成与真实数据相似性更高的载荷模式方面优于最先进的模型。此外，我们已公开发布了RLP-GAN生成的合成数据集，该数据集包括100万个合成的住宅负载模式剖面。

Title: Manipulating Multimodal Agents via Cross-Modal Prompt Injection

Authors: Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14348
Pdf URL: https://arxiv.org/pdf/2504.14348
Copy Paste: [[2504.14348]] Manipulating Multimodal Agents via Cross-Modal Prompt Injection(https://arxiv.org/abs/2504.14348)
Keywords: generative
Abstract: The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this work, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach consists of two key components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to infer the black-box defensive system prompt through adversarial meta prompting and generate a malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms existing injection attacks, achieving an increase of at least 26.4% in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications.
摘要：多模式大语言模型的出现通过将语言和视觉方式与外部数据源集成，使代理人能够更好地解释人类指令并执行日益复杂的任务，从而重新定义了代理范式。但是，在这项工作中，我们确定了多模式代理中的关键但以前被忽视的安全漏洞：跨模式提示注射攻击。为了利用这种漏洞，我们提出了一个新颖的攻击框架，其中攻击者嵌入了多种模式的对抗扰动，以与目标恶意内容保持一致，从而允许外部说明劫持代理商的决策过程并执行未经授权的任务。我们的方法由两个关键组成部分组成。首先，我们介绍了视觉潜在对齐，我们根据文本到图像生成模型在视觉嵌入空间中优化对抗性特征，以确保对对抗性图像的巧妙编码恶意任务执行的提示。随后，我们提出文本指导增强，其中利用大型语言模型通过对抗元提示并生成恶意的文本命令来推断黑盒防御系统提示，该命令将代理的输出引导到更好地遵守攻击者的请求。广泛的实验表明，我们的方法的表现优于现有的注射攻击，在各种任务中至少提高了攻击成功率 +26.4％。此外，我们验证了攻击在现实世界多模式自动企业中的有效性，从而强调了其对安全至关重要应用的潜在影响。

Title: Improving RL Exploration for LLM Reasoning through Retrospective Replay

Authors: Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.14363
Pdf URL: https://arxiv.org/pdf/2504.14363
Copy Paste: [[2504.14363]] Improving RL Exploration for LLM Reasoning through Retrospective Replay(https://arxiv.org/abs/2504.14363)
Keywords: generation
Abstract: Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model's ability to revisit and re-explore these ideas later. Consequently, although the LLM's capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.
摘要：强化学习（RL）越来越多地成为大型语言模型（LLMS）培训后的关键技术。对输出空间的有效探索对于RL的成功至关重要。我们观察到，对于复杂的问题，在培训的早期阶段，该模型表现出强大的探索功能，并可以识别出有希望的解决方案想法。但是，在此阶段，其功能有限可阻止其成功解决这些问题。政策梯度对这些潜在有价值的解决方案思想的早期抑制阻碍了模型重新访问和重新探索这些想法的能力。因此，尽管LLM的功能在培训的后期阶段有所提高，但它仍然难以有效解决这些复杂的问题。为了解决这个探索问题，我们提出了一种名为“回顾性重放的强化学习”（RRL）的新型算法，该算法在整个训练过程中引入了动态重播机制。 RRL使该模型能够在早期阶段重新审视有前途的状态，从而提高其在勘探方面的效率和有效性。为了评估RRL的有效性，我们对复杂的推理任务（包括数学推理和代码生成以及一般对话任务）进行了广泛的实验。结果表明，RRL在整个培训期间保持较高的勘探效率，从而显着提高了RL在优化复杂推理任务的LLM方面的有效性。此外，它还提高了RLHF的性能，从而使模型更安全，更有帮助。

Title: Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data

Authors: Shlomi Hod, Lucas Rosenblatt, Julia Stoyanovich
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.14368
Pdf URL: https://arxiv.org/pdf/2504.14368
Copy Paste: [[2504.14368]] Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data(https://arxiv.org/abs/2504.14368)
Keywords: generation
Abstract: Differentially private (DP) machine learning often relies on the availability of public data for tasks like privacy-utility trade-off estimation, hyperparameter tuning, and pretraining. While public data assumptions may be reasonable in text and image domains, they are less likely to hold for tabular data due to tabular data heterogeneity across domains. We propose leveraging powerful priors to address this limitation; specifically, we synthesize realistic tabular data directly from schema-level specifications - such as variable names, types, and permissible ranges - without ever accessing sensitive records. To that end, this work introduces the notion of "surrogate" public data - datasets generated independently of sensitive data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata. Surrogate public data are intended to encode plausible statistical assumptions (informed by publicly available information) into a dataset with many downstream uses in private mechanisms. We automate the process of generating surrogate public data with large language models (LLMs); in particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records. Through extensive experiments, we demonstrate that surrogate public tabular data can effectively replace traditional public data when pretraining differentially private tabular classifiers. To a lesser extent, surrogate public data are also useful for hyperparameter tuning of DP synthetic data generators, and for estimating the privacy-utility tradeoff.
摘要：差异化（DP）机器学习通常依赖于公共数据的可用性，例如隐私 - 私人权衡估算，超参数调整和训练。尽管公共数据假设在文本和图像域中可能是合理的，但由于跨域之间的表格数据异质性，它们不太可能保留表格数据。我们建议利用强大的先验来解决这一限制。具体而言，我们直接从模式级别的规格（例如可变名称，类型和允许的范围）中直接合成现实的表格数据，而无需访问敏感记录。为此，这项工作介绍了“替代”公共数据的概念 - 独立于敏感数据而生成的数据集，这些数据无需消耗隐私损失预算，并且仅由公共可用的模式或元数据构建。替代公共数据旨在将合理的统计假设（通过公开可用信息告知）到具有许多下游用途的私人机制中的数据集中。我们自动化使用大语言模型（LLM）生成替代公共数据的过程；特别是，我们提出了两种方法：作为CSV文件的直接记录生成以及用于采样记录的自动化结构因果模型（SCM）构建。通过广泛的实验，我们证明了替代公共表格数据可以在鉴定私人表格分类器时有效替代传统的公共数据。在较小程度上，替代公共数据也可用于DP合成数据生成器的高参数调整，并估算隐私 - 耐用性权衡。

Title: SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

Authors: Minho Park, Taewoong Kang, Jooyeol Yun, Sungwon Hwang, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14396
Pdf URL: https://arxiv.org/pdf/2504.14396
Copy Paste: [[2504.14396]] SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation(https://arxiv.org/abs/2504.14396)
Keywords: generation
Abstract: The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available at this https URL.
摘要：对AR/VR应用的需求不断增长，这突显了对高质量360度全景内容的需求。然而，由于Equirectangular投影（ERP）引起的严重扭曲，产生高质量的360度全景图像和视频仍然是一项艰巨的任务。现有方法在有限的ERP数据集上进行了微调预估计的扩散模型，或者尝试尝试仍然依赖ERP潜在表示的无调方法，从而导致杆附近的不连续性。在本文中，我们介绍了Spherediff，这是一种使用最先进的扩散模型的无缝360度全景图像和视频生成的新方法，而无需进行其他调整。我们定义了一个球形潜在表示，该表示可确保在所有角度上均匀分布，从而减轻ERP固有的扭曲。我们将多个扩散扩展到球形潜在空间，并提出了一种球形潜伏方法，以直接使用预验证的扩散模型。此外，我们引入了平均变形的平均值，以进一步提高投影过程中的发电质量。我们的方法在维持高保真度的同时，在生成360度全景内容方面的现有方法优于现有方法，这是对沉浸式AR/VR应用的强大解决方案。该代码可在此处使用。此HTTPS URL

Title: ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations

Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14429
Pdf URL: https://arxiv.org/pdf/2504.14429
Copy Paste: [[2504.14429]] ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations(https://arxiv.org/abs/2504.14429)
Keywords: generation
Abstract: Large Language Models (LLMs) have transformed natural language processing (NLP) tasks, but they suffer from hallucination, generating plausible yet factually incorrect content. This issue extends to Video-Language Models (VideoLLMs), where textual descriptions may inaccurately represent visual content, resulting in multi-modal hallucinations. In this paper, we address hallucination in ResNetVLLM, a video-language model combining ResNet visual encoders with LLMs. We introduce a two-step protocol: (1) a faithfulness detection strategy that uses a modified Lynx model to assess semantic alignment between generated captions and ground-truth video references, and (2) a hallucination mitigation strategy using Retrieval-Augmented Generation (RAG) with an ad-hoc knowledge base dynamically constructed during inference. Our enhanced model, ResNetVLLM-2, reduces multi-modal hallucinations by cross-verifying generated content against external knowledge, improving factual consistency. Evaluation on the ActivityNet-QA benchmark demonstrates a substantial accuracy increase from 54.8% to 65.3%, highlighting the effectiveness of our hallucination detection and mitigation strategies in enhancing video-language model reliability.
摘要：大型语言模型（LLM）已转变自然语言处理（NLP）任务，但它们遭受了幻觉的困扰，产生了合理但实际上不正确的内容。此问题扩展到视频语言模型（视频学），文本描述可能不准确地表示视觉内容，从而导致多模式幻觉。在本文中，我们解决了Resnetvllm中的幻觉，这是一种视频语言模型，将Resnet视觉编码器与LLMS结合在一起。我们介绍了两步协议：（1）一种忠实的检测策略，该策略使用修改后的LYNX模型评估生成的字幕和地面真实视频参考之间的语义对齐，以及（2）使用检索仪（RAG）的幻觉缓解策略（RAG），并具有在参与期间动态构建的Ad-Hoc知识基础。我们增强的模型Resnetvllm-2，通过针对外部知识进行交叉验证生成的内容来减少多模式幻觉，从而提高了事实的一致性。对ActivityNet-QA基准的评估表明，从54.8％增加到65.3％，强调了我们的幻觉检测和缓解策略在增强视频语言模型可靠性方面的有效性。

Title: Causal Disentanglement for Robust Long-tail Medical Image Generation

Authors: Weizhi Nie, Zichun Zhang, Weijie Wang, Bruno Lepri, Anan Liu, Nicu Seb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14450
Pdf URL: https://arxiv.org/pdf/2504.14450
Copy Paste: [[2504.14450]] Causal Disentanglement for Robust Long-tail Medical Image Generation(https://arxiv.org/abs/2504.14450)
Keywords: generation
Abstract: Counterfactual medical image generation effectively addresses data scarcity and enhances the interpretability of medical images. However, due to the complex and diverse pathological features of medical images and the imbalanced class distribution in medical data, generating high-quality and diverse medical images from limited data is significantly challenging. Additionally, the information in limited data, such as anatomical structure information, must be fully leveraged to generate more structurally stable medical images while avoiding distortion or inconsistency. In this paper, in order to enhance the clinical relevance of generated data and improve the interpretability of the model, we propose a novel medical image generation framework, which generates independent pathological and structural features based on causal disentanglement and utilizes text-guided modeling of pathological features to regulate the generation of counterfactual images. First, we achieve feature separation through causal disentanglement and analyze the interactions between features. Here, we introduce group supervision to ensure the independence of pathological and identity features. Second, we leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images. Meanwhile, we enhance accuracy by leveraging a large language model to extract lesion severity and location from medical reports. Additionally, we improve the performance of the latent diffusion model on long-tailed categories through initial noise optimization.
摘要：反事实医学图像生成有效地解决了数据稀缺性并增强了医学图像的解释性。但是，由于医学图像的复杂和多样化的病理特征以及医疗数据中的类不平衡分布，从有限数据中产生高质量和多样化的医学图像非常具有挑战性。此外，要充分利用有限数据的信息，例如解剖结构信息，并在避免失真或不一致的同时产生更稳定的医学图像。在本文中，为了增强生成数据的临床相关性并提高了模型的解释性，我们提出了一个新型的医学图像生成框架，该框架基于因果关系分解，生成独立的病理和结构特征，并利用病理学特征的文本指导建模来调节反事实图像的产生。首先，我们通过因果分离实现特征分离，并分析特征之间的相互作用。在这里，我们介绍小组监督，以确保病理和身份特征的独立性。其次，我们利用以病理发现为指导的扩散模型来建模病理特征，从而能够产生不同的反事实图像。同时，我们通过利用大型语言模型从医疗报告中提取病变严重性和位置来提高准确性。此外，我们通过初始噪声优化提高了长尾类别上潜在扩散模型的性能。

Title: LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation

Authors: Jiachen Li, Qing Xie, Xiaohan Yu, Hongyun Wang, Jinyu Xu, Yongjian Liu, Yongsheng Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14467
Pdf URL: https://arxiv.org/pdf/2504.14467
Copy Paste: [[2504.14467]] LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation(https://arxiv.org/abs/2504.14467)
Keywords: generation, generative
Abstract: Zero-shot referring image segmentation aims to locate and segment the target region based on a referring expression, with the primary challenge of aligning and matching semantics across visual and textual modalities without training. Previous works address this challenge by utilizing Vision-Language Models and mask proposal networks for region-text matching. However, this paradigm may lead to incorrect target localization due to the inherent ambiguity and diversity of free-form referring expressions. To alleviate this issue, we present LGD (Leveraging Generative Descriptions), a framework that utilizes the advanced language generation capabilities of Multi-Modal Large Language Models to enhance region-text matching performance in Vision-Language Models. Specifically, we first design two kinds of prompts, the attribute prompt and the surrounding prompt, to guide the Multi-Modal Large Language Models in generating descriptions related to the crucial attributes of the referent object and the details of surrounding objects, referred to as attribute description and surrounding description, respectively. Secondly, three visual-text matching scores are introduced to evaluate the similarity between instance-level visual features and textual features, which determines the mask most associated with the referring expression. The proposed method achieves new state-of-the-art performance on three public datasets RefCOCO, RefCOCO+ and RefCOCOg, with maximum improvements of 9.97% in oIoU and 11.29% in mIoU compared to previous methods.
摘要：零摄像的参考图像分割旨在根据参考表达来定位和细分目标区域，而在没有训练的情况下，在视觉和文本方式上对齐和匹配语义的主要挑战。以前的工作通过利用视觉语言模型和蒙版提案网络来应对该挑战，以进行区域文本匹配。但是，由于自由形式的表达式的固有歧义和多样性，该范式可能导致目标定位不正确。为了减轻此问题，我们提出了LGD（利用生成描述），该框架利用多模式大语言模型的先进语言生成能力来增强视觉模型中的区域文本匹配性能。具体而言，我们首先设计了两种提示，即属性提示和周围的提示，以指导多模式的大语言模型，以生成与引用对象的关键属性相关的描述和周围对象的详细信息，分别称为属性描述和周围描述。其次，引入了三个视觉文本匹配分数，以评估实例级别的视觉特征和文本特征之间的相似性，这决定了与参考表达式最相关的掩码。提出的方法可在三个公共数据集reccoco，refcoco+和reccocog上实现新的最先进性能，与以前的方法相比，OIOU的最大提高为9.97％，MIOU的最大提高为11.29％。
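
The oIoU and mIoU figures quoted in this entry are both intersection-over-union scores that aggregate differently: oIoU pools intersections and unions over the whole dataset, while mIoU averages per-sample IoU. The sketch below illustrates that difference in pure Python on binary masks given as nested lists. The function names are hypothetical, and scoring an empty union as 1.0 in `miou` is an assumed convention rather than something the abstract states.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def miou(pairs):
    """Mean IoU: average of the per-sample IoU values."""
    ious = []
    for pred, gt in pairs:
        inter, union = iou_counts(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def oiou(pairs):
    """Overall IoU: total intersection divided by total union."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = iou_counts(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # small object, IoU = 1/2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # large object, IoU = 4/4
]
mean_score, overall_score = miou(pairs), oiou(pairs)
```

Because oIoU pools pixels, large objects dominate it, whereas mIoU weights every sample equally. In this example mIoU is 0.75 and oIoU is 5/6.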

Title: Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Authors: Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14470
Pdf URL: https://arxiv.org/pdf/2504.14470
Copy Paste: [[2504.14470]] Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis(https://arxiv.org/abs/2504.14470)
Keywords: generation, generative
Abstract: Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.
摘要：随着对超清晰视觉效果的消费者期望，对2K视频合成的需求正在增加。尽管扩散变压器（DIT）在高质量的视频生成中表现出了显着的功能，但由于记忆和处理成本的二次增长，将它们缩放到2K分辨率仍然在计算上保持了刺激性。在这项工作中，我们提出了Turbo2K，这是一个高效且实用的框架，用于生成富含详细信息的2K视频，同时显着提高培训和推理效率。首先，Turbo2K在高度压缩的潜在空间中运行，从而降低了计算复杂性和内存足迹，从而使高分辨率视频合成可行。但是，VAE的高压比和有限的模型大小对生成质量施加了限制。为了减轻这种情况，我们引入了一种知识蒸馏策略，使较小的学生模型可以继承更大，更强大的教师模型的生成能力。我们的分析表明，尽管潜在空间和体系结构有所不同，但DIT在其内部表示中表现出结构相似性，从而促进了有效的知识转移。其次，我们设计了一个分层的两阶段合成框架，该框架在指导高分辨率视频生成之前首先在较低分辨率上生成多层次功能。这种方法可确保结构上的连贯性和细粒细节的完善，同时消除了冗余编码的开销，进一步提高了计算效率。Turbo2k实现了最先进的效率，产生了5秒钟，24fps，2K视频，并显着降低了计算成本。与现有方法相比，Turbo2K的推理速度最高20美元$ \ times $，这使得高分辨率视频生成更具扩展性和实用性对于实际应用程序。

Title: STARS: Sparse Learning Correlation Filter with Spatio-temporal Regularization and Super-resolution Reconstruction for Thermal Infrared Target Tracking

Authors: Shang Zhang, Xiaobo Ding, Huanbin Zhang, Ruoyan Xiong, Yue Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14491
Pdf URL: https://arxiv.org/pdf/2504.14491
Copy Paste: [[2504.14491]] STARS: Sparse Learning Correlation Filter with Spatio-temporal Regularization and Super-resolution Reconstruction for Thermal Infrared Target Tracking(https://arxiv.org/abs/2504.14491)
Keywords: super-resolution
Abstract: Thermal infrared (TIR) target tracking methods often adopt the correlation filter (CF) framework due to its computational efficiency. However, the low resolution of TIR images, along with tracking interference, significantly limits the performance of TIR trackers. To address these challenges, we introduce STARS, a novel sparse learning-based CF tracker that incorporates spatio-temporal regularization and super-resolution reconstruction. First, we apply adaptive sparse filtering and temporal domain filtering to extract key features of the target while reducing interference from background clutter and noise. Next, we introduce an edge-preserving sparse regularization method to stabilize target features and prevent excessive blurring. This regularization integrates multiple terms and employs the alternating direction method of multipliers to optimize the solution. Finally, we propose a gradient-enhanced super-resolution method to extract fine-grained TIR target features and improve the resolution of TIR images, addressing performance degradation in tracking caused by low-resolution sequences. To the best of our knowledge, STARS is the first to integrate super-resolution methods within a sparse learning-based CF framework. Extensive experiments on the LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that STARS outperforms state-of-the-art trackers in terms of robustness.

Title: Less is More: Adaptive Coverage for Synthetic Training Data

Authors: Sasan Tavakkol, Max Springer, Mohammadhossein Bateni, Neslihan Bulut, Vincent Cohen-Addad, MohammadTaghi Hajiaghayi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14508
Pdf URL: https://arxiv.org/pdf/2504.14508
Copy Paste: [[2504.14508]] Less is More: Adaptive Coverage for Synthetic Training Data(https://arxiv.org/abs/2504.14508)
Keywords: generation
Abstract: Synthetic training data generation with Large Language Models (LLMs) like Google's Gemma and OpenAI's GPT offers a promising solution to the challenge of obtaining large, labeled datasets for training classifiers. When rapid model deployment is critical, such as in classifying emerging social media trends or combating new forms of online abuse tied to current events, the ability to generate training data is invaluable. While prior research has examined the comparability of synthetic data to human-labeled data, this study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset. This "less is more" approach not only improves accuracy but also reduces the volume of data required, leading to potentially more efficient model fine-tuning.
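
Illustrative sketch (not from the paper): the abstract only says that the sampling algorithm is based on the maximum coverage problem. The standard greedy heuristic for that problem is shown below as a minimal example. The function name greedy_max_coverage, the candidates mapping, and the idea that each synthetic example "covers" a set of element ids are all assumptions made for illustration; the paper's actual coverage definition may differ.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(
    candidates: Dict[Hashable, Set[Hashable]], k: int
) -> List[Hashable]:
    """Greedily pick up to k candidates that cover the most new elements.

    `candidates` maps a (hypothetical) example id to the set of elements
    it covers. This is the textbook greedy heuristic, not the paper's code.
    """
    covered: Set[Hashable] = set()
    chosen: List[Hashable] = []
    remaining = dict(candidates)
    for _ in range(k):
        if not remaining:
            break
        # Candidate with the largest number of not-yet-covered elements.
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered; stop early
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


if __name__ == "__main__":
    pool = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
    print(greedy_max_coverage(pool, 2))
```

The early stop is what makes a "less is more" subset possible in this sketch: once every element is covered, further examples are not selected even if the budget k allows them.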

Title: FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models

Authors: Kuanting Wu, Kei Ota, Asako Kanezaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14535
Pdf URL: https://arxiv.org/pdf/2504.14535
Copy Paste: [[2504.14535]] FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models(https://arxiv.org/abs/2504.14535)
Keywords: generative
Abstract: Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle with producing temporally coherent motion. Optical flow supervision is a promising approach to address this, with prior works commonly employing warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos. To account for the unreliability of flow estimation under high-noise conditions in diffusion, we propose a noise-aware weighting scheme that modulates the flow loss across denoising steps. Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.
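
Illustrative sketch (not from the paper): the abstract describes a flow loss that compares flow fields directly and a noise-aware weight that modulates it across denoising steps, but gives no formula. The example below shows one plausible shape of such a scheme. The power-law weight, the gamma parameter, the use of mean endpoint error, and all function names are hypothetical choices for illustration only.

```python
from typing import List, Tuple

Flow = List[Tuple[float, float]]  # one (u, v) displacement per pixel


def endpoint_error(flow_a: Flow, flow_b: Flow) -> float:
    """Mean Euclidean distance between corresponding flow vectors."""
    if len(flow_a) != len(flow_b) or not flow_a:
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(
        ((ua - ub) ** 2 + (va - vb) ** 2) ** 0.5
        for (ua, va), (ub, vb) in zip(flow_a, flow_b)
    )
    return total / len(flow_a)


def noise_aware_weight(t: int, num_steps: int, gamma: float = 1.0) -> float:
    """Hypothetical weight: 1 at t=0 (clean), 0 at t=num_steps (pure noise)."""
    if not 0 <= t <= num_steps:
        raise ValueError("t must lie in [0, num_steps]")
    return (1.0 - t / num_steps) ** gamma


def weighted_flow_loss(
    gen_flow: Flow, gt_flow: Flow, t: int, num_steps: int, gamma: float = 1.0
) -> float:
    """Down-weight the flow term when flow estimates are least reliable."""
    return noise_aware_weight(t, num_steps, gamma) * endpoint_error(gen_flow, gt_flow)
```

The design intent in this sketch is only that the flow term contributes less at high noise levels, where flow extracted from a noisy sample is unreliable.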

Title: VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control

Authors: Lifeng Lin, Rongfeng Lu, Quan Chen, Haofan Ren, Ming Lu, Yaoqi Sun, Chenggang Yan, Anke Xue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14548
Pdf URL: https://arxiv.org/pdf/2504.14548
Copy Paste: [[2504.14548]] VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control(https://arxiv.org/abs/2504.14548)
Keywords: generation, generative
Abstract: Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on the 3D Gaussian Splatting (3DGS) framework have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduce VGNC, a novel Validation-guided Gaussian Number Control (VGNC) approach based on generative novel view synthesis (NVS) models. To the best of our knowledge, this is the first attempt to alleviate the overfitting issue of sparse-view 3DGS with generative validation images. Specifically, we first introduce a validation image generation method based on a generative NVS model. We then propose a Gaussian number control strategy that utilizes generated validation images to determine the optimal Gaussian numbers, thereby reducing the issue of overfitting. We conducted detailed experiments on various sparse-view 3DGS baselines and datasets to evaluate the effectiveness of VGNC. Extensive experiments show that our approach not only reduces overfitting but also improves rendering quality on the test set while decreasing the number of Gaussian points. This reduction lowers storage demands and accelerates both training and rendering. The code will be released.

Title: NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

Authors: Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14582
Pdf URL: https://arxiv.org/pdf/2504.14582
Copy Paste: [[2504.14582]] NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results(https://arxiv.org/abs/2504.14582)
Keywords: restoration, super-resolution
Abstract: This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and the methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
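
Illustrative sketch (not from the paper): the restoration track ranks submissions by PSNR. The standard definition, PSNR = 10 * log10(MAX^2 / MSE), can be written as below. The flat pixel lists, the 8-bit max_value default, and the function name are assumptions; the challenge's exact protocol (for example colour space or border cropping) is not stated in the abstract and is not reproduced here.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    restored: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the restored image is closer to the reference pixel by pixel, which is why this metric suits the restoration track rather than the perceptual one.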

Title: Using street view imagery and deep generative modeling for estimating the health of urban forests

Authors: Akshit Gupta, Remko Uijlenhoet
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2504.14583
Pdf URL: https://arxiv.org/pdf/2504.14583
Copy Paste: [[2504.14583]] Using street view imagery and deep generative modeling for estimating the health of urban forests(https://arxiv.org/abs/2504.14583)
Keywords: generative
Abstract: Healthy urban forests comprising diverse trees and shrubs play a crucial role in mitigating climate change. They provide several key advantages, such as providing shade for energy conservation and intercepting rainfall to reduce flood runoff and soil erosion. Traditional approaches for monitoring the health of urban forests require instrumented inspection techniques, often involving a high amount of human labor and subjective evaluations. As a result, they are not scalable for cities that lack extensive resources. Recent approaches involving multi-spectral imaging data based on terrestrial sensing and satellites are constrained by challenges related to dedicated deployments and limited spatial resolution, respectively. In this work, we propose an alternative approach for monitoring urban forests using simplified inputs: street view imagery, tree inventory data and meteorological conditions. We propose to use image-to-image translation networks to estimate two urban forest health parameters, namely, NDVI and CTD. Finally, we aim to compare the generated results with ground truth data using an onsite campaign utilizing handheld multi-spectral and thermal imaging sensors. With the advent and expansion of street view imagery platforms such as Google Street View and Mapillary, this approach should enable effective management of urban forests for the authorities in cities at scale.
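
Illustrative sketch (not from the paper): NDVI, one of the two health parameters named above, is conventionally computed per pixel from near-infrared and red reflectance as (NIR - Red) / (NIR + Red). The function name and the choice to raise on a zero denominator are assumptions; the abstract does not describe how the authors compute their ground-truth NDVI.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel.

    `nir` and `red` are non-negative reflectance values. The result lies in
    [-1, 1]; dense healthy vegetation typically yields higher values.
    """
    denominator = nir + red
    if denominator == 0:
        raise ValueError("NDVI is undefined when nir + red is zero")
    return (nir - red) / denominator


if __name__ == "__main__":
    # Vegetation reflects strongly in NIR and absorbs red light.
    print(ndvi(0.5, 0.1))
```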

Title: Generative Auto-Bidding with Value-Guided Explorations

Authors: Jingtong Gao, Yewen Li, Shuai Mao, Peng Jiang, Nan Jiang, Yejing Wang, Qingpeng Cai, Fei Pan, Peng Jiang, Kun Gai, Bo An, Xiangyu Zhao
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2504.14587
Pdf URL: https://arxiv.org/pdf/2504.14587
Copy Paste: [[2504.14587]] Generative Auto-Bidding with Value-Guided Explorations(https://arxiv.org/abs/2504.14587)
Keywords: generative
Abstract: Auto-bidding, with its strong capability to optimize bidding decisions within dynamic and competitive online environments, has become a pivotal strategy for advertising platforms. Existing approaches typically employ rule-based strategies or Reinforcement Learning (RL) techniques. However, rule-based strategies lack the flexibility to adapt to time-varying market conditions, and RL-based methods struggle to capture essential historical dependencies and observations within Markov Decision Process (MDP) frameworks. Furthermore, these approaches often face challenges in ensuring strategy adaptability across diverse advertising objectives. Additionally, as offline training methods are increasingly adopted to facilitate the deployment and maintenance of stable online strategies, the issues of documented behavioral patterns and behavioral collapse resulting from training on fixed offline datasets become increasingly significant. To address these limitations, this paper introduces a novel offline Generative Auto-bidding framework with Value-Guided Explorations (GAVE). GAVE accommodates various advertising objectives through a score-based Return-To-Go (RTG) module. Moreover, GAVE integrates an action exploration mechanism with an RTG-based evaluation method to explore novel actions while ensuring stability-preserving updates. A learnable value function is also designed to guide the direction of action exploration and mitigate Out-of-Distribution (OOD) problems. Experimental results on two offline datasets and real-world deployments demonstrate that GAVE outperforms state-of-the-art baselines in both offline evaluations and online A/B tests. The implementation code is publicly available to facilitate reproducibility and further research.

Title: NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results

Authors: Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Jianxing Zhang, Jinlong Wu, Jun Wang, Zheng Xie, Hakjae Jeon, Suejin Han, Hyung-Ju Chun, Hyunhee Park, Zhicun Yin, Junjie Chen, Ming Liu, Xiaoming Li, Chao Zhou, Wangmeng Zuo, Weixia Zhang, Dingquan Li, Kede Ma, Yun Zhang, Zhuofan Zheng, Yuyue Liu, Shizhen Tang, Zihao Zhang, Yi Ning, Hao Jiang, Wenjie An, Kangmeng Yu, Chenyang Wang, Kui Jiang, Xianming Liu, Junjun Jiang, Yingfu Zhang, Gang He, Siqi Wang, Kepeng Xu, Zhenyang Liu, Changxin Zhou, Shanlan Shen, Yubo Duan, Yiang Chen, Jin Guo, Mengru Yang, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu, Hu Peng, Chunming He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14600
Pdf URL: https://arxiv.org/pdf/2504.14600
Copy Paste: [[2504.14600]] NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results(https://arxiv.org/abs/2504.14600)
Keywords: restoration, quality assessment
Abstract: This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.

Title: AlphaZero-Edu: Making AlphaZero Accessible to Everyone

Authors: Binjie Guo, Hanyu Zheng, Guowei Su, Ru Zhang, Haohan Jiang, Xurong Lin, Hongyan Wei, Aisheng Mo, Jie Li, Zhiyuan Qian, Zhuhao Zhang, Xiaoyuan Cheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14636
Pdf URL: https://arxiv.org/pdf/2504.14636
Copy Paste: [[2504.14636]] AlphaZero-Edu: Making AlphaZero Accessible to Everyone(https://arxiv.org/abs/2504.14636)
Keywords: generation
Abstract: Recent years have witnessed significant progress in reinforcement learning, especially with Zero-like paradigms, which have greatly boosted the generalization and reasoning abilities of large-scale language models. Nevertheless, existing frameworks are often plagued by high implementation complexity and poor reproducibility. To tackle these challenges, we present AlphaZero-Edu, a lightweight, education-focused implementation built upon the mathematical framework of AlphaZero. It boasts a modular architecture that disentangles key components, enabling transparent visualization of the algorithmic processes. Additionally, it is optimized for resource-efficient training on a single NVIDIA RTX 3090 GPU and features highly parallelized self-play data generation, achieving a 3.2-fold speedup with 8 processes. In Gomoku matches, the framework has demonstrated exceptional performance, achieving a consistently high win rate against human opponents. AlphaZero-Edu has been open-sourced at this https URL, providing an accessible and practical benchmark for both academic research and industrial applications.

Title: Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension

Authors: Lin Li, Wei Chen, Jiahui Li, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14642
Pdf URL: https://arxiv.org/pdf/2504.14642
Copy Paste: [[2504.14642]] Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension(https://arxiv.org/abs/2504.14642)
Keywords: generation
Abstract: Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (e.g., scene graph generation), particularly in modeling N-ary relationships that identify multiple semantic roles among an action event. Such a lack of semantic dependencies modeling among multi-entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.

Title: LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Authors: Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu
Subjects: cs.LG, cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2504.14655
Pdf URL: https://arxiv.org/pdf/2504.14655
Copy Paste: [[2504.14655]] LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs(https://arxiv.org/abs/2504.14655)
Keywords: generation
Abstract: We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
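
Illustrative sketch (not from the paper): a temporal split separates problems by release date so that evaluation items postdate the training data. The abstract gives only "pre/post July 2024"; the exact cutoff day (1 July 2024), the record layout, and the field names slug and release_date below are hypothetical and are not the dataset's actual schema.

```python
from datetime import date
from typing import Dict, List, Tuple

# Assumed cutoff: the abstract says "pre/post July 2024" without a day.
CUTOFF = date(2024, 7, 1)

Problem = Dict[str, object]  # hypothetical record with a "release_date" field


def temporal_split(
    problems: List[Problem], cutoff: date = CUTOFF
) -> Tuple[List[Problem], List[Problem]]:
    """Return (train, test): problems released before / on-or-after cutoff."""
    train = [p for p in problems if p["release_date"] < cutoff]
    test = [p for p in problems if p["release_date"] >= cutoff]
    return train, test


if __name__ == "__main__":
    sample = [
        {"slug": "old-problem", "release_date": date(2023, 11, 5)},
        {"slug": "new-problem", "release_date": date(2024, 9, 2)},
    ]
    train_set, test_set = temporal_split(sample)
    print(len(train_set), len(test_set))
```

Splitting on release date rather than at random is what guards against contamination: a model whose training data ends before the cutoff cannot have seen the test problems.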

Title: Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Authors: Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14666
Pdf URL: https://arxiv.org/pdf/2504.14666
Copy Paste: [[2504.14666]] Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens(https://arxiv.org/abs/2504.14666)
Keywords: generation, generative
Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence form an impossible language for LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: this https URL.

Title: SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-training

Authors: Shuang Zeng, Lei Zhu, Xinliang Zhang, Hangzhou He, Yanye Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14737
Pdf URL: https://arxiv.org/pdf/2504.14737
Copy Paste: [[2504.14737]] SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-training(https://arxiv.org/abs/2504.14737)
Keywords: generation
Abstract: Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning presents a potential but still problematic solution to this issue, because most existing methods focus on extracting instance-level or pixel-to-pixel representations, which ignores the characteristics shared by similar pixel groups within an image. Moreover, when generating contrastive pairs, most SOTA methods mainly rely on manually set thresholds, which requires a large number of gradient experiments and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pair generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Because superpixel clustering aligns well with the concept of contrastive pair generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we also propose two modules named Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL) to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate that SuperCL outperforms 12 existing methods. In particular, SuperCL yields more precise predictions in the visualization figures and DSC gains of 3.15%, 5.44%, and 7.89% over the previous best results on MMWHS, CHAOS, and Spleen with 10% annotations. Our code will be released after acceptance.
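
Illustrative sketch (not from the paper): the reported gains are in DSC, the Dice similarity coefficient, conventionally defined for binary masks A and B as 2|A ∩ B| / (|A| + |B|). The flat 0/1 mask representation, the function name, and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flat binary masks (0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


if __name__ == "__main__":
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```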

Title: Novel Concept-Oriented Synthetic Data approach for Training Generative AI-Driven Crystal Grain Analysis Using Diffusion Model

Authors: Ahmed Sobhi Saleh, Kristof Croes, Hajdin Ceric, Ingrid De Wolf, Houman Zahedmanesh
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2504.14782
Pdf URL: https://arxiv.org/pdf/2504.14782
Copy Paste: [[2504.14782]] Novel Concept-Oriented Synthetic Data approach for Training Generative AI-Driven Crystal Grain Analysis Using Diffusion Model(https://arxiv.org/abs/2504.14782)
Keywords: generative
Abstract: The traditional techniques for extracting polycrystalline grain structures from microscopy images, such as transmission electron microscopy (TEM) and scanning electron microscopy (SEM), are labour-intensive, subjective, and time-consuming, limiting their scalability for high-throughput analysis. In this study, we present an automated methodology integrating edge detection with generative diffusion models to effectively identify grains, eliminate noise, and connect broken segments in alignment with predicted grain boundaries. Due to the limited availability of adequate images preventing the training of deep machine learning models, a new seven-stage methodology is employed to generate synthetic TEM images for training. This concept-oriented synthetic data approach can be extended to any field of interest where the scarcity of data is a challenge. The presented model was applied to various metals with average grain sizes down to the nanoscale, producing grain morphologies from low-resolution TEM images that are comparable to those obtained from advanced and demanding experimental techniques with an average accuracy of 97.23%.

Title: When Cloud Removal Meets Diffusion Model in Remote Sensing

Authors: Zhenyu Yu, Mohd Yamani Idna Idris, Pei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14785
Pdf URL: https://arxiv.org/pdf/2504.14785
Copy Paste: [[2504.14785]] When Cloud Removal Meets Diffusion Model in Remote Sensing(https://arxiv.org/abs/2504.14785)
Keywords: generation
Abstract: Cloud occlusion significantly hinders remote sensing applications by obstructing surface information and complicating analysis. To address this, we propose DC4CR (Diffusion Control for Cloud Removal), a novel multimodal diffusion-based framework for cloud removal in remote sensing imagery. Our method introduces prompt-driven control, allowing selective removal of thin and thick clouds without relying on pre-generated cloud masks, thereby enhancing preprocessing efficiency and model adaptability. Additionally, we integrate low-rank adaptation for computational efficiency, subject-driven generation for improved generalization, and grouped learning to enhance performance on small datasets. Designed as a plug-and-play module, DC4CR seamlessly integrates into existing cloud removal models, providing a scalable and robust solution. Extensive experiments on the RICE and CUHK-CR datasets demonstrate state-of-the-art performance, achieving superior cloud removal across diverse conditions. This work presents a practical and efficient approach for remote sensing image processing with broad real-world applications.
摘要：云阻塞通过阻碍表面信息和复杂分析来显着阻碍遥感应用。为了解决这个问题，我们提出了DC4CR（用于云的扩散控制），这是一种新型的基于多模式扩散的框架，用于遥感图像中的云去除。我们的方法引入了迅速驱动的控制，可以选择性地去除薄和厚的云，而无需依赖预先生成的云面膜，从而提高了预处理效率和模型适应性。此外，我们集成了低秩适应性，以提高计算效率，主题驱动的生成以改善概括，并分组学习以提高小型数据集的性能。 DC4CR设计为即插即用的模块，无缝集成到现有的云拆卸模型中，提供了可扩展且可靠的解决方案。大米和CUHK-CR数据集进行的广泛实验表明了最先进的性能，从而在各种条件下实现了较高的云消除。这项工作为通过广泛的现实应用程序提供了一种实用有效的方法来处理遥感图像处理。

Title: Enhanced Data-driven Topology Design Methodology with Multi-level Mesh and Correlation-based Mutation for Stress-related Multi-objective Optimization

Authors: Jun Yang, Shintaro Yamasaki
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2504.14790
Pdf URL: https://arxiv.org/pdf/2504.14790
Copy Paste: [[2504.14790]] Enhanced Data-driven Topology Design Methodology with Multi-level Mesh and Correlation-based Mutation for Stress-related Multi-objective Optimization(https://arxiv.org/abs/2504.14790)
Keywords: generative
Abstract: Topology optimization (TO) serves as a widely applied structural design approach to tackle various engineering problems. Nevertheless, sensitivity-based TO methods usually struggle with solving strongly nonlinear optimization problems. By leveraging the high capacity of deep generative models, an influential machine learning technique, the sensitivity-free data-driven topology design (DDTD) methodology is regarded as an effective means of overcoming these issues. The DDTD methodology depends on an initial dataset with a certain regularity, making its results highly sensitive to initial dataset quality. This limits its effectiveness and generalizability, especially for optimization problems without a priori information. In this research, we propose a multi-level mesh DDTD-based method with a correlation-based mutation module to free the results from the limitations imposed by the quality of the initial dataset and to enhance computational efficiency. The core is to employ a correlation-based mutation module to assign new geometric features with physical meaning to the generated data, while utilizing a multi-level mesh strategy to progressively refine the structural representation, thus avoiding the maintenance of a high degree-of-freedom (DOF) representation throughout the iterative process. The proposed multi-level mesh DDTD-based method can be driven by a low-quality initial dataset without the need for time-consuming construction of a specific dataset, thus significantly increasing generality and reducing application difficulty, while further lowering the computational cost of the DDTD methodology. Various comparison experiments with the traditional sensitivity-based TO methods on stress-related strongly nonlinear problems demonstrate the generality and effectiveness of the proposed method.
摘要：拓扑优化（TO）是一种广泛应用的结构设计方法，可解决各种工程问题。然而，基于敏感的方法通常在解决强烈非线性优化问题方面努力。通过利用具有影响力的机器学习技术的高生成模型的高容量，无灵敏度的数据驱动拓扑设计（DDTD）方法被认为是克服这些问题的有效手段。 DDTD方法论取决于具有一定规律性的初始数据集，从而使其对初始数据集质量高度敏感。这限制了其有效性和可推广性，尤其是对于没有先验信息的优化问题。在这项研究中，我们提出了一种基于基于相关的突变模块的基于多级网格DDTD的方法，以逃避初始数据集质量在结果上的限制并提高计算效率。核心是采用基于相关的突变模块为生成的数据分配具有物理含义的新几何特征，同时利用多层网格策略来逐步增强结构表示的完善，从而避免维持在整个迭代过程中维持高度自由度（DOF）的表示。所提出的基于低质量的基于网格DDTD的方法可以由低质量的初始数据集驱动，而无需耗时的特定数据集构造，因此显着提高了通用性和降低应用程序的难度，同时进一步降低了DDTD方法的计算成本。与压力相关的强烈非线性问题的基于传统灵敏度的各种比较实验证明了该方法的一般性和有效性。

Title: Edge-boosted graph learning for functional brain connectivity analysis

Authors: David Yang, Mostafa Abdelmegeed, John Modl, Minjeong Kim
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2504.14796
Pdf URL: https://arxiv.org/pdf/2504.14796
Copy Paste: [[2504.14796]] Edge-boosted graph learning for functional brain connectivity analysis(https://arxiv.org/abs/2504.14796)
Keywords: generative
Abstract: Predicting disease states from functional brain connectivity is critical for the early diagnosis of severe neurodegenerative diseases such as Alzheimer's Disease and Parkinson's Disease. Existing studies commonly employ Graph Neural Networks (GNNs) to infer clinical diagnoses from node-based brain connectivity matrices generated through node-to-node similarities of regionally averaged fMRI signals. However, recent neuroscience studies found that such node-based connectivity does not accurately capture "functional connections" within the brain. This paper proposes a novel approach to brain network analysis that emphasizes edge functional connectivity (eFC), shifting the focus to inter-edge relationships. Additionally, we introduce a co-embedding technique to integrate edge functional connections effectively. Experimental results on the ADNI and PPMI datasets demonstrate that our method significantly outperforms state-of-the-art GNN methods in classifying functional brain networks.
摘要：通过功能性大脑连通性预测疾病状态对于早期诊断出严重神经退行性疾病（例如阿尔茨海默氏病和帕金森氏病）至关重要。现有研究通常采用图形神经网络（GNN）从基于节点的大脑连通性矩阵中推断出通过区域平均fMRI信号产生的基于节点的大脑连通性矩阵。 However, recent neuroscience studies found that such node-based connectivity does not accurately capture ``functional connections" within the brain. This paper proposes a novel approach to brain network analysis that emphasizes edge functional connectivity (eFC), shifting the focus to inter-edge relationships. Additionally, we introduce a co-embedding technique to integrate edge functional connections effectively. Experimental results on the ADNI and PPMI datasets demonstrate that our method在对功能性脑网络分类中的最先进的GNN方法大大胜过。

Title: Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models

Authors: Hao Xuan, Xingyu Li
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14798
Pdf URL: https://arxiv.org/pdf/2504.14798
Copy Paste: [[2504.14798]] Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models(https://arxiv.org/abs/2504.14798)
Keywords: generative
Abstract: Machine Unlearning (MUL) is crucial for privacy protection and content regulation, yet recent studies reveal that traces of forgotten information persist in unlearned models, enabling adversaries to resurface removed knowledge. Existing verification methods only confirm whether unlearning was executed, failing to detect such residual information leaks. To address this, we introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA), a post-unlearning verification framework that actively probes models for forgotten traces using adversarial queries. Extensive experiments on discriminative and generative tasks show that existing unlearning techniques remain vulnerable, even when passing existing verification metrics. By establishing UMA as a practical verification tool, this study sets a new standard for assessing and enhancing machine unlearning security.
摘要：机器划分（MUL）对于隐私保护和内容调节至关重要，但最近的研究表明，被遗忘的信息的痕迹持续存在，使对手能够消除知识的偏移。现有的验证方法仅确认是否执行了未经学习，无法检测到此类残留信息泄漏。为了解决这个问题，我们介绍了坚固的学习概念，确保模型与对抗性恢复的重新培训和抵抗力是无法区分的。为了凭经验评估未学习技术是否符合此安全标准，我们提出了未经学习的映射攻击（UMA），这是一个未检验后的验证框架，使用对抗性查询积极探究被遗忘的痕迹的模型。关于判别和生成任务的广泛实验表明，即使通过现有验证指标，现有的未学习技术仍然脆弱。通过将UMA建立为实用验证工具，本研究为评估和增强机器学习安全性设定了新的标准。

Title: What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale

Authors: Xiaoyong Yuan, Xiaolong Ma, Linke Guo, Lan Zhang
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2504.14815
Pdf URL: https://arxiv.org/pdf/2504.14815
Copy Paste: [[2504.14815]] What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale(https://arxiv.org/abs/2504.14815)
Keywords: generation, generative
Abstract: Diffusion models (DMs) have revolutionized text-to-image generation, enabling the creation of highly realistic and customized images from text prompts. With the rise of parameter-efficient fine-tuning (PEFT) techniques like LoRA, users can now customize powerful pre-trained models using minimal computational resources. However, the widespread sharing of fine-tuned DMs on open platforms raises growing ethical and legal concerns, as these models may inadvertently or deliberately generate sensitive or unauthorized content, such as copyrighted material, private individuals, or harmful content. Despite the increasing regulatory attention on generative AI, there are currently no practical tools for systematically auditing these models before deployment. In this paper, we address the problem of concept auditing: determining whether a fine-tuned DM has learned to generate a specific target concept. Existing approaches typically rely on prompt-based input crafting and output-based image classification but suffer from critical limitations, including prompt uncertainty, concept drift, and poor scalability. To overcome these challenges, we introduce Prompt-Agnostic Image-Free Auditing (PAIA), a novel, model-centric concept auditing framework. By treating the DM as the object of inspection, PAIA enables direct analysis of internal model behavior, bypassing the need for optimized prompts or generated images. We evaluate PAIA on 320 controlled models and 690 real-world community models sourced from a public DM sharing platform. PAIA achieves over 90% detection accuracy while reducing auditing time by 18-40x compared to existing baselines. To our knowledge, PAIA is the first scalable and practical solution for pre-deployment concept auditing of diffusion models, providing a practical foundation for safer and more transparent diffusion model sharing.
摘要：扩散模型（DMS）彻底改变了文本对图像的生成，从而从文本提示中创建了高度逼真和定制的图像。随着参数有效的微调（PEFT）技术（如Lora）的兴起，用户现在可以使用最小的计算资源自定义强大的预训练模型。但是，开放平台上的微调DMS的广泛共享引起了道德和法律问题的越来越多，因为这些模型可能会无意中或故意产生敏感或未经授权的内容，例如受版权保护的材料，私人个人或有害内容。尽管对生成AI的监管关注越来越多，但目前尚无实用的工具来系统地审核这些模型。在本文中，我们解决了概念审核的问题：确定微调的DM是否学会了生成特定的目标概念。现有方法通常依赖基于及时的输入工艺和基于输出的图像分类，但受到关键限制，包括及时的不确定性，概念漂移和可扩展性差。为了克服这些挑战，我们引入了及时无形图像审计（PAIA），这是一个新颖的以模型为中心的概念审计框架。通过将DM视为检查的对象，PAIA可以直接分析内部模型行为，绕开需要优化的提示或生成的图像。我们在320个受控模型和690个现实世界社区模型上评估PAIA，该模型来自公共DM共享平台。与现有基准相比，PAIA达到了90％以上的检测准确性，同时将审计时间减少18-40倍。据我们所知，PAIA是第一个可扩展和实用的解决方案，用于扩散模型的预部部门概念审核，为更安全，更透明的扩散模型共享提供了实用的基础。

Title: Distribution-aware Dataset Distillation for Efficient Image Restoration

Authors: Zhuoran Zheng, Xin Su, Chen Wu, Xiuyi Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14826
Pdf URL: https://arxiv.org/pdf/2504.14826
Copy Paste: [[2504.14826]] Distribution-aware Dataset Distillation for Efficient Image Restoration(https://arxiv.org/abs/2504.14826)
Keywords: restoration
Abstract: With the exponential increase in image data, training an image restoration model is laborious. Dataset distillation is a potential solution to this problem, yet current distillation techniques are a blank canvas in the field of image restoration. To fill this gap, we propose the Distribution-aware Dataset Distillation method (TripleD), a new framework that extends the principles of dataset distillation to image restoration. Specifically, TripleD uses a pre-trained vision Transformer to extract features from images for complexity evaluation, and the subset (the number of samples is much smaller than the original training set) is selected based on complexity. The selected subset is then fed through a lightweight CNN that fine-tunes the image distribution to align with the distribution of the original dataset at the feature level. To efficiently condense knowledge, the training is divided into two stages. Early stages focus on simpler, low-complexity samples to build foundational knowledge, while later stages select more complex and uncertain samples as the model matures. Our method achieves promising performance on multiple image restoration tasks, including multi-task image restoration, all-in-one image restoration, and ultra-high-definition image restoration tasks. Note that we can train a state-of-the-art image restoration model on an ultra-high-definition (4K resolution) dataset using only one consumer-grade GPU in less than 8 hours (500 savings in computing resources and immeasurable training time).
摘要：随着图像数据的指数增加，培训图像恢复模型是费力的。数据集蒸馏是解决此问题的潜在解决方案，但是当前的蒸馏技术是图像恢复领域中的空白画布。为了填补这一空白，我们提出了分布感知的数据集蒸馏方法（三倍），这是一个将数据集蒸馏的原理扩展到图像恢复的新框架。具体而言，三倍使用预训练的视觉变压器从图像中提取特征以进行复杂性评估，并且根据复杂性选择了子集（样品的数量比原始训练集小得多）。然后，通过轻巧的CNN喂食所选子集，该CNN将图像分布微调以与特征级别的原始数据集的分布保持一致。为了有效地凝结知识，培训分为两个阶段。早期阶段集中于更简单，低复杂的样本来建立基础知识，而后期阶段则选择随着模型成熟而选择更复杂和不确定的样本。我们的方法在多个图像恢复任务上实现了有希望的性能，包括多任务图像恢复，多合一的图像恢复和超高定义的图像恢复任务。请注意，我们可以在不到8小时内使用一个消费级GPU训练最先进的图像恢复模型（4K分辨率）数据集（计算资源和不可估量的培训时间节省了500个消费级GPU）。

Title: Twin Co-Adaptive Dialogue for Progressive Image Generation

Authors: Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14868
Pdf URL: https://arxiv.org/pdf/2504.14868
Copy Paste: [[2504.14868]] Twin Co-Adaptive Dialogue for Progressive Image Generation(https://arxiv.org/abs/2504.14868)
Keywords: generation
Abstract: Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.
摘要：现代的文本到图像生成系统使创建非常逼真和高质量的视觉效果，但是在处理用户提示中固有的歧义时，它们通常会摇摇欲坠。在这项工作中，我们介绍了Twin-CO，该框架利用同步的共同自适应对话来逐步完善图像生成。 Twin-CO不是静态生成过程，而是采用了动态的，迭代的工作流程，智能对话代理与用户不断互动。最初，从用户的提示中生成基本图像。然后，通过一系列同步对话交换，系统根据不断发展的用户反馈来调整并优化图像。共同自适应过程允许系统逐步缩小歧义性并更好地与用户意图保持一致。实验表明，Twin-CO不仅通过减少试用和错误的迭代来增强用户体验，而且还提高了生成的图像的质量，从而简化了各种应用程序的创作过程。

Title: Memory-Augmented Dual-Decoder Networks for Multi-Class Unsupervised Anomaly Detection

Authors: Jingyu Xing, Chenwei Tang, Tao Wang, Rong Xiao, Wei Ju, Ji-Zhe Zhou, Liangli Zhen, Jiancheng Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14884
Pdf URL: https://arxiv.org/pdf/2504.14884
Copy Paste: [[2504.14884]] Memory-Augmented Dual-Decoder Networks for Multi-Class Unsupervised Anomaly Detection(https://arxiv.org/abs/2504.14884)
Keywords: restoration
Abstract: Recent advances in unsupervised anomaly detection (UAD) have shifted from single-class to multi-class scenarios. In such complex contexts, the increasing pattern diversity has brought two challenges to reconstruction-based approaches: (1) over-generalization: anomalies that are subtle or share compositional similarities with normal patterns may be reconstructed with high fidelity, making them difficult to distinguish from normal instances; and (2) insufficient normality reconstruction: complex normal features, such as intricate textures or fine-grained structures, may not be faithfully reconstructed due to the model's limited representational capacity, resulting in false positives. Existing methods typically focus on addressing the former, which unintentionally exacerbates the latter, resulting in inadequate representation of intricate normal patterns. To concurrently address these two challenges, we propose Memory-augmented Dual-Decoder Networks (MDD-Net). This network includes two critical components: a Dual-Decoder Reverse Distillation Network (DRD-Net) and a Class-aware Memory Module (CMM). Specifically, the DRD-Net incorporates a restoration decoder designed to recover normal features from synthetic abnormal inputs and an identity decoder to reconstruct features that maintain the anomalous semantics. By exploiting the discrepancy between the features produced by the two decoders, our approach refines anomaly scores beyond the conventional encoder-decoder comparison paradigm, effectively reducing false positives and enhancing localization accuracy. Furthermore, the CMM explicitly encodes and preserves class-specific normal prototypes, actively steering the network away from anomaly reconstruction. Comprehensive experimental results across several benchmarks demonstrate the superior performance of our MDD-Net framework over current SoTA approaches in multi-class UAD tasks.
摘要：无监督的异常检测（UAD）的最新进展已从单阶段转变为多级场景。在如此复杂的环境中，增加的模式多样性给基于重建的方法带来了两个挑战：（1）过度属性：与正常模式相同的微妙或共享组成相似性的异常可能会以高忠诚重建，从而使它们难以与正常实例区分开；（2）不足的正态重建：复杂的正常特征，例如复杂的纹理或细粒结构，由于该模型的有限代表能力，可能不会忠实地重建，从而导致误报。现有方法通常集中于解决前者，这无意间加剧了后者，导致复杂的正常模式的表示不足。为了同时解决这两个挑战，我们提出了一个内存的双重码头网络（MDD-NET）。该网络包括两个关键组件：双码编码器反向蒸馏网络（DRD-NET）和一个类吸引的内存模块（CMM）。具体而言，DRD-NET结合了一种恢复解码器，旨在从合成异常输入中恢复正常特征，并为重建维持异常语义的特征恢复了标识解码器。通过利用两个解码器产生的特征之间的差异，我们的方法优化了异常得分，超出了传统的编码器比较比较范式，有效地降低了误报并提高了本地化精度。此外，CMM明确编码并保留了特定于类的正常原型，从而使网络远离异常重建。几个基准的全面实验结果表明，在多级UAD任务中，我们的MDD-NET框架比当前SOTA方法的出色表现。
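
The dual-decoder scoring idea in the MDD-Net abstract can be illustrated with a minimal sketch. The abstract only says that the discrepancy between the two decoders' features refines the anomaly score; the cosine distance used here, the function names, and the toy feature values are assumptions for illustration, not the authors' implementation.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector counts as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)

def anomaly_map(restored_feats, identity_feats):
    """Score each spatial location by how much the two decoders disagree.

    Both arguments are lists of per-location feature vectors: one from the
    restoration decoder, one from the identity decoder.
    """
    return [cosine_distance(r, i) for r, i in zip(restored_feats, identity_feats)]

# Toy features for three locations; the decoders disagree only at the last one.
restored = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
identity = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = anomaly_map(restored, identity)
```

Locations where the restoration decoder "repairs" the input while the identity decoder keeps the anomalous semantics get a high score; locations where both agree score near zero.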

Title: Latent Bayesian Optimization via Autoregressive Normalizing Flows

Authors: Seunghun Lee, Jinyoung Park, Jaewon Chu, Minseo Yoon, Hyunwoo J. Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.14889
Pdf URL: https://arxiv.org/pdf/2504.14889
Copy Paste: [[2504.14889]] Latent Bayesian Optimization via Autoregressive Normalizing Flows(https://arxiv.org/abs/2504.14889)
Keywords: generation, generative
Abstract: Bayesian Optimization (BO) has been recognized for its effectiveness in optimizing expensive and complex objective functions. Recent advancements in Latent Bayesian Optimization (LBO) have shown promise by integrating generative models such as variational autoencoders (VAEs) to manage the complexity of high-dimensional and structured data spaces. However, existing LBO approaches often suffer from the value discrepancy problem, which arises from the reconstruction gap between input and latent spaces. This value discrepancy problem propagates errors throughout the optimization process, leading to suboptimal outcomes. To address this issue, we propose Normalizing Flow-based Bayesian Optimization (NF-BO), which utilizes a normalizing flow as a generative model to establish a one-to-one encoding function from the input space to the latent space, along with its left-inverse decoding function, eliminating the reconstruction gap. Specifically, we introduce SeqFlow, an autoregressive normalizing flow for sequence data. In addition, we develop a new candidate sampling strategy that dynamically adjusts the exploration probability for each token based on its importance. Through extensive experiments, our NF-BO method demonstrates superior performance in molecule generation tasks, significantly outperforming both traditional and recent LBO approaches.
摘要：贝叶斯优化（BO）因其在优化昂贵且复杂的目标功能方面的有效性而被认可。潜在贝叶斯优化（LBO）的最新进展通过整合生成模型（例如变异自动编码器（VAE））来管理高维和结构化数据空间的复杂性，这表明了有望。但是，现有的LBO方法通常会遭受价值差异问题，这是由于输入和潜在空间之间的重建差距引起的。此值差异问题在整个优化过程中传播错误，从而导致次优结果。为了解决此问题，我们提出了一个基于流动的贝叶斯优化（NF-BO），该优化利用正常流量作为生成模型来建立从输入空间到潜在空间的一对一编码函数，以及其左内解码功能，从而消除了重构差距。具体而言，我们引入了SEQFLOW，这是序列数据的自回旋归一化流。此外，我们制定了一种新的候选抽样策略，该策略会根据其重要性动态调整每个令牌的探索概率。通过广泛的实验，我们的NF-BO方法在分子生成任务中表现出卓越的性能，极大地表现出了传统和最近的LBO方法的表现。
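
To see why an invertible encoder removes the reconstruction gap described in the NF-BO abstract, here is a toy autoregressive affine flow. It is a sketch under stated assumptions: it operates on continuous values rather than the token sequences SeqFlow targets, and the conditioner `_shift_and_log_scale` is a hypothetical stand-in for a learned network.

```python
import math

def _shift_and_log_scale(prefix):
    """Hypothetical conditioner: any function of the preceding elements works."""
    total = sum(prefix)
    return 0.5 * total, math.tanh(total)

def encode(x):
    """Map input x to latent z one element at a time (autoregressive)."""
    z = []
    for i, x_i in enumerate(x):
        shift, log_scale = _shift_and_log_scale(x[:i])
        z.append((x_i - shift) * math.exp(-log_scale))
    return z

def decode(z):
    """Invert encode by rebuilding x from left to right."""
    x = []
    for z_i in z:
        # The already-recovered prefix gives the same shift and scale as in encode.
        shift, log_scale = _shift_and_log_scale(x)
        x.append(z_i * math.exp(log_scale) + shift)
    return x

x = [0.3, -1.2, 2.0, 0.7]
x_rec = decode(encode(x))
max_gap = max(abs(a - b) for a, b in zip(x, x_rec))  # floating-point error only
```

Because `decode` is an exact left inverse of `encode`, a latent point and its decoded input always correspond to the same objective value, whereas a VAE's encoder and decoder need not round-trip.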

Title: Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Authors: Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14899
Pdf URL: https://arxiv.org/pdf/2504.14899
Copy Paste: [[2504.14899]] Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation(https://arxiv.org/abs/2504.14899)
Keywords: generation, generative
Abstract: Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.
摘要：相机和人类的运动控件已被广泛研究以进行视频生成，但是现有方法通常会分别解决它们，遭受有限的数据，这两个方面都具有高质量的注释。为了克服这一点，我们提出了Uni3C，这是一个统一的3D增强框架，用于在视频生成中精确控制相机和人类运动。 UNI3C包括两个关键贡献。首先，我们提出了一个通过冷冻视频生成主链PCDController训练的插件控制模块，该模块利用单眼深度从单眼深度来实现准确的相机控制。通过利用Point Clouds的强3D先验和视频基础模型的强大能力，PCDController表现出令人印象深刻的概括，无论推论骨干线是冷冻还是微调的，都表现出色。这种灵活性使UNI3C的不同模块可以在特定领域（即摄像机控制或人体运动控制）中进行训练，从而降低了对共同注释数据的依赖。其次，我们建议对推理阶段的共同对准3D世界指导，该指导阶段无缝地集成了风景云和SMPL-X字符，以分别统一相机和人类运动的控制信号。广泛的实验证实，PCDController在驾驶相机运动方面具有强大的鲁棒性，以进行视频生成的微调骨架。 Uni3c在相机可控性和人体运动质量方面大大优于竞争对手。此外，我们收集了量身定制的验证集，这些验证集具有挑战性的相机运动和人类行动，以验证我们方法的有效性。

Title: POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications

Authors: Chunjing Gan, Dan Yang, Binbin Hu, Ziqi Liu, Yue Shen, Zhiqiang Zhang, Jian Wang, Jun Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14917
Pdf URL: https://arxiv.org/pdf/2504.14917
Copy Paste: [[2504.14917]] POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications(https://arxiv.org/abs/2504.14917)
Keywords: generation
Abstract: Large language models (LLMs) have become a disruptive force in the industry, introducing unprecedented capabilities in natural language processing, logical reasoning and so on. However, the challenges of knowledge updates and hallucination issues have limited the application of LLMs in medical scenarios, where retrieval-augmented generation (RAG) can offer significant assistance. Nevertheless, existing retrieve-then-read approaches generally digest the retrieved documents without considering the timeliness, authoritativeness and commonality of retrieval. We argue that these approaches can be suboptimal, especially in real-world applications where information from different sources might conflict, and even information from the same source might differ across time scales; relying entirely on such retrieved documents would degrade the performance of RAG approaches. We propose PolyRAG, which carefully incorporates judges from different perspectives and finally integrates the polyviews for retrieval-augmented generation in medical applications. Because real-world benchmarks for evaluation are scarce, we bridge the gap by proposing PolyEVAL, a benchmark consisting of queries and documents collected from real-world medical scenarios (including medical policy, hospital & doctor inquiry and healthcare) with multiple tags (e.g., timeliness, authoritativeness) on them. Extensive experiments and analysis on PolyEVAL have demonstrated the superiority of PolyRAG.
摘要：大型语言模型（LLM）已成为该行业的破坏力，在自然语言处理，逻辑推理等方面引入了前所未有的功能。但是，知识更新和幻觉问题的挑战限制了LLM在医疗方案中的应用，在医疗方案中，检索效果发电（RAG）可以提供大量帮助。然而，现有的检索方法通常会消化检索文件，而不考虑检索的及时性，权威性和共同点。我们认为这些方法可能是次优的，尤其是在现实世界中的应用程序中，来自不同来源的信息可能相互冲突，甚至在不同时间范围内的信息也可能有所不同，并且完全依靠这将使RAG方法的性能恶化。我们提出了Polyrag，该Polyrag从不同的角度仔细地纳入了法官，并最终整合了在医疗应用中检索增强发电的PolyViews。由于现实世界中无法进行评估的基准缺乏，以弥合我们提出的多伴奏的差距，因此，基准由从现实世界中医疗场景（包括医疗政策，医院和医生查询和医疗保健）中收集的查询和文件组成。对多伴奏的广泛实验和分析证明了polyrag的优越性。

Title: TWIG: Two-Step Image Generation using Segmentation Masks in Diffusion Models

Authors: Mazharul Islam Rakib, Showrin Rahman, Joyanta Jyoti Mondal, Xi Xiao, David Lewis, Alessandra Mileo, Meem Arafat Manab
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14933
Pdf URL: https://arxiv.org/pdf/2504.14933
Copy Paste: [[2504.14933]] TWIG: Two-Step Image Generation using Segmentation Masks in Diffusion Models(https://arxiv.org/abs/2504.14933)
Keywords: generation, generative
Abstract: In today's age of social media and marketing, copyright issues can be a major roadblock to the free sharing of images. Generative AI models have made it possible to create high-quality images, but concerns about copyright infringement are a hindrance to their abundant use. As these models use data from training images to generate new ones, it is often a daunting task to ensure they do not violate intellectual property rights. Some AI models have even been noted to directly copy copyrighted images, a problem often referred to as source copying. Traditional copyright protection measures such as watermarks and metadata have also proven to be futile in this regard. To address this issue, we propose a novel two-step image generation model inspired by the conditional diffusion model. The first step involves creating an image segmentation mask for some prompt-based generated images. This mask embodies the shape of the image. Thereafter, the diffusion model is asked to generate the image anew while avoiding the shape in question. This approach shows a decrease in structural similarity from the training image, i.e. we are able to avoid the source copying problem using this approach without expensive retraining of the model or user-centered prompt generation techniques. This makes our approach the most computationally inexpensive approach to avoiding both copyright infringement and source copying for diffusion model-based image generation.
摘要：在当今社交媒体和市场营销时代，版权问题可能是图像免费分享的主要障碍。生成的AI模型使创建高质量的图像成为可能，但是对版权侵权的担忧是对它们丰富使用的障碍。由于这些模型使用培训图像生成新图像的数据，因此确保它们不会侵犯知识产权通常是一项艰巨的任务。甚至已经注意到一些AI模型直接复制受版权保护的图像，这通常称为源复制。在这方面，传统的版权保护措施（例如水印和元数据）也被证明是徒劳的。为了解决这个问题，我们提出了一个受条件扩散模型启发的新型两步图像生成模型。第一步涉及为一些基于提示的生成图像创建图像分割掩码。此面膜体现了图像的形状。此后，要求扩散模型重新生成图像，同时避免所讨论的形状。这种方法显示训练图像的结构相似性有所下降，即，我们能够使用这种方法避免源复制问题，而无需昂贵的模型或以用户为中心的及时生成技术。这使我们的方法成为避免版权侵权和基于扩散模型的图像生成的最便宜的方法。
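
The TWIG abstract reports a decrease in structural similarity to the training image. One common way to quantify this is the SSIM index; whether the authors use exactly this metric is an assumption. The sketch below computes a single-window (global) SSIM over flattened grayscale pixels, which is simpler than the usual sliding-window variant; the pixel values are made up for illustration.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equal-length lists of pixel intensities."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

training_image = [10.0, 50.0, 200.0, 120.0]   # toy flattened image
source_copy = list(training_image)            # a verbatim copy
regenerated = [245.0, 205.0, 55.0, 135.0]     # same pixels with inverted structure

ssim_copy = global_ssim(training_image, source_copy)
ssim_regen = global_ssim(training_image, regenerated)
```

A verbatim copy scores (numerically) 1, while an image whose structure differs from the training image scores lower, which is the direction of change the abstract describes.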

Title: Efficient Document Retrieval with G-Retriever

Authors: Manthankumar Solanki
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.14955
Pdf URL: https://arxiv.org/pdf/2504.14955
Copy Paste: [[2504.14955]] Efficient Document Retrieval with G-Retriever(https://arxiv.org/abs/2504.14955)
Keywords: generation
Abstract: Textual data question answering has gained significant attention due to its growing applicability. Recently, a novel approach leveraging the Retrieval-Augmented Generation (RAG) method was introduced, utilizing the Prize-Collecting Steiner Tree (PCST) optimization for sub-graph construction. However, this method focused solely on node attributes, leading to incomplete contextual understanding. In this paper, we propose an enhanced approach that replaces the PCST method with an attention-based sub-graph construction technique, enabling more efficient and context-aware retrieval. Additionally, we encode both node and edge attributes, leading to richer graph representations. Our method also incorporates an improved projection layer and multi-head attention pooling for better alignment with Large Language Models (LLMs). Experimental evaluations on the WebQSP dataset demonstrate that our approach is competitive and achieves marginally better results compared to the original method, underscoring its potential for more accurate question answering.
摘要：由于其适用性不断增长，文本数据问题的回答引起了极大的关注。最近，采用了奖品收集的施泰纳树（PCST）优化来实现次级结构的奖励，利用了一种新的方法来利用检索功能的生成（RAG）方法。但是，此方法仅关注节点属性，从而导致不完整的上下文理解。在本文中，我们提出了一种增强的方法，该方法用基于注意力的子图构建技术取代了PCST方法，从而更有效，具有背景感知的检索。此外，我们同时编码节点和边缘属性，从而导致图形表示。我们的方法还结合了改进的投影层和多头注意集合，以更好地与大语言模型（LLMS）更好地对齐。 WebQSP数据集上的实验评估表明，与原始方法相比，我们的方法具有竞争力，并且取得了更好的结果，从而强调了其潜力，以获得更准确的问题。

Title: Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization

Authors: Hongbin Xu, Chaohui Yu, Feng Xiao, Jiazheng Xing, Hai Ci, Weitao Chen, Ming Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.14975
Pdf URL: https://arxiv.org/pdf/2504.14975
Copy Paste: [[2504.14975]] Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization(https://arxiv.org/abs/2504.14975)
Keywords: generation, generative
Abstract: Despite the remarkable progress of 3D generation, achieving controllability, i.e., ensuring consistency between generated 3D content and input conditions like edge and depth, remains a significant challenge. Existing methods often struggle to maintain accurate alignment, leading to noticeable discrepancies. To address this issue, we propose Cyc3D, a new framework that enhances controllable 3D generation by explicitly encouraging cyclic consistency between the second-order 3D content, generated based on extracted signals from the first-order generation, and its original input controls. Specifically, we employ an efficient feed-forward backbone that can generate a 3D object from an input condition and a text prompt. Given an initial viewpoint and a control signal, a novel view is rendered from the generated 3D content, from which the extracted condition is used to regenerate the 3D content. This re-generated output is then rendered back to the initial viewpoint, followed by another round of control signal extraction, forming a cyclic process with two consistency constraints. View consistency ensures coherence between the two generated 3D objects, measured by semantic similarity to accommodate generative diversity. Condition consistency aligns the final extracted signal with the original input control, preserving structural or geometric details throughout the process. Extensive experiments on popular benchmarks demonstrate that Cyc3D significantly improves controllability, especially for fine-grained details, outperforming existing methods across various conditions (e.g., +14.17% PSNR for edge, +6.26% PSNR for sketch).
摘要：尽管3D生成取得了显着的进展，但实现了可控性，即确保生成的3D内容与Edge和Depth（例如边缘和深度）之间的一致性仍然是一个重大挑战。现有的方法通常难以保持准确的一致性，从而导致明显的差异。为了解决此问题，我们提出了一个新框架，这是一个新框架，通过明确鼓励基于一阶生成的提取信号及其原始输入控件生成的二阶3D内容之间的循环一致性来增强可控的3D生成。具体来说，我们采用有效的进料前臂，可以从输入条件和文本提示中生成3D对象。给定初始观点和控制信号，从生成的3D内容中呈现出新的视图，从中提取的条件可用于再生3D含量。然后将重新生成的输出渲染回初始视点，然后再进行另一轮控制信号提取，形成具有两个一致性约束的环状过程。 \ emph {视图一致性}确保两个生成的3D对象之间的连贯性，该对象通过语义相似性来衡量以适应生成多样性。 \ emph {条件一致性}将最终提取的信号与原始输入控件对齐，在整个过程中保留结构或几何细节。对流行基准测试的广泛实验表明，\ name {}显着提高了可控性，尤其是对于细粒细节，在各种条件下都超过了现有方法（例如，边缘的+14.17 \％\％psnr，+6.26 \％psnr用于草图）。
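
The cyclic process in the Cyc3D abstract can be sketched as a function that wires the steps together. Everything concrete here is an assumption for illustration: the callables `generate`, `render`, and `extract` stand in for the feed-forward backbone, the renderer, and the condition extractor, and a plain L1 distance replaces the semantic similarity the abstract mentions.

```python
def cycle_losses(generate, render, extract, view_distance, condition_distance,
                 control, prompt, initial_view, novel_view):
    """Run one generate-render-extract cycle and return both consistency terms."""
    first = generate(control, prompt, initial_view)        # first-order 3D content
    novel_control = extract(render(first, novel_view))     # condition seen from the novel view
    second = generate(novel_control, prompt, novel_view)   # second-order 3D content
    final_control = extract(render(second, initial_view))  # back at the initial viewpoint
    view_loss = view_distance(first, second)                    # view consistency
    condition_loss = condition_distance(final_control, control)  # condition consistency
    return view_loss, condition_loss

def l1(a, b):
    """Placeholder distance; the abstract uses semantic similarity for views."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Toy stand-ins: a "3D object" and an "image" are just lists of numbers.
def toy_render(obj, view):
    return [0.5 * v for v in obj]

def toy_extract(image):
    return list(image)

def faithful_generate(control, prompt, view):
    return [2.0 * c for c in control]

def drifting_generate(control, prompt, view):
    return [2.0 * c + 0.1 for c in control]  # ignores part of the control signal

control = [1.0, 0.0, 1.0]
faithful = cycle_losses(faithful_generate, toy_render, toy_extract, l1, l1,
                        control, "a chair", "front", "side")
drifting = cycle_losses(drifting_generate, toy_render, toy_extract, l1, l1,
                        control, "a chair", "front", "side")
```

With the faithful generator both terms are zero; with the drifting one both are positive, so minimising them would push the generator back toward respecting the input control.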

Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: KwaiSR Dataset and Study

Authors: Xin Li, Xijun Wang, Bingchen Li, Kun Yuan, Yizhen Shao, Suhang Yao, Ming Sun, Chao Zhou, Radu Timofte, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15003
Pdf URL: https://arxiv.org/pdf/2504.15003
Copy Paste: [[2504.15003]] NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: KwaiSR Dataset and Study(https://arxiv.org/abs/2504.15003)
Keywords: super-resolution, quality assessment
Abstract: In this work, we build the first benchmark dataset for short-form UGC Image Super-resolution in the wild, termed KwaiSR, intending to advance the research on developing image super-resolution algorithms for short-form UGC platforms. This dataset is collected from the Kwai Platform, which is composed of two parts, i.e., synthetic and wild parts. Among them, the synthetic dataset, including 1,900 image pairs, is produced by simulating the degradation following the distribution of real-world low-quality short-form UGC images, aiming to provide the ground truth for training and objective comparison in the validation/testing. The wild dataset contains low-quality images collected directly from the Kwai Platform, which are filtered using the quality assessment method KVQ from the Kwai Platform. As a result, the KwaiSR dataset contains 1800 synthetic image pairs and 1900 wild images, which are divided into training, validation, and testing parts with a ratio of 8:1:1. Based on the KwaiSR dataset, we organize the NTIRE 2025 challenge on a second short-form UGC Video quality assessment and enhancement, which attracts lots of researchers to develop the algorithm for it. The results of this competition have revealed that our KwaiSR dataset is pretty challenging for existing Image SR methods, which is expected to lead to a new direction in the image super-resolution field. The dataset can be found from this https URL.

Title: Insert Anything: Image Insertion via In-Context Editing in DiT

Authors: Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15009
Pdf URL: https://arxiv.org/pdf/2504.15009
Copy Paste: [[2504.15009]] Insert Anything: Image Insertion via In-Context Editing in DiT(https://arxiv.org/abs/2504.15009)
Keywords: generation
Abstract: This work presents Insert Anything, a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. Instead of training separate models for individual tasks, our approach is trained once on our new AnyInsertion dataset--comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion--and effortlessly generalizes to a wide range of insertion scenarios. Such a challenging setting requires capturing both identity features and fine-grained details, while allowing versatile local adaptations in style, color, and texture. To this end, we propose to leverage the multimodal attention of the Diffusion Transformer (DiT) to support both mask- and text-guided editing. Furthermore, we introduce an in-context editing mechanism that treats the reference image as contextual information, employing two prompting strategies to harmonize the inserted elements with the target scene while faithfully preserving their distinctive features. Extensive experiments on AnyInsertion, DreamBooth, and VTON-HD benchmarks demonstrate that our method consistently outperforms existing alternatives, underscoring its great potential in real-world applications such as creative content generation, virtual try-on, and scene composition.

Title: Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models

Authors: Zijin Yang, Xin Zhang, Kejiang Chen, Kai Zeng, Qiyi Yao, Han Fang, Weiming Zhang, Nenghai Yu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.15026
Pdf URL: https://arxiv.org/pdf/2504.15026
Copy Paste: [[2504.15026]] Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models(https://arxiv.org/abs/2504.15026)
Keywords: generation
Abstract: Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. Existing methods primarily focus on ensuring that watermark embedding does not degrade the model performance. However, they often overlook critical challenges in real-world deployment scenarios, such as the complexity of watermark key management, user-defined generation parameters, and the difficulty of verification by arbitrary third parties. To address this issue, we propose Gaussian Shading++, a diffusion model watermarking method tailored for real-world deployment. We propose a double-channel design that leverages pseudorandom error-correcting codes to encode the random seed required for watermark pseudorandomization, achieving performance-lossless watermarking under a fixed watermark key and overcoming key management challenges. Additionally, we model the distortions introduced during generation and inversion as an additive white Gaussian noise channel and employ a novel soft decision decoding strategy during extraction, ensuring strong robustness even when generation parameters vary. To enable third-party verification, we incorporate public key signatures, which provide a certain level of resistance against forgery attacks even when model inversion capabilities are fully disclosed. Extensive experiments demonstrate that Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness, making it a more practical solution for real-world deployment.
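To illustrate why soft-decision decoding helps on an additive white Gaussian noise channel, the sketch below compares hard and soft decoding of a simple repetition code. This is not the paper's scheme (which uses pseudorandom error-correcting codes). The repetition code, noise level, and all function names are assumptions chosen to keep the example small.

```python
import random

def encode(bits, repeat):
    """Repetition code: map each bit to +1/-1 and send it `repeat` times."""
    return [(1.0 if b else -1.0) for b in bits for _ in range(repeat)]

def awgn(symbols, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each symbol."""
    return [s + rng.gauss(0.0, sigma) for s in symbols]

def decode_hard(received, repeat):
    """Threshold every sample first, then take a majority vote per bit."""
    out = []
    for i in range(0, len(received), repeat):
        votes = sum(1 if r > 0 else -1 for r in received[i:i + repeat])
        out.append(1 if votes > 0 else 0)
    return out

def decode_soft(received, repeat):
    """Sum the raw samples per bit, so confident samples outweigh marginal ones."""
    out = []
    for i in range(0, len(received), repeat):
        out.append(1 if sum(received[i:i + repeat]) > 0 else 0)
    return out

def bit_errors(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(2000)]
repeat = 5  # odd, so the hard-decision vote never ties
received = awgn(encode(bits, repeat), sigma=1.5, rng=rng)
print("hard-decision errors:", bit_errors(bits, decode_hard(received, repeat)))
print("soft-decision errors:", bit_errors(bits, decode_soft(received, repeat)))
```

Soft decoding keeps the magnitude of each received sample, which is the information a hard threshold throws away.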

Title: DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation

Authors: Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15032
Pdf URL: https://arxiv.org/pdf/2504.15032
Copy Paste: [[2504.15032]] DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation(https://arxiv.org/abs/2504.15032)
Keywords: generation
Abstract: Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a training-free framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis.

Title: VistaDepth: Frequency Modulation With Bias Reweighting For Enhanced Long-Range Depth Estimation

Authors: Mingxia Zhan, Li Zhang, XiaoMeng Chu, Beibei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15095
Pdf URL: https://arxiv.org/pdf/2504.15095
Copy Paste: [[2504.15095]] VistaDepth: Frequency Modulation With Bias Reweighting For Enhanced Long-Range Depth Estimation(https://arxiv.org/abs/2504.15095)
Keywords: generation
Abstract: Monocular depth estimation (MDE) aims to predict per-pixel depth values from a single RGB image. Recent advancements have positioned diffusion models as effective MDE tools by framing the challenge as a conditional image generation task. Despite their progress, these methods often struggle with accurately reconstructing distant depths, due largely to the imbalanced distribution of depth values and an over-reliance on spatial-domain features. To overcome these limitations, we introduce VistaDepth, a novel framework that integrates adaptive frequency-domain feature enhancements with an adaptive weight-balancing mechanism into the diffusion process. Central to our approach is the Latent Frequency Modulation (LFM) module, which dynamically refines spectral responses in the latent feature space, thereby improving the preservation of structural details and reducing noisy artifacts. Furthermore, we implement an adaptive weighting strategy that modulates the diffusion loss in real-time, enhancing the model's sensitivity towards distant depth reconstruction. These innovations collectively result in superior depth perception performance across both distance and detail. Experimental evaluations confirm that VistaDepth achieves state-of-the-art performance among diffusion-based MDE techniques, particularly excelling in the accurate reconstruction of distant regions.

Title: Fast-Slow Co-advancing Optimizer: Toward Harmonious Adversarial Training of GAN

Authors: Lin Wang, Xiancheng Wang, Rui Wang, Zhibo Zhang, Minghang Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15099
Pdf URL: https://arxiv.org/pdf/2504.15099
Copy Paste: [[2504.15099]] Fast-Slow Co-advancing Optimizer: Toward Harmonious Adversarial Training of GAN(https://arxiv.org/abs/2504.15099)
Keywords: generative
Abstract: Up to now, the training processes of typical Generative Adversarial Networks (GANs) are still particularly sensitive to data properties and hyperparameters, which may lead to severe oscillations, difficulties in convergence, or even failures to converge, especially when the overall variances of the training sets are large. These phenomena are often attributed to the training characteristics of such networks. To address this problem, this paper develops a new intelligent optimizer, Fast-Slow Co-advancing Optimizer (FSCO), which employs reinforcement learning in the training process of GANs to make training easier. Specifically, the training step size is controlled by an agent to improve training stability, and the resulting variable learning rates make the training process more intelligent and GANs less sensitive to step size. Experiments have been conducted on three benchmark datasets to verify the effectiveness of the developed FSCO.

Title: Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration

Authors: Junyuan Deng, Xinyi Wu, Yongxing Yang, Congchao Zhu, Song Wang, Zhenyao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15159
Pdf URL: https://arxiv.org/pdf/2504.15159
Copy Paste: [[2504.15159]] Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration(https://arxiv.org/abs/2504.15159)
Keywords: restoration, generation, generative
Abstract: Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that the well-trained large T2I model (i.e., Flux) is able to produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples to mitigate the above issue. Specifically, we propose a training data construction pipeline for image restoration, namely FluxGen, which includes unconditional image generation, image selection, and degraded image simulation. A novel lightweight adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that our proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets - at only about 8.5% of the training cost compared to current approaches.
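For readers unfamiliar with the squeeze-and-excitation layers mentioned above, here is a minimal pure-Python sketch of the generic operation: global average pooling per channel, a small two-layer gating network, then channel-wise rescaling. It does not reproduce the FluxIR adapter. The function name, the omission of bias terms, and the random weights are assumptions for illustration.

```python
import math
import random

def squeeze_excite(feature_maps, w1, w2):
    """Apply a squeeze-and-excitation gate (biases omitted for brevity).

    feature_maps: list of C channels, each a list of rows (H x W floats).
    w1: reduction weights, shape (C // r) x C.
    w2: expansion weights, shape C x (C // r).
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average of every channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: linear -> ReLU -> linear -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply each channel by its gate in (0, 1).
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates

rng = random.Random(0)
channels, reduction = 4, 2
x = [[[rng.random() for _ in range(3)] for _ in range(3)] for _ in range(channels)]
w1 = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(channels // reduction)]
w2 = [[rng.uniform(-1, 1) for _ in range(channels // reduction)] for _ in range(channels)]
y, gates = squeeze_excite(x, w1, w2)
print([round(g, 3) for g in gates])
```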

Title: DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution

Authors: Miaomiao Cai, Simiao Li, Wei Li, Xudong Huang, Hanting Chen, Jie Hu, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15176
Pdf URL: https://arxiv.org/pdf/2504.15176
Copy Paste: [[2504.15176]] DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution(https://arxiv.org/abs/2504.15176)
Keywords: super-resolution, generation
Abstract: Recent advances in diffusion models have improved Real-World Image Super-Resolution (Real-ISR), but existing methods lack human feedback integration, risking misalignment with human preference and potentially leading to artifacts, hallucinations and harmful content generation. To this end, we are the first to introduce human preference alignment into Real-ISR, a technique that has been successfully applied in Large Language Models and Text-to-Image tasks to effectively enhance the alignment of generated outputs with human preferences. Specifically, we introduce Direct Preference Optimization (DPO) into Real-ISR to achieve alignment, where DPO serves as a general alignment technique that directly learns from the human preference dataset. Nevertheless, unlike high-level tasks, the pixel-level reconstruction objectives of Real-ISR are difficult to reconcile with the image-level preferences of DPO, which can make DPO overly sensitive to local anomalies and reduce generation quality. To resolve this dichotomy, we propose Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences by incorporating semantic guidance through two strategies: (a) a semantic instance alignment strategy, implementing instance-level alignment to ensure fine-grained perceptual consistency, and (b) a user description feedback strategy, mitigating hallucinations through semantic textual feedback on instance-level images. As a plug-and-play solution, DSPO proves highly effective in both one-step and multi-step SR frameworks.
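As background for the DPO component, the sketch below computes the standard DPO loss for one preference pair, as originally formulated for language models. How DSPO adapts this objective to instance-level semantics in Real-ISR is not described in the abstract, so nothing here should be read as the paper's loss. The function name and the default beta are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-likelihoods of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss falls as the policy favours the preferred sample more than the reference does.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))  # policy agrees with the preference
print(dpo_loss(-2.0, -1.0, -1.5, -1.5))  # policy disagrees with the preference
```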

Title: FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Authors: Fei Yin, Mallikarjun B R, Chun-Han Yao, Rafał Mantiuk, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15179
Pdf URL: https://arxiv.org/pdf/2504.15179
Copy Paste: [[2504.15179]] FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image(https://arxiv.org/abs/2504.15179)
Keywords: generation
Abstract: We present a novel framework for generating high-quality, animatable 4D avatars from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages shape, image, and video priors to create full-view, animatable avatars. Our approach first obtains an initial coarse shape through 3D-GAN inversion. Then, it enhances multiview textures using depth-guided warping signals for cross-view consistency with the help of the image diffusion model. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce a Consistent-Inconsistent training strategy to effectively handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art, while maintaining consistency across different viewpoints and expressions.

Title: Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform

Authors: Xianpan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15182
Pdf URL: https://arxiv.org/pdf/2504.15182
Copy Paste: [[2504.15182]] Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform(https://arxiv.org/abs/2504.15182)
Keywords: generation, generative
Abstract: The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline including shot boundary detection, OCR, border detection, motion filtering and fine bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: this https URL

Title: Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation

Authors: Yunxuan Cai, Sitao Xiang, Zongjian Li, Haiwei Chen, Yajie Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.15259
Pdf URL: https://arxiv.org/pdf/2504.15259
Copy Paste: [[2504.15259]] Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation(https://arxiv.org/abs/2504.15259)
Keywords: generation, generative
Abstract: Digital modeling and reconstruction of human faces serve various applications. However, its availability is often hindered by the requirements of data capturing devices, manual labor, and suitable actors. This situation restricts the diversity, expressiveness, and control over the resulting models. This work aims to demonstrate that a semantically controllable generative network can provide enhanced control over the digital face modeling process. To enhance diversity beyond the limited human faces scanned in a controlled setting, we introduce a novel data generation pipeline that creates a high-quality 3D face database using a pre-trained diffusion model. Our proposed normalization module converts synthesized data from the diffusion model into high-quality scanned data. Using the 44,000 face models we obtained, we further developed an efficient GAN-based generator. This generator accepts semantic attributes as input, and generates geometry and albedo. It also allows continuous post-editing of attributes in the latent space. Our asset refinement component subsequently creates physically-based facial assets. We introduce a comprehensive system designed for creating and editing high-quality face assets. Our proposed model has undergone extensive experiments, comparisons, and evaluations. We also integrate everything into a web-based interactive tool. We aim to make this tool publicly available with the release of the paper.

Title: StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

Authors: Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.15281
Pdf URL: https://arxiv.org/pdf/2504.15281
Copy Paste: [[2504.15281]] StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians(https://arxiv.org/abs/2504.15281)
Keywords: quality assessment
Abstract: 3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3DGS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on the NeRF synthetic dataset (objects) and the tandt db dataset (scenes), StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3DGS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.