2025-02-28

Title: On the Interpolation Effect of Score Smoothing

Authors: Zhengdao Chen
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2502.19499
Pdf URL: https://arxiv.org/pdf/2502.19499
Copy Paste: [[2502.19499]] On the Interpolation Effect of Score Smoothing(https://arxiv.org/abs/2502.19499)
Keywords: generation
Abstract: Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we examine the hypothesis that their generalization ability arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional linear subspace, we study the interplay between score smoothing and the denoising dynamics with mathematically solvable models. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate among the training data within their subspace while avoiding full memorization. We also present evidence that learning score functions with regularized neural networks can have a similar effect on the denoising dynamics as score smoothing.
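To make the idea in this abstract concrete, the following is a minimal one-dimensional sketch of an empirical score function and a smoothed version of it. The two training points, the noise level `sigma`, and the box-filter average used for smoothing are all illustrative assumptions; the paper's actual smoothing construction may differ.

```python
import math

def empirical_score(x, data, sigma):
    """Score d/dx log p_sigma(x), where p_sigma is the training set convolved with N(0, sigma^2)."""
    # Log-weights of each training point, shifted by the max for numerical stability.
    logits = [-(x - y) ** 2 / (2 * sigma ** 2) for y in data]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Weighted average of the pulls (y - x) / sigma^2 toward each training point.
    return sum(w * (y - x) for w, y in zip(weights, data)) / (total * sigma ** 2)

def smoothed_score(x, data, sigma, width, n_grid=201):
    """Box-filter average of the empirical score over [x - width, x + width] (an assumed smoother)."""
    xs = [x - width + 2 * width * k / (n_grid - 1) for k in range(n_grid)]
    return sum(empirical_score(z, data, sigma) for z in xs) / n_grid

data = [-1.0, 1.0]   # hypothetical training set on a line
sigma = 0.1          # hypothetical noise level
raw = empirical_score(0.3, data, sigma)
smooth = smoothed_score(0.3, data, sigma, width=0.6)
```

In this toy setting the smoothed score still points toward the nearest training point but with a smaller magnitude than the raw empirical score, which is one way to picture how denoising with a smoothed score can leave samples between training points rather than collapsing onto them.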

Title: Evaluating the Suitability of Different Intraoral Scan Resolutions for Deep Learning-Based Tooth Segmentation

Authors: Daron Weekley, Jace Duckworth, Anastasiia Sukhanova, Ananya Jana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19515
Pdf URL: https://arxiv.org/pdf/2502.19515
Copy Paste: [[2502.19515]] Evaluating the Suitability of Different Intraoral Scan Resolutions for Deep Learning-Based Tooth Segmentation(https://arxiv.org/abs/2502.19515)
Keywords: restoration
Abstract: Intraoral scans are widely used in digital dentistry for tasks such as dental restoration, treatment planning, and orthodontic procedures. These scans contain detailed topological information, but manual annotation of these scans remains a time-consuming task. Deep learning-based methods have been developed to automate tasks such as tooth segmentation. A typical intraoral scan contains over 200,000 mesh cells, making direct processing computationally expensive. Models are often trained on downsampled versions, typically with 10,000 or 16,000 cells. Previous studies suggest that downsampling may degrade segmentation accuracy, but the extent of this degradation remains unclear. Understanding the extent of degradation is crucial for deploying ML models on edge devices. This study evaluates the extent of performance degradation with decreasing resolution. We train a deep learning model (PointMLP) on intraoral scans decimated to 16K, 10K, 8K, 6K, 4K, and 2K mesh cells. Models trained at lower resolutions are tested on high-resolution scans to assess performance. Our goal is to identify a resolution that balances computational efficiency and segmentation accuracy.

Title: Retrieval Augmented Anomaly Detection (RAAD): Nimble Model Adjustment Without Retraining

Authors: Sam Pastoriza, Iman Yousfi, Christopher Redino, Marc Vucovich, Abdul Rahman, Sal Aguinaga, Dhruv Nandakumar
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2502.19534
Pdf URL: https://arxiv.org/pdf/2502.19534
Copy Paste: [[2502.19534]] Retrieval Augmented Anomaly Detection (RAAD): Nimble Model Adjustment Without Retraining(https://arxiv.org/abs/2502.19534)
Keywords: generation
Abstract: We propose a novel mechanism for real-time (human-in-the-loop) feedback focused on false positive reduction to enhance anomaly detection models. It was designed for the lightweight deployment of a behavioral network anomaly detection model. This methodology is easily integrable to similar domains that require a premium on throughput while maintaining high precision. In this paper, we introduce Retrieval Augmented Anomaly Detection, a novel method taking inspiration from Retrieval Augmented Generation. Human annotated examples are sent to a vector store, which can modify model outputs on the very next processed batch for model inference. To demonstrate the generalization of this technique, we benchmarked several different model architectures and multiple data modalities, including images, text, and graph-based data.
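The feedback loop described in this abstract can be sketched as follows. This is a hypothetical, in-memory illustration only: the `FeedbackStore` class, the cosine-similarity threshold, and the override rule are assumptions made for the example, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class FeedbackStore:
    """Hypothetical vector store holding human-annotated examples."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.items = []  # list of (embedding, human_label)

    def add(self, embedding, label):
        self.items.append((list(embedding), label))

    def lookup(self, embedding):
        """Return the label of the most similar stored example above the threshold, else None."""
        best_label, best_sim = None, self.threshold
        for stored, label in self.items:
            sim = cosine(embedding, stored)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        return best_label

def adjust(model_label, embedding, store):
    """Override the model's output when a close annotated neighbor exists."""
    human_label = store.lookup(embedding)
    return model_label if human_label is None else human_label

store = FeedbackStore(threshold=0.95)
store.add([1.0, 0.0], "benign")                    # analyst marks a false positive
near = adjust("anomaly", [0.99, 0.05], store)      # close to the annotated example
far = adjust("anomaly", [0.0, 1.0], store)         # unrelated input keeps the model output
```

Because the store is consulted at inference time, a newly added annotation affects the next batch without any retraining of the underlying model.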

Title: Improving Representation Learning of Complex Critical Care Data with ICU-BERT

Authors: Ricardo Santos, André V. Carreiro, Xi Peng, Hugo Gamboa, Holger Fröhlich
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19593
Pdf URL: https://arxiv.org/pdf/2502.19593
Copy Paste: [[2502.19593]] Improving Representation Learning of Complex Critical Care Data with ICU-BERT(https://arxiv.org/abs/2502.19593)
Keywords: generative
Abstract: The multivariate, asynchronous nature of real-world clinical data, such as that generated in Intensive Care Units (ICUs), challenges traditional AI-based decision-support systems. These often assume data regularity and feature independence and frequently rely on limited data scopes and manual feature engineering. The potential of generative AI technologies has not yet been fully exploited to analyze clinical data. We introduce ICU-BERT, a transformer-based model pre-trained on the MIMIC-IV database using a multi-task scheme to learn robust representations of complex ICU data with minimal preprocessing. ICU-BERT employs a multi-token input strategy, incorporating dense embeddings from a biomedical Large Language Model to learn a generalizable representation of complex and multivariate ICU data. In an initial evaluation on five tasks and four additional ICU datasets, the results indicate that fine-tuned ICU-BERT either matches or surpasses current performance benchmarks. By integrating structured and unstructured data, ICU-BERT advances the use of foundational models in medical informatics, offering an adaptable solution for clinical decision support across diverse applications.

Title: cMIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning

Authors: Micha Livne
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.19642
Pdf URL: https://arxiv.org/pdf/2502.19642
Copy Paste: [[2502.19642]] cMIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning(https://arxiv.org/abs/2502.19642)
Keywords: generative
Abstract: Learning representations that are useful for unknown downstream tasks is a fundamental challenge in representation learning. Prominent approaches in this domain include contrastive learning, self-supervised masking, and denoising auto-encoders. In this paper, we introduce a novel method, termed contrastive Mutual Information Machine (cMIM), which aims to enhance the utility of learned representations for downstream tasks. cMIM integrates a new contrastive learning loss with the Mutual Information Machine (MIM) learning framework, a probabilistic auto-encoder that maximizes the mutual information between inputs and latent representations while clustering the latent codes. Despite MIM's potential, initial experiments indicated that the representations learned by MIM were less effective for discriminative downstream tasks compared to state-of-the-art (SOTA) models. The proposed cMIM method directly addresses this limitation. The main contributions of this work are twofold: (1) We propose a novel contrastive extension to MIM for learning discriminative representations which eliminates the need for data augmentation and is robust to variations in the number of negative examples (i.e., batch size). (2) We introduce a generic method for extracting informative embeddings from encoder-decoder models, which significantly improves performance in discriminative downstream tasks without requiring additional training. This method is applicable to any pre-trained encoder-decoder model. By presenting cMIM, we aim to offer a unified generative model that is effective for both generative and discriminative tasks. Our results demonstrate that the learned representations are valuable for downstream tasks while maintaining the generative capabilities of MIM.

Title: Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality

Authors: Kanglei Zhou, Zikai Hao, Liyuan Wang, Xiaohui Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19644
Pdf URL: https://arxiv.org/pdf/2502.19644
Copy Paste: [[2502.19644]] Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality(https://arxiv.org/abs/2502.19644)
Keywords: quality assessment
Abstract: Virtual Reality Video Quality Assessment (VR-VQA) aims to evaluate the perceptual quality of 360-degree videos, which is crucial for ensuring a distortion-free user experience. Traditional VR-VQA methods trained on static datasets with limited distortion diversity struggle to balance correlation and precision. This becomes particularly critical when generalizing to diverse VR content and continually adapting to dynamic and evolving video distribution variations. To address these challenges, we propose a novel approach for assessing the perceptual quality of VR videos, Adaptive Score Alignment Learning (ASAL). ASAL integrates correlation loss with error loss to enhance alignment with human subjective ratings and precision in predicting perceptual quality. In particular, ASAL can naturally adapt to continually changing distributions through a feature space smoothing process that enhances generalization to unseen content. To further improve continual adaptation to dynamic VR environments, we extend ASAL with adaptive memory replay as a novel Continual Learning (CL) framework. Unlike traditional CL models, ASAL utilizes key frame extraction and feature adaptation to address the unique challenges of non-stationary variations with both the computation and storage restrictions of VR devices. We establish a comprehensive benchmark for VR-VQA and its CL counterpart, introducing new data splits and evaluation metrics. Our experiments demonstrate that ASAL outperforms recent strong baseline models, achieving overall correlation gains of up to 4.78% in the static joint training setting and 12.19% in the dynamic CL setting on various datasets. This validates the effectiveness of ASAL in addressing the inherent challenges of VR-VQA. Code is available at this https URL.
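The abstract says ASAL integrates a correlation loss with an error loss but does not give the formula. The sketch below shows one common way to combine the two, using one minus the Pearson correlation plus a weighted mean squared error. The function names, the choice of Pearson correlation, and the `weight` parameter are assumptions for illustration, not the paper's definition.

```python
import math

def pearson(pred, target):
    """Pearson linear correlation between predictions and subjective scores."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    return cov / (sp * st) if sp and st else 0.0

def mse(pred, target):
    """Mean squared error between predictions and subjective scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(pred, target, weight=1.0):
    """Assumed combination: the correlation term rewards correct ranking and trend, the error term rewards calibration."""
    return (1.0 - pearson(pred, target)) + weight * mse(pred, target)

scores = [1.0, 2.0, 3.0]                  # hypothetical subjective ratings
shifted = [s + 1.0 for s in scores]       # perfectly correlated but offset by one
loss_exact = combined_loss(scores, scores)
loss_shifted = combined_loss(shifted, scores)
```

The shifted predictions illustrate why both terms are useful: they are perfectly correlated with the ratings, so only the error term penalizes the constant offset.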

Title: SubZero: Composing Subject, Style, and Action via Zero-Shot Personalization

Authors: Shubhankar Borse, Kartikeya Bhardwaj, Mohammad Reza Karimi Dastjerdi, Hyojin Park, Shreya Kadambi, Shobitha Shivakumar, Prathamesh Mandke, Ankita Nayak, Harris Teague, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19673
Pdf URL: https://arxiv.org/pdf/2502.19673
Copy Paste: [[2502.19673]] SubZero: Composing Subject, Style, and Action via Zero-Shot Personalization(https://arxiv.org/abs/2502.19673)
Keywords: generative
Abstract: Diffusion models are increasingly popular for generative tasks, including personalized composition of subjects and styles. While diffusion models can generate user-specified subjects performing text-guided actions in custom styles, they require fine-tuning and are not feasible for personalization on mobile devices. Hence, tuning-free personalization methods such as IP-Adapters have progressively gained traction. However, for the composition of subjects and styles, these works are less flexible due to their reliance on ControlNet, or show content and style leakage artifacts. To tackle these, we present SubZero, a novel framework to generate any subject in any style, performing any action without the need for fine-tuning. We propose a novel set of constraints to enhance subject and style similarity, while reducing leakage. Additionally, we propose an orthogonalized temporal aggregation scheme in the cross-attention blocks of denoising model, effectively conditioning on a text prompt along with single subject and style images. We also propose a novel method to train customized content and style projectors to reduce content and style leakage. Through extensive experiments, we show that our proposed approach, while suitable for running on-edge, shows significant improvements over state-of-the-art works performing subject, style and action composition.

Title: BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

Authors: Xin Ye, Burhaneddin Yaman, Sheng Cheng, Feng Tao, Abhirup Mallik, Liu Ren
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.19694
Pdf URL: https://arxiv.org/pdf/2502.19694
Copy Paste: [[2502.19694]] BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance(https://arxiv.org/abs/2502.19694)
Keywords: generation
Abstract: Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3% in mAP and 10.1% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.

Title: You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving

Authors: Guangfeng Jiang, Jun Liu, Yongxuan Lv, Yuzhi Wu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19698
Pdf URL: https://arxiv.org/pdf/2502.19698
Copy Paste: [[2502.19698]] You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving(https://arxiv.org/abs/2502.19698)
Keywords: generation
Abstract: Outdoor LiDAR point cloud 3D instance segmentation is a crucial task in autonomous driving. However, it requires laborious human efforts to annotate the point cloud for training a segmentation model. To address this challenge, we propose a YoCo framework, which generates 3D pseudo labels using minimal coarse click annotations in the bird's eye view plane. It is a significant challenge to produce high-quality pseudo labels from sparse annotations. Our YoCo framework first leverages vision foundation models combined with geometric constraints from point clouds to enhance pseudo label generation. Second, a temporal and spatial-based label updating module is designed to generate reliable updated labels. It leverages predictions from adjacent frames and utilizes the inherent density variation of point clouds (dense near, sparse far). Finally, to further improve label quality, an IoU-guided enhancement module is proposed, replacing pseudo labels with high-confidence and high-IoU predictions. Experiments on the Waymo dataset demonstrate YoCo's effectiveness and generality, achieving state-of-the-art performance among weakly supervised methods and surpassing fully supervised Cylinder3D. Additionally, the YoCo is suitable for various networks, achieving performance comparable to fully supervised methods with minimal fine-tuning using only 0.8% of the fully labeled data, significantly reducing annotation costs.

Title: SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models

Authors: Mingsi Wang, Shuaiyin Yao, Chang Yue, Lijie Zhang, Guozhu Meng
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2502.19710
Pdf URL: https://arxiv.org/pdf/2502.19710
Copy Paste: [[2502.19710]] SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models(https://arxiv.org/abs/2502.19710)
Keywords: generation
Abstract: Given the need to evaluate the robustness of face recognition (FR) models, many efforts have focused on adversarial patch attacks that mislead FR models by introducing localized perturbations. Impersonation attacks are a significant threat because adversarial perturbations allow attackers to disguise themselves as legitimate users. This can lead to severe consequences, including data breaches, system damage, and misuse of resources. However, research on such attacks in FR remains limited. Existing adversarial patch generation methods exhibit limited efficacy in impersonation attacks due to (1) the need for high attacker capabilities, (2) low attack success rates, and (3) excessive query requirements. To address these challenges, we propose a novel method, SAP-DIFF, that leverages diffusion models to generate adversarial patches via semantic perturbations in the latent space rather than direct pixel manipulation. We introduce an attention disruption mechanism to generate features unrelated to the original face, facilitating the creation of adversarial samples, and a directional loss function to guide perturbations toward the target identity feature space, thereby enhancing attack effectiveness and efficiency. Extensive experiments on popular FR models and datasets demonstrate that our method outperforms state-of-the-art approaches, achieving an average attack success rate improvement of 45.66% (all exceeding 40%), and a reduction in the number of queries by about 40% compared to the SOTA approach.

Title: Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network

Authors: Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19754
Pdf URL: https://arxiv.org/pdf/2502.19754
Copy Paste: [[2502.19754]] Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network(https://arxiv.org/abs/2502.19754)
Keywords: generation
Abstract: In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance the efficiency and quality compared to the diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as next step on the path using complex networks through self-supervised training, which typically results in a gap with the global optimum. Meanwhile, most diffusion models are in the same path subspace generated by weights $f_A(t)$ and $f_B(t)$, as they follow the paradigm ($x_t = f_A(t)x_{Img} + f_B(t)\epsilon$). To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. Experiments show that LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network, and the KAN used for optimization is smaller than 0.1 MB. The FID metric is reduced by more than 15%, with a reduction of 48.50% when the NFE of DDIM is 5 on the CelebA dataset. Code is available at this https URL.
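The paradigm $x_t = f_A(t)x_{Img} + f_B(t)\epsilon$ quoted in this abstract can be illustrated with a short sketch. The weights below use the familiar variance-preserving choice with a linear beta schedule, which is an assumption made for the example; the paper's point is that these weights span a path subspace that can itself be optimized, and the optimized weights are not reproduced here.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under an assumed linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def f_A(t):
    """Signal weight: how much of the clean image survives at step t."""
    return math.sqrt(alpha_bar(t))

def f_B(t):
    """Noise weight: how much Gaussian noise is mixed in at step t."""
    return math.sqrt(1.0 - alpha_bar(t))

def noisy_sample(x_img, t, rng):
    """Draw x_t = f_A(t) * x_img + f_B(t) * eps with eps ~ N(0, I), elementwise."""
    return [f_A(t) * x + f_B(t) * rng.gauss(0.0, 1.0) for x in x_img]

rng = random.Random(0)
x_img = [0.2, -0.5, 0.9]            # hypothetical flattened image
x_mid = noisy_sample(x_img, 500, rng)
```

Any other pair of weight functions with the same boundary behavior (all signal at one end, mostly noise at the other) defines a different path in the same subspace.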

Title: MFSR: Multi-fractal Feature for Super-resolution Reconstruction with Fine Details Recovery

Authors: Lianping Yang, Peng Jiao, Jinshan Pan, Hegui Zhu, Su Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19797
Pdf URL: https://arxiv.org/pdf/2502.19797
Copy Paste: [[2502.19797]] MFSR: Multi-fractal Feature for Super-resolution Reconstruction with Fine Details Recovery(https://arxiv.org/abs/2502.19797)
Keywords: super-resolution
Abstract: In the process of performing image super-resolution processing, the processing of complex localized information can have a significant impact on the quality of the image generated. Fractal features can capture the rich details of both micro and macro texture structures in an image. Therefore, we propose a diffusion model-based super-resolution method incorporating fractal features of low-resolution images, named MFSR. MFSR leverages these fractal features as reinforcement conditions in the denoising process of the diffusion model to ensure accurate recovery of texture information. MFSR employs convolution as a soft assignment to approximate the fractal features of low-resolution images. This approach is also used to approximate the density feature maps of these images. By using soft assignment, the spatial layout of the image is described hierarchically, encoding the self-similarity properties of the image at different scales. Different processing methods are applied to various types of features to enrich the information acquired by the model. In addition, a sub-denoiser is integrated in the denoising U-Net to reduce the noise in the feature maps during the up-sampling process in order to improve the quality of the generated images. Experiments conducted on various face and natural image datasets demonstrate that MFSR can generate higher quality images.

Title: UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition

Authors: Xiao Lin, Yuge Huang, Jianqing Xu, Yuxi Mi, Shuigeng Zhou, Shouhong Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19803
Pdf URL: https://arxiv.org/pdf/2502.19803
Copy Paste: [[2502.19803]] UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition(https://arxiv.org/abs/2502.19803)
Keywords: generation
Abstract: Face recognition (FR) stands as one of the most crucial applications in computer vision. The accuracy of FR models has significantly improved in recent years due to the availability of large-scale human face datasets. However, directly using these datasets can inevitably lead to privacy and legal problems. Generating synthetic data to train FR models is a feasible solution to circumvent these issues. While existing synthetic-based face recognition methods have made significant progress in generating identity-preserving images, they are severely plagued by context overfitting, resulting in a lack of intra-class diversity of generated images and poor face recognition performance. In this paper, we propose a framework to Unleash the Inherent capability of the model to enhance intra-class diversity for synthetic face recognition, abbreviated as UIFace. Our framework first trains a diffusion model that can perform sampling conditioned on either identity contexts or a learnable empty context. The former generates identity-preserving images but lacks variations, while the latter exploits the model's intrinsic ability to synthesize intra-class-diversified images but with random identities. Then we adopt a novel two-stage sampling strategy during inference to fully leverage the strengths of both types of contexts, resulting in images that are diverse as well as identity-preserving. Moreover, an attention injection module is introduced to further augment the intra-class variations by utilizing attention maps from the empty context to guide the sampling process in ID-conditioned generation. Experiments show that our method significantly surpasses previous approaches with even less training data and half the size of the synthetic dataset. The proposed UIFace even achieves comparable performance with FR models trained on real datasets when we further increase the number of synthetic identities.

Title: Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study

Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19828
Pdf URL: https://arxiv.org/pdf/2502.19828
Copy Paste: [[2502.19828]] Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study(https://arxiv.org/abs/2502.19828)
Keywords: generation
Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing that biases in the CLIP text encoder significantly impact text-to-image generation tasks. Our experiments demonstrate how these biases affect CLIP's performance in image-caption matching and generation tasks, particularly when manipulating object sizes and their order in captions. This work contributes valuable insights into CLIP's behavior in complex visual environments and highlights areas for improvement in future vision-language models.

Title: Knowledge Bridger: Towards Training-free Missing Multi-modality Completion

Authors: Guanzhou Ke, Shengfeng He, Xiao Li Wang, Bo Wang, Guoqing Chao, Yuanyang Zhang, Yi Xie, HeXing Su
Subjects: cs.LG, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2502.19834
Pdf URL: https://arxiv.org/pdf/2502.19834
Copy Paste: [[2502.19834]] Knowledge Bridger: Towards Training-free Missing Multi-modality Completion(https://arxiv.org/abs/2502.19834)
Keywords: generation
Abstract: Previous successful approaches to missing modality completion rely on carefully designed fusion techniques and extensive pre-training on complete data, which can limit their generalizability in out-of-domain (OOD) scenarios. In this study, we pose a new challenge: can we develop a missing modality completion model that is both resource-efficient and robust to OOD generalization? To address this, we present a training-free framework for missing modality completion that leverages large multimodal models (LMMs). Our approach, termed the "Knowledge Bridger", is modality-agnostic and integrates generation and ranking of missing modalities. By defining domain-specific priors, our method automatically extracts structured information from available modalities to construct knowledge graphs. These extracted graphs connect the missing modality generation and ranking modules through the LMM, resulting in high-quality imputations of missing modalities. Experimental results across both general and medical domains show that our approach consistently outperforms competing methods, including in OOD generalization. Additionally, our knowledge-driven generation and ranking techniques demonstrate superiority over variants that directly employ LMMs for generation and ranking, offering insights that may be valuable for applications in other domains.
摘要：以前的成功方法完成的方法取决于精心设计的融合技术和对完整数据的大量预培训，这可能会限制其在室外（OOD）方案中的普遍性。在这项研究中，我们提出了一个新的挑战：我们能否开发一个缺失的模式完成模型，既资源效率又强大？为了解决这个问题，我们提出了一个无训练的框架，用于缺少模式完成，该框架利用大型多模型（LMMS）。我们的方法称为“知识布里奇”，是模态敏捷的，并整合了缺失模态的产生和排名。通过定义特定领域的先验，我们的方法自动从可用模式中提取结构化信息以构建知识图。这些提取的图将缺失的模态生成和通过LMM进行排名，从而导致缺失模态的高质量归档。一般和医疗领域的实验结果表明，我们的方法始终优于竞争方法，包括在OOD概括中。此外，我们的知识驱动的生成和排名技术表明，与直接采用LMM进行生成和排名的变体相比，提供了对其他领域应用程序可能有价值的见解。

Title: ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

Authors: Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19844
Pdf URL: https://arxiv.org/pdf/2502.19844
Copy Paste: [[2502.19844]] ProAPO: Progressively Automatic Prompt Optimization for Visual Classification(https://arxiv.org/abs/2502.19844)
Keywords: generation
Abstract: Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts by one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
摘要：视觉语言模型（VLM）通过大规模配对的图像文本数据训练在图像分类方面取得了重大进展。他们的表演很大程度上取决于及时的质量。虽然最近的方法表明，大语模型（LLMS）生成的视觉描述增强了VLM的概括，但由于LLMS的幻觉，特定于类的提示可能是不准确或缺乏歧视的。在本文中，我们旨在找到视觉上的歧视性提示，以最小的监督和没有人类的人性化。提出了一种基于进化的算法，以逐步优化从特定于任务的模板到特定于类的描述的语言提示。与优化模板不同，搜索空间在特定于类的候选提示中显示了爆炸。这增加了迅速的发电成本，迭代时间和过度拟合的问题。为此，我们首先介绍了几个简单但有效的基于进化的操作，以通过LLMS的一次性查询生成多样化的候选提示。然后，提出了两种抽样策略，以找到更好的初始搜索点并减少经过的类别，从而节省迭代成本。此外，我们将新颖的健身评分和熵约束应用于减轻过度拟合。在具有挑战性的单次图像分类设置中，我们的方法优于现有的基于文本提示的方法，并改善了13个数据集的LLM生成的描述方法。同时，我们证明了最佳提示改善基于适配器的方法并在不同的骨架上有效地转移。

Title: One-for-More: Continual Diffusion Model for Anomaly Detection

Authors: Xiaofan Li, Xin Tan, Zhuo Chen, Zhizhong Zhang, Ruixin Zhang, Rizen Guo, Guanna Jiang, Yulong Chen, Yanyun Qu, Lizhuang Ma, Yuan Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19848
Pdf URL: https://arxiv.org/pdf/2502.19848
Copy Paste: [[2502.19848]] One-for-More: Continual Diffusion Model for Anomaly Detection(https://arxiv.org/abs/2502.19848)
Keywords: generative
Abstract: With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe "faithfulness hallucination" and "catastrophic forgetting", which leave it unable to cope with unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection regularizes model updates by steering the gradient toward directions that protect the learned knowledge. However, as a double-edged sword, it also incurs a huge memory cost due to the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the diffusion model's risk of "over-fitting" to normal images, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, our method achieves first place in 17/18 settings on MVTec and VisA. Code is available at this https URL
摘要：随着生成模型的兴起，人们对在生成框架中统一所有任务的兴趣越来越大。异常检测方法也属于此范围，并在给出任意异常图像时利用扩散模型来生成或重建正常样品。但是，我们的研究发现，扩散模型遭受了严重的``忠实幻觉''和``灾难性遗忘''的侵害，这无法满足无法预测的模式增量。为了减轻上述问题，我们提出了一个连续扩散模型，该模型使用梯度投影来实现稳定的持续学习。梯度投影通过将梯度朝着保护学习知识的方向修改，在模型更新模型上部署正规化。但是，作为一把双刃剑，它还需要马尔可夫流程带来的巨大记忆成本。因此，我们基于线性表示的瞬态属性提出了一种迭代奇异值分解方法，该方法消耗了微小的记忆，几乎没有绩效损失。最后，考虑到扩散模型的正常图像``过拟合''的风险，我们提出了一个反式屏蔽网络，以增强扩散模型的条件机制。为了进行持续的异常检测，我们在MVTEC和Visa的17/18设置中获得了第一名。代码可在此HTTPS URL上找到

Title: C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation

Authors: Yuhao Li, Mirana Claire Angel, Salman Khan, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19868
Pdf URL: https://arxiv.org/pdf/2502.19868
Copy Paste: [[2502.19868]] C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation(https://arxiv.org/abs/2502.19868)
Keywords: generation
Abstract: Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, existing trajectory-based approaches are usually limited to generating the motion trajectory of the controlled object and ignore the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons about the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion-controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects that can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at this https URL.
摘要：基于轨迹的运动控制已成为可控视频生成的直观有效的方法。但是，现有的基于轨迹的方法通常仅限于仅生成受控对象的运动轨迹，而忽略受控对象及其周围环境之间的动态相互作用。为了解决这一限制，我们为可控视频生成（名为C-Drag）提出了一个基于想象的运动控制器。我们的c-Drag不是直接生成某些对象的运动，而是首先执行对象感知，然后根据对象的给定运动控制来推理不同对象之间的动态相互作用。具体而言，我们的方法包括一个对象感知模块和一个基于想法的运动推理模块。对象感知模块采用视觉语言模型来捕获图像中各种对象的位置和类别信息。基于思想链的运动推理模块将此信息作为输入，并进行阶段推理过程，以生成每个受影响对象的运动轨迹，随后将其馈送到视频合成的扩散模型中。此外，我们引入了一个新的视频对象互动（VOI）数据集，以评估运动控制视频生成方法的发电质量。我们的VOI数据集包含三种典型的交互类型，并提供了可用于准确性能评估的对象的运动轨迹。实验结果表明，C-Drag在多个指标中实现了有希望的性能，在对象运动控制方面表现出色。我们的基准，代码和模型将在此HTTPS URL上可用。

Title: GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors

Authors: An Li, Zhe Zhu, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19896
Pdf URL: https://arxiv.org/pdf/2502.19896
Copy Paste: [[2502.19896]] GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors(https://arxiv.org/abs/2502.19896)
Keywords: generation, generative
Abstract: Existing point cloud completion methods, which typically depend on predefined synthetic training datasets, encounter significant challenges when applied to out-of-distribution, real-world scans. To overcome this limitation, we introduce a zero-shot completion framework, termed GenPC, designed to reconstruct high-quality real-world scans by leveraging explicit 3D generative priors. Our key insight is that recent feed-forward 3D generative models, trained on extensive internet-scale data, have demonstrated the ability to perform 3D generation from single-view images in a zero-shot setting. To harness this for completion, we first develop a Depth Prompting module that links partial point clouds with image-to-3D generative models by leveraging depth images as a stepping stone. To retain the original partial structure in the final results, we design the Geometric Preserving Fusion module that aligns the generated shape with input by adaptively adjusting its pose and scale. Extensive experiments on widely used benchmarks validate the superiority and generalizability of our approach, bringing us a step closer to robust real-world scan completion.
摘要：现有的点云完成方法通常取决于预定义的合成训练数据集，当应用于分布外的现实世界扫描时会遇到重大挑战。为了克服这一限制，我们引入了一个零拍的完成框架，称为GENPC，旨在通过利用显式3D生成先验来重建高质量的现实世界扫描。我们的关键见解是，最近在互联网规模的数据中训练的近期馈电3D生成模型已经证明了在零摄影设置中从单视图中执行3D代的能力。为了实现这一点，我们首先开发一个深度促使模块，该模块将部分云与图像到3D生成模型联系起来，通过将深度图像作为垫脚石来链接。为了在最终结果中保留原始的部分结构，我们设计了几何保存融合模块，该模块通过自适应调整其姿势和比例来使生成的形状与输入相一致。广泛使用基准的广泛实验验证了我们方法的优越性和概括性，使我们更接近可靠的现实世界扫描完成。

Title: Shifting the Paradigm: A Diffeomorphism Between Time Series Data Manifolds for Achieving Shift-Invariancy in Deep Learning

Authors: Berken Utku Demirel, Christian Holz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.19921
Pdf URL: https://arxiv.org/pdf/2502.19921
Copy Paste: [[2502.19921]] Shifting the Paradigm: A Diffeomorphism Between Time Series Data Manifolds for Achieving Shift-Invariancy in Deep Learning(https://arxiv.org/abs/2502.19921)
Keywords: generation
Abstract: Deep learning models lack shift invariance, making them sensitive to input shifts that cause changes in output. While recent techniques seek to address this for images, our findings show that these approaches fail to provide shift-invariance in time series, where the data generation mechanism is more challenging due to the interaction of low and high frequencies. Worse, they also decrease performance across several tasks. In this paper, we propose a novel differentiable bijective function that maps samples from their high-dimensional data manifold to another manifold of the same dimension, without any dimensional reduction. Our approach guarantees that samples -- when subjected to random shifts -- are mapped to a unique point in the manifold while preserving all task-relevant information without loss. We theoretically and empirically demonstrate that the proposed transformation guarantees shift-invariance in deep learning models without imposing any limits to the shift. Our experiments on six time series tasks with state-of-the-art methods show that our approach consistently improves the performance while enabling models to achieve complete shift-invariance without modifying or imposing restrictions on the model's topology. The source code is available on GitHub at this https URL.
摘要：深度学习模型缺乏转移不变性，使其对导致输出变化的输入转移敏感。尽管最近的技术试图解决图像，但我们的发现表明，这些方法无法在时间序列中提供转变，因为由于低频和高频的相互作用，数据生成机制更具挑战性。更糟糕的是，它们还降低了几个任务的性能。在本文中，我们提出了一种新型的可区分射击功能，该函数将样品从其高维数据歧管映射到同一维度的另一个歧管，而没有任何维度降低。我们的方法确保样品（进行随机换档）被映射到流派中的一个独特点，同时保留所有与任务相关的信息而不会损失。从理论上讲，我们从理论上和经验上证明，所提出的转型可以保证深度学习模型中的转移不变，而不会对转变施加任何限制。我们对使用最新方法的六个时间序列任务进行的实验表明，我们的方法一致地提高了性能，同时使模型能够实现完全的换档，而无需对模型拓扑的修改或施加限制。源代码可在\ href {此https url} {github}上获得。

Title: Identity-preserving Distillation Sampling by Fixed-Point Iterator

Authors: SeonHwa Kim, Jiwon Kim, Soobin Park, Donghoon Ahn, Jiwon Kang, Seungryong Kim, Kyong Hwan Jin, Eunju Cha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.19930
Pdf URL: https://arxiv.org/pdf/2502.19930
Copy Paste: [[2502.19930]] Identity-preserving Distillation Sampling by Fixed-Point Iterator(https://arxiv.org/abs/2502.19930)
Keywords: generation
Abstract: Score distillation sampling (SDS) demonstrates a powerful capability for text-conditioned 2D image and 3D object generation by distilling the knowledge from learned score functions. However, SDS often suffers from blurriness caused by noisy gradients. When SDS is applied to image editing, such degradations can be reduced by adjusting bias shifts using reference pairs, but the de-biasing techniques are still corrupted by erroneous gradients. To this end, we introduce Identity-preserving Distillation Sampling (IDS), which compensates for the gradient leading to undesired changes in the results. Based on the analysis that these errors come from the text-conditioned scores, a new regularization technique, called fixed-point iterative regularization (FPR), is proposed to modify the score itself, driving the preservation of identity, including poses and structures. Thanks to self-correction by FPR, the proposed method provides clear and unambiguous representations corresponding to the given prompts in image-to-image editing and editable neural radiance field (NeRF). The structural consistency between the source and the edited data is clearly maintained compared to other state-of-the-art methods.
摘要：得分蒸馏采样（SDS）通过将知识从学习分数功能中提取，证明了文本条件2D图像和3D对象生成的强大能力。但是，SD通常遭受嘈杂梯度引起的模糊性。当SDS符合图像编辑时，可以通过使用参考对调整偏置变化来减少此类降解，但是偏低的技术仍被错误的梯度损坏。为此，我们介绍了赋予身份的蒸馏采样（IDS），该采样（IDS）补偿了梯度导致结果不希望的变化。基于这些错误来自文本条件分数的分析，提出了一种称为定点迭代正则化（FPR）的新的正则化技术，以修改分数本身，推动身份的保留，甚至包括姿势和结构。得益于FPR的自我纠正，该建议的方法提供了与图像到图像编辑和可编辑神经辐射场（NERF）中给定提示相对应的清晰明确的表示。与其他最新方法相比，源和编辑数据之间的结构一致性显然是维持的。

Title: Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

Authors: Chao Wang, Luning Zhang, Zheng Wang, Yang Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.19973
Pdf URL: https://arxiv.org/pdf/2502.19973
Copy Paste: [[2502.19973]] Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios(https://arxiv.org/abs/2502.19973)
Keywords: generation
Abstract: Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the necessity of combinatorial reasoning across multiple pieces of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models' poor performance on combinatorial reasoning benchmarks: even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA and drops to 7.38% on CPVQA. Notably, our approach improves the performance of models on combinatorial reasoning, with a 22.17% boost on CVQA and 9.40% on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.
摘要：在复杂的情况下，将多个感知输入和执行组合推理结合在一起是人类的复杂认知功能。随着多模式大语言模型的进步，最近的基准倾向于评估跨多个图像的视觉理解。但是，他们经常忽略跨多个感知信息的组合推理的必要性。为了探索高级模型在复杂场景中集成多个感知输入以进行组合推理的能力，我们介绍了两个基准：线索 - 视觉问题答案（CVQA），以及三种任务类型，以评估视觉理解和合成，以及密码 - 视觉询问答案的线索以及两种任务类型的范围，并应用了两个任务类型。对于我们的基准测试，我们提出了三种插件方法：利用模型输入来推理，通过随机性生成的最小余量解码来增强推理，并检索与语义相关的视觉信息以进行有效的数据集成。组合结果表明，当前模型在组合推理基准上的性能差，甚至最先进的（SOTA）闭合源模型在CVQA上仅达到33.04％的准确性，CPVQA的精度仅达到7.38％。值得注意的是，我们的方法提高了模型在组合推理上的性能，而在SOTA闭合源模型上，CVQA增长了22.17％，CPVQA上的CPVQA增强了9.40％，这表明了其在复杂场景中具有多个感知输入的组合推理方面的有效性。该代码将公开可用。

Title: Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation

Authors: Kang Liu, Zhuoqi Ma, Xiaolu Kang, Yunan Li, Kun Xie, Zhicheng Jiao, Qiguang Miao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20056
Pdf URL: https://arxiv.org/pdf/2502.20056
Copy Paste: [[2502.20056]] Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation(https://arxiv.org/abs/2502.20056)
Keywords: generation
Abstract: Automated radiology report generation offers an effective solution to alleviate radiologists' workload. However, most existing methods focus primarily on single or fixed-view images to model current disease conditions, which limits diagnostic accuracy and overlooks disease progression. Although some approaches utilize longitudinal data to track disease progression, they still rely on single images to analyze current visits. To address these issues, we propose enhanced contrastive learning with Multi-view Longitudinal data to facilitate chest X-ray Report Generation, named MLRG. Specifically, we introduce a multi-view longitudinal contrastive learning method that integrates spatial information from current multi-view images and temporal information from longitudinal data. This method also utilizes the inherent spatiotemporal information of radiology reports to supervise the pre-training of visual and textual representations. Subsequently, we present a tokenized absence encoding technique to flexibly handle missing patient-specific prior knowledge, allowing the model to produce more accurate radiology reports based on available prior knowledge. Extensive experiments on MIMIC-CXR, MIMIC-ABN, and Two-view CXR datasets demonstrate that our MLRG outperforms recent state-of-the-art methods, achieving a 2.3% BLEU-4 improvement on MIMIC-CXR, a 5.5% F1 score improvement on MIMIC-ABN, and a 2.7% F1 RadGraph improvement on Two-view CXR.
摘要：自动放射学报告一代提供了一种有效的解决方案来减轻放射科医生的工作量。但是，大多数现有方法主要集中于单个或固定视图图像，以模拟当前疾病状况，这限制了诊断准确性并忽略疾病进展。尽管某些方法利用纵向数据来跟踪疾病进展，但它们仍然依靠单个图像来分析当前的访问。为了解决这些问题，我们建议使用多视图纵向数据增强对比度学习，以促进胸部X射线报告的生成，名为MLRG。具体而言，我们引入了一种多视图纵向对比学习方法，该方法从当前的多视图图像和纵向数据中的时间信息中整合了空间信息。该方法还利用放射学报告的固有时空信息来监督视觉和文本表示的预培训。随后，我们提出了一种令牌化的编码技术，以灵活处理缺失的患者特定先验知识，从而使模型可以基于可用的先验知识生成更准确的放射学报告。 Extensive experiments on MIMIC-CXR, MIMIC-ABN, and Two-view CXR datasets demonstrate that our MLRG outperforms recent state-of-the-art methods, achieving a 2.3% BLEU-4 improvement on MIMIC-CXR, a 5.5% F1 score improvement on MIMIC-ABN, and a 2.7% F1 RadGraph improvement on Two-view CXR.

Title: A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation

Authors: Tianyang Qi, Shibo Chen, Jun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.20068
Pdf URL: https://arxiv.org/pdf/2502.20068
Copy Paste: [[2502.20068]] A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation(https://arxiv.org/abs/2502.20068)
Keywords: generative
Abstract: With the widespread adoption of electric vehicles (EVs), navigating EV drivers to a cost-effective charging station has become an important yet challenging issue due to dynamic traffic conditions, fluctuating electricity prices, and potential competition from other EVs. The state-of-the-art deep reinforcement learning (DRL) algorithms for solving this task still require global information about all EVs at the execution stage, which not only increases communication costs but also raises privacy issues among EV drivers. To overcome these drawbacks, we introduce a novel generative model-enhanced multi-agent DRL algorithm that utilizes only the EV's local information while achieving performance comparable to these state-of-the-art algorithms. Specifically, the policy network is implemented on the EV side, and a Conditional Variational Autoencoder-Long Short Term Memory (CVAE-LSTM)-based recommendation model is developed to provide recommendation information. Furthermore, a novel future charging competition encoder is designed to effectively compress global information, enhancing training performance. The multi-gradient descent algorithm (MGDA) is also utilized to adaptively balance the weight between the two parts of the training objective, resulting in a more stable training process. Simulations are conducted based on a practical area in Xi'an, China. Experimental results show that our proposed algorithm, which relies on local information, outperforms existing local information-based methods and achieves less than 8% performance loss compared to global information-based methods.
摘要：随着电动汽车（EVS）广泛采用的广泛采用，由于动态的交通状况，电力价格波动以及其他电动汽车的潜在竞争，导航EV驾驶员选择具有成本效益的充电站已成为一个重要但艰巨的问题。解决此任务的最新深入强化学习（DRL）算法仍然需要有关执行阶段所有电动汽车的全球信息，这不仅增加了通信成本，而且还会在EV驱动程序中引发隐私问题。为了克服这些缺点，我们引入了一种新颖的生成模型增强的多代理DRL算法，该算法仅利用EV的本地信息，同时实现与这些最先进的算法相当的性能。具体而言，策略网络是在电动汽车端实施的，并开发了有条件的变异自动编码器长期内存（CVAE-LSTM）建议模型，以提供建议信息。此外，一项新颖的未来充电竞赛编码器旨在有效地压缩全球信息，从而提高培训性能。多效率下降算法（MGDA）也用于适应训练目标的两个部分之间的重量，从而实现更稳定的训练过程。模拟是根据中国西安的实用领域进行的。实验结果表明，与基于全球信息的方法相比，我们所提出的算法依赖于本地信息，其表现优于现有的基于本地信息的方法，并且实现了低于8％的性能损失。

Title: VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

Authors: Ziang Guo, Konstantin Gubernatorov, Selamawit Asfaw, Zakhar Yagudin, Dzmitry Tsetserukou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2502.20108
Pdf URL: https://arxiv.org/pdf/2502.20108
Copy Paste: [[2502.20108]] VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers(https://arxiv.org/abs/2502.20108)
Keywords: generation
Abstract: In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, commencing with the representation of state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging advances in the state understanding of Visual Language Models (VLMs) together with diffusion Transformer-based action generation, our VDT-Auto parses the environment geometrically and contextually for the conditioning of the diffusion process. Geometrically, we use a bird's-eye view (BEV) encoder to extract feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During our diffusion process, the added noise for the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. Our VDT-Auto achieved an average L2 error of 0.52 m and an average collision rate of 21% in the nuScenes open-loop planning evaluation. Moreover, the real-world demonstration exhibited prominent generalizability of our VDT-Auto. The code and dataset will be released after acceptance.
摘要：在自主驾驶中，动态环境和角落案件对自我决策的鲁棒性构成了重大挑战。为了应对这些挑战，从端到端自主驾驶范式中的国家行动映射开始，我们引入了一条新颖的管道VDT-Auto。利用国家对视觉语言模型（VLM）的理解的进步，并结合了基于扩散变压器的动作生成，我们的VDT-AUTO在几何和上下文上以对扩散过程的调理来解析环境。从几何上讲，我们使用鸟眼视图（BEV）编码器从周围图像中提取特征网格。在上下文上，我们的微调VLM的结构化输出被处理为文本嵌入和嘈杂的路径。在我们的扩散过程中，向前过程的添加噪声是从微型VLM的嘈杂路径输出中取样的，而提取的BEV功能网格和嵌入式文本则调节了我们扩散变压器的反向过程。我们的VDT-Auto在Nuscenes开环计划评估中平均达到了0.52亿的L2错误，平均碰撞率为21％。此外，现实世界中的演示表现出我们的VDT-Auto的明显普遍性。接受后，代码和数据集将在接受后发布。

Title: FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Authors: Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.20126
Pdf URL: https://arxiv.org/pdf/2502.20126
Copy Paste: [[2502.20126]] FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute(https://arxiv.org/abs/2502.20126)
Keywords: generation
Abstract: Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.
摘要：尽管其性能出色，但在推断期间，现代扩散变压器受到了大量资源要求的阻碍，这是由于每个DeNoising步骤所需的固定和大量计算。在这项工作中，我们重新审视了传统的静态范式，该范围会分配固定的计算预算。我们的简单和样本效率的框架使预训练的DIT模型可以转换为\ emph {flexible}的模型 - 称为FlexIdit-允许它们以不同的计算预算处理输入。我们演示了单个\ emph {flexible}模型如何生成图像而无需任何质量下降，同时与静态对应物相比，所需的拖鞋降低了$ 40 $ \％，用于课堂条件和文本条件的图像生成。我们的方法对输入和调理方式是一般的，不可知。我们展示了如何轻易扩展到视频生成的方法，其中FlexIdit模型在不损害性能的情况下生成最高$ 75 $ \％的计算的样品。

Title: Adaptive H&E-IHC information fusion staining framework based on feature extra

Authors: Yifan Jia, Xingda Yu, Zhengyang Ji, Songning Lai, Yutao Yue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20156
Pdf URL: https://arxiv.org/pdf/2502.20156
Copy Paste: [[2502.20156]] Adaptive H&E-IHC information fusion staining framework based on feature extra(https://arxiv.org/abs/2502.20156)
Keywords: generative
Abstract: Immunohistochemistry (IHC) staining plays a significant role in the evaluation of diseases such as breast cancer. The H&E-to-IHC transformation based on generative models provides a simple and cost-effective method for obtaining IHC images. Although previous models can perform digital coloring well, they still suffer from two problems: (i) coloring relies only on pixel features that are not prominent in H&E, which easily causes information loss in the coloring process; (ii) the lack of pixel-perfect H&E-IHC ground-truth pairs poses a challenge to the classical L1 loss. To address the above challenges, we propose an adaptive information enhanced coloring framework based on feature extractors. We first propose the VMFE module to effectively extract the color information features using multi-scale feature extraction and wavelet transform convolution, while combining the shared decoder for feature fusion. The high-performance dual feature extractor of H&E-IHC is trained by contrastive learning, which can effectively perform feature alignment of H&E-IHC in high latitude space. At the same time, the trained feature encoder is used to enhance the features and adaptively adjust the loss in the H&E section staining process to solve the problems related to unclear and asymmetric information. We have tested on different datasets and achieved excellent results. The code is available at this https URL
摘要：免疫组织化学（IHC）染色在评估乳腺癌等疾病中起着重要作用。基于生成模型的H＆E-e-e-IHC转换提供了一种简单且具有成本效益的方法来获得IHC图像。尽管以前的型号可以很好地执行数字色彩，但它们仍然只能通过（i）通过HE中突出的像素功能来造成（i）着色，这很容易在着色过程中引起信息丢失；（ii）缺乏像素完美的H＆E-IHC地面图对经典L1提出了挑战，该HTTP URL解决了上述挑战，我们提出了一个基于功能提取器的自适应信息增强着色框架。我们首先建议使用多尺度特征提取和小波变换卷积有效地提取颜色信息功能，同时将共享解码器结合在一起以进行融合。 H＆E-IHC的高性能双特征提取器是通过对比度学习训练的，可以有效地在高纬度空间中对HE-IHC进行特征对齐。同时，训练有素的特征编码器用于增强特征并自适应调整HE截面染色过程中的损失，以解决与不清楚和不对称信息有关的问题。我们已经在不同的数据集上进行了测试，并在此HTTPS URL上获得了此HTTP URL代码的出色

Title: Similarity-Distance-Magnitude Universal Verification

Authors: Allen Schmaltz
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.20167
Pdf URL: https://arxiv.org/pdf/2502.20167
Copy Paste: [[2502.20167]] Similarity-Distance-Magnitude Universal Verification(https://arxiv.org/abs/2502.20167)
Keywords: generation
Abstract: We solve the neural network robustness problem by adding Similarity (i.e., correctly predicted depth-matches into training)-awareness and Distance-to-training-distribution-awareness to the existing output Magnitude (i.e., decision-boundary)-awareness of the softmax function. The resulting sdm activation function provides strong signals of the relative epistemic (reducible) predictive uncertainty. We use this novel behavior to further address the complementary HCI problem of mapping the output to human-interpretable summary statistics over relevant partitions of a held-out calibration set. Estimates of prediction-conditional uncertainty are obtained via a parsimonious learned transform over the class-conditional empirical CDFs of the output of a final-layer sdm activation function. For decision-making and as an intrinsic model check, estimates of class-conditional accuracy are obtained by further partitioning the high-probability regions of this calibrated output into class-conditional, region-specific CDFs. The uncertainty estimates from sdm calibration are remarkably robust to test-time distribution shifts and out-of-distribution inputs; incorporate awareness of the effective sample size; provide estimates of uncertainty from the learning and data splitting processes; and are well-suited for selective classification and conditional branching for additional test-time compute based on the predictive uncertainty, as for selective LLM generation, routing, and composition over multiple models and retrieval. Finally, we construct sdm networks, LLMs with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. We provide open-source software implementing these results.

Title: Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

Authors: Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2502.20172
Pdf URL: https://arxiv.org/pdf/2502.20172
Copy Paste: [[2502.20172]] Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think(https://arxiv.org/abs/2502.20172)
Keywords: generation
Abstract: The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.

Title: Attention Distillation: A Unified Approach to Visual Characteristics Transfer

Authors: Yang Zhou, Xu Gao, Zichong Chen, Hui Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20235
Pdf URL: https://arxiv.org/pdf/2502.20235
Copy Paste: [[2502.20235]] Attention Distillation: A Unified Approach to Visual Characteristics Transfer(https://arxiv.org/abs/2502.20235)
Keywords: generation, generative
Abstract: Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis. Code is available at this https URL.

Title: Do computer vision foundation models learn the low-level characteristics of the human visual system?

Authors: Yancheng Cai, Fei Yin, Dounia Hammou, Rafal Mantiuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20256
Pdf URL: https://arxiv.org/pdf/2502.20256
Copy Paste: [[2502.20256]] Do computer vision foundation models learn the low-level characteristics of the human visual system?(https://arxiv.org/abs/2502.20256)
Keywords: generative
Abstract: Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP) share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show lower sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks start to align with low-level human vision, with DINOv2 showing the closest resemblance.

Title: Mobius: Text to Seamless Looping Video Generation via Latent Shift

Authors: Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20307
Pdf URL: https://arxiv.org/pdf/2502.20307
Copy Paste: [[2502.20307]] Mobius: Text to Seamless Looping Video Generation via Latent Shift(https://arxiv.org/abs/2502.20307)
Keywords: generation
Abstract: We present Mobius, a novel method to generate seamlessly looping videos directly from text descriptions without any user annotations, thereby creating new visual materials for multi-media presentations. Our method repurposes a pre-trained video latent diffusion model to generate looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model's context. Unlike previous cinemagraph methods, the proposed method does not require an image as an appearance reference, which would restrict the motion of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.
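
The latent-shift idea in this abstract can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: the function name `latent_shift_denoise`, the scalar "latents", the `window` parameter, and the `denoise` callback are all hypothetical stand-ins for a real video diffusion model and its context length.

```python
from collections import deque
from typing import Callable, List


def latent_shift_denoise(
    latents: List[float],
    steps: int,
    window: int,
    denoise: Callable[[List[float]], List[float]],
) -> List[float]:
    """Toy latent-shift loop over a cyclic sequence of frame latents.

    Hypothetical sketch: each step denoises the first `window` latents,
    then moves the first-frame latent to the end, so the context window
    slides around the cycle and wraps from the last frame to the first.
    """
    if not 0 < window <= len(latents):
        raise ValueError("window must be between 1 and the number of latents")
    cycle = deque(latents)
    for _ in range(steps):
        context = [cycle[i] for i in range(window)]
        for i, value in enumerate(denoise(context)):
            cycle[i] = value
        cycle.rotate(-1)  # shift the first-frame latent to the end
    cycle.rotate(steps)  # undo the shifts so frames return in original order
    return list(cycle)


def halve(context: List[float]) -> List[float]:
    """Placeholder 'denoiser' that just shrinks each value."""
    return [value / 2 for value in context]


looped = latent_shift_denoise([8.0, 8.0, 8.0, 8.0], steps=4, window=2, denoise=halve)
```

In this toy setup the window wraps around the end of the sequence, so the last and first frames are processed together at some step, which is the property that would make the result loop.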

Title: FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

Authors: Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20313
Pdf URL: https://arxiv.org/pdf/2502.20313
Copy Paste: [[2502.20313]] FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction(https://arxiv.org/abs/2502.20313)
Keywords: generation
Abstract: This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images ($\leq$ 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256$\times$256 benchmark. Moreover, when the image generation process is transferred zero-shot to 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512$\times$512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512$\times$512 resolution.

Title: Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds

Authors: Mohamed Abdelsamad, Michael Ulrich, Claudius Gläser, Abhinav Valada
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2502.20316
Pdf URL: https://arxiv.org/pdf/2502.20316
Copy Paste: [[2502.20316]] Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds(https://arxiv.org/abs/2502.20316)
Keywords: generation, generative
Abstract: Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond. However, point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, thereby limiting the SSL pre-training to only 2D bird's eye view encoders in practice. In this work, we propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAEs are extremely flexible and can be directly employed for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results demonstrate that NOMAE sets the new state-of-the-art on multiple benchmarks for multiple point cloud perception tasks.

Title: UniTok: A Unified Tokenizer for Visual Generation and Understanding

Authors: Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.20321
Pdf URL: https://arxiv.org/pdf/2502.20321
Copy Paste: [[2502.20321]] UniTok: A Unified Tokenizer for Visual Generation and Understanding(https://arxiv.org/abs/2502.20321)
Keywords: generation
Abstract: The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives could induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization with several independent sub-codebooks to expand the latent feature space, while avoiding training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at this https URL.
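
As a rough illustration of the multi-codebook quantization described in this abstract, the sketch below splits a feature vector into equal chunks and quantizes each chunk against its own small sub-codebook. It is a minimal sketch under stated assumptions, not UniTok's code: the names `nearest_code` and `multi_codebook_quantize`, the equal-width split, and the squared-L2 nearest-neighbour lookup are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(chunk: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `chunk` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(chunk, codebook[i]))


def multi_codebook_quantize(
    vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize equal-width chunks of `vec`, one independent sub-codebook per chunk.

    Hypothetical sketch: with n sub-codebooks of size K, the combined code
    can take K**n values while each lookup only searches K entries.
    """
    n = len(codebooks)
    if n == 0 or len(vec) % n != 0:
        raise ValueError("vector length must be a multiple of the number of codebooks")
    width = len(vec) // n
    indices: List[int] = []
    quantized: List[float] = []
    for k, codebook in enumerate(codebooks):
        chunk = vec[k * width:(k + 1) * width]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Two toy sub-codebooks with two 2-D entries each (made-up values).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]
indices, quantized = multi_codebook_quantize([0.9, 0.8, 0.1, 0.7], codebooks)
```

Here two sub-codebooks of two entries each give four possible combined codes, which hints at how several small codebooks might enlarge the discrete latent space without a single very large codebook.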

Title: ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

Authors: Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20323
Pdf URL: https://arxiv.org/pdf/2502.20323
Copy Paste: [[2502.20323]] ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model(https://arxiv.org/abs/2502.20323)
Keywords: generation
Abstract: Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles using sample motion sequences, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.

Title: Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

Authors: Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20370
Pdf URL: https://arxiv.org/pdf/2502.20370
Copy Paste: [[2502.20370]] Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation(https://arxiv.org/abs/2502.20370)
Keywords: generation
Abstract: This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one's motions based on the counterpart's complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart's motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments.

Title: Constrained Generative Modeling with Manually Bridged Diffusion Models

Authors: Saeid Naderiparizi, Xiaoxuan Liang, Berend Zwartsenberg, Frank Wood
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.20371
Pdf URL: https://arxiv.org/pdf/2502.20371
Copy Paste: [[2502.20371]] Constrained Generative Modeling with Manually Bridged Diffusion Models(https://arxiv.org/abs/2502.20371)
Keywords: generative
Abstract: In this paper we describe a novel framework for diffusion-based generative modeling on constrained spaces. In particular, we introduce manual bridges, a framework that expands the kinds of constraints that can be practically used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects such multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanism in constrained generative modeling tasks, highlighting a particular high-value application in modeling trajectory initializations for path planning and control in autonomous vehicles.

Title: Multi-Turn Code Generation Through Single-Step Rewards

Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.20380
Pdf URL: https://arxiv.org/pdf/2502.20380
Copy Paste: [[2502.20380]] Multi-Turn Code Generation Through Single-Step Rewards(https://arxiv.org/abs/2502.20380)
Keywords: generation
Abstract: We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at this https URL.

Title: Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

Authors: Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2502.20383
Pdf URL: https://arxiv.org/pdf/2502.20383
Copy Paste: [[2502.20383]] Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis(https://arxiv.org/abs/2502.20383)
Keywords: generation
Abstract: Recent advancements in Web AI agents have demonstrated remarkable capabilities in addressing complex web navigation tasks. However, emerging research shows that these agents exhibit greater vulnerability compared to standalone Large Language Models (LLMs), despite both being built upon the same safety-aligned models. This discrepancy is particularly concerning given the greater flexibility of Web AI agents compared to standalone LLMs, which may expose them to a wider range of adversarial user inputs. To build a scaffold that addresses these concerns, this study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents. Notably, this disparity stems from the multifaceted differences between Web AI agents and standalone LLMs, as well as the complex signals - nuances that simple evaluation metrics, such as success rate, often fail to capture. To tackle these challenges, we propose a component-level analysis and a more granular, systematic evaluation framework. Through this fine-grained investigation, we identify three critical factors that amplify the vulnerability of Web AI agents: (1) embedding user goals into the system prompt, (2) multi-step action generation, and (3) observational capabilities. Our findings highlight the pressing need to enhance security and robustness in AI agent design and provide actionable insights for targeted defense strategies.

Title: Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.20388
Pdf URL: https://arxiv.org/pdf/2502.20388
Copy Paste: [[2502.20388]] Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation(https://arxiv.org/abs/2502.20388)
Keywords: generation, generative
Abstract: Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.

Title: InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Authors: Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2502.20390
Pdf URL: https://arxiv.org/pdf/2502.20390
Copy Paste: [[2502.20390]] InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions(https://arxiv.org/abs/2502.20390)
Keywords: generative
Abstract: Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.