2025-03-18

Title: A Survey of Direct Preference Optimization

Authors: Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11701
Pdf URL: https://arxiv.org/pdf/2503.11701
Copy Paste: [[2503.11701]] A Survey of Direct Preference Optimization(https://arxiv.org/abs/2503.11701)
Keywords: generative
Abstract: Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at this https URL.
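For orientation, the per-pair DPO objective that the survey builds on can be sketched as follows. This is a minimal illustration of the standard DPO formulation, not code from the survey; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the frozen reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy that raised the chosen response's log-probability above the
# reference gets a lower loss than one that raised the rejected response's.
good = dpo_loss(-1.0, -5.0, -2.0, -2.0)
bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

No separate reward model appears anywhere in the computation, which is the simplification over RLHF that the abstract describes.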

Title: Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Authors: Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11720
Pdf URL: https://arxiv.org/pdf/2503.11720
Copy Paste: [[2503.11720]] Fine-Tuning Diffusion Generative Models via Rich Preference Optimization(https://arxiv.org/abs/2503.11720)
Keywords: generative
Abstract: We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images to extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.

Title: SPECTra: Scalable Multi-Agent Reinforcement Learning with Permutation-Free Networks

Authors: Hyunwoo Park, Baekryun Seong, Sang-Ki Ko
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11726
Pdf URL: https://arxiv.org/pdf/2503.11726
Copy Paste: [[2503.11726]] SPECTra: Scalable Multi-Agent Reinforcement Learning with Permutation-Free Networks(https://arxiv.org/abs/2503.11726)
Keywords: generation
Abstract: In cooperative multi-agent reinforcement learning (MARL), the permutation problem where the state space grows exponentially with the number of agents reduces sample efficiency. Additionally, many existing architectures struggle with scalability, relying on a fixed structure tied to a specific number of agents, limiting their applicability to environments with a variable number of entities. While approaches such as graph neural networks (GNNs) and self-attention mechanisms have progressed in addressing these challenges, they have significant limitations as dense GNNs and self-attention mechanisms incur high computational costs. To overcome these limitations, we propose a novel agent network and a non-linear mixing network that ensure permutation-equivariance and scalability, allowing them to generalize to environments with various numbers of agents. Our agent network significantly reduces computational complexity, and our scalable hypernetwork enables efficient weight generation for non-linear mixing. Additionally, we introduce curriculum learning to improve training efficiency. Experiments on SMACv2 and Google Research Football (GRF) demonstrate that our approach achieves superior learning performance compared to existing methods. By addressing both permutation-invariance and scalability in MARL, our work provides a more efficient and adaptable framework for cooperative MARL. Our code is available at this https URL.

Title: BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction

Authors: Zekai Zhang, Dan Li, Shunyu Wu, Junya Cai, Bo Zhang, See Kiong Ng, Zibin Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11730
Pdf URL: https://arxiv.org/pdf/2503.11730
Copy Paste: [[2503.11730]] BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction(https://arxiv.org/abs/2503.11730)
Keywords: generative
Abstract: Prognostics and Health Management (PHM) is crucial for avoiding unnecessary maintenance of Cyber-Physical Systems (CPS) and improving system reliability. Predicting the Remaining Useful Life (RUL) is one of the most challenging tasks for PHM. Existing methods require prior knowledge about the system, contrived assumptions, or temporal mining to model the life cycles of machine equipment/devices, resulting in diminished accuracy and limited applicability in real-world scenarios. This paper proposes a Bi-directional Adversarial network with Covariate Encoding for machine Remaining Useful Life (BACE-RUL) prediction, which only adopts sensor measurements from the current life cycle to predict RUL rather than relying on previous consecutive cycle recordings. The current sensor measurements of mechanical devices are encoded to a conditional space to better understand the implicit inner mechanical status. The predictor is trained as a conditional generative network with the encoded sensor measurements as its conditions. Various experiments on several real-world datasets, including the turbofan aircraft engine dataset and the dataset collected from degradation experiments of Li-Ion battery cells, show that the proposed model is a general framework and outperforms state-of-the-art methods.

Title: ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement

Authors: Samuel W. Remedios, Shuwen Wei, Shuo Han, Jinwei Zhang, Aaron Carass, Kurt G. Schilling, Dzung L. Pham, Jerry L. Prince, Blake E. Dewey
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11787
Pdf URL: https://arxiv.org/pdf/2503.11787
Copy Paste: [[2503.11787]] ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement(https://arxiv.org/abs/2503.11787)
Keywords: super-resolution
Abstract: In clinical imaging, magnetic resonance (MR) image volumes are often acquired as stacks of 2D slices, permitting decreased scan times, improved signal-to-noise ratio, and image contrasts unique to 2D MR pulse sequences. While this is sufficient for clinical evaluation, automated algorithms designed for 3D analysis perform sub-optimally on 2D-acquired scans, especially those with thick slices and gaps between slices. Super-resolution (SR) methods aim to address this problem, but previous methods do not address all of the following: slice profile shape estimation, slice gap, domain shift, and non-integer / arbitrary upsampling factors. In this paper, we propose ECLARE (Efficient Cross-planar Learning for Anisotropic Resolution Enhancement), a self-SR method that addresses each of these factors. ECLARE estimates the slice profile from the 2D-acquired multi-slice MR volume, trains a network to learn the mapping from low-resolution to high-resolution in-plane patches from the same volume, and performs SR with anti-aliasing. We compared ECLARE to cubic B-spline interpolation, SMORE, and other contemporary SR methods. We used realistic and representative simulations so that quantitative performance against a ground truth could be computed, and ECLARE outperformed all other methods in both signal recovery and downstream tasks. On real data for which there is no ground truth, ECLARE demonstrated qualitative superiority over other methods as well. Importantly, as ECLARE does not use external training data it cannot suffer from domain shift between training and testing. Our code is open-source and available at this https URL.

Title: StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model

Authors: Peizhi Yan, Rabab K. Ward, Dan Wang, Qiang Tang, Shan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11792
Pdf URL: https://arxiv.org/pdf/2503.11792
Copy Paste: [[2503.11792]] StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model(https://arxiv.org/abs/2503.11792)
Keywords: generative
Abstract: For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM's disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: this https URL.

Title: Towards a Unified Copernicus Foundation Model for Earth Vision

Authors: Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11849
Pdf URL: https://arxiv.org/pdf/2503.11849
Copy Paste: [[2503.11849]] Towards a Unified Copernicus Foundation Model for Earth Vision(https://arxiv.org/abs/2503.11849)
Keywords: generation
Abstract: Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Codes, datasets and models are available at this https URL.

Title: Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction

Authors: Da Long, Shandian Zhe, Samuel Williams, Leonid Oliker, Zhe Bai
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.11899
Pdf URL: https://arxiv.org/pdf/2503.11899
Copy Paste: [[2503.11899]] Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction(https://arxiv.org/abs/2503.11899)
Keywords: generative
Abstract: Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes. Neural operators have emerged as promising models for predicting such dynamics due to their flexibility and computational efficiency. However, they often fail to effectively capture multi-scale interactions or quantify the uncertainties inherent in the predictions. These limitations lead to rapid error accumulation, particularly in long-term forecasting of systems characterized by complex and coupled dynamics. To address these challenges, we propose a spatio-temporal Fourier transformer (StFT), in which each transformer block is designed to learn dynamics at a specific scale. By leveraging a structured hierarchy of StFT blocks, the model explicitly captures dynamics across both macro- and micro- spatial scales. Furthermore, a generative residual correction mechanism is integrated to estimate and mitigate predictive uncertainties, enhancing both the accuracy and reliability of long-term forecasts. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.

Title: Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities

Authors: Ruchika Chavhan, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Ramos, Luca Morreale, Mehdi Noroozi, Sourav Bhattacharya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11905
Pdf URL: https://arxiv.org/pdf/2503.11905
Copy Paste: [[2503.11905]] Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities(https://arxiv.org/abs/2503.11905)
Keywords: super-resolution, generation
Abstract: Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adopt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.
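The idea of replacing one FFN with several smaller expert FFNs combined by a router can be sketched roughly as below. This is only an illustration under stated assumptions: the abstract does not specify the routing rule, so the input-dependent softmax gate, the ReLU experts, and all names and sizes (`routed_experts`, `w_router`, `dim`, `hidden`, `num_experts`) are hypothetical choices for this example, not the MTU implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu_ffn(x, w_in, w_out):
    """A small two-layer feed-forward expert with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def routed_experts(x, experts, w_router):
    """Mix expert outputs with gates computed from the input (assumed rule)."""
    gates = softmax(matvec(w_router, x))
    out = [0.0] * len(x)
    for gate, (w_in, w_out) in zip(gates, experts):
        y = relu_ffn(x, w_in, w_out)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, gates


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden, num_experts = 4, 2, 3  # hypothetical sizes
experts = [(random_matrix(hidden, dim, rng), random_matrix(dim, hidden, rng))
           for _ in range(num_experts)]
w_router = random_matrix(num_experts, dim, rng)
output, gates = routed_experts([1.0, -0.5, 0.25, 2.0], experts, w_router)
```

Each expert here has a narrower hidden layer than a single full FFN would, which is the sense in which the experts are "smaller"; how MTU sizes its experts to keep GFLOPs comparable is not stated in the abstract.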

Title: Generating a Biometrically Unique and Realistic Iris Database

Authors: Jingxuan Zhang, Robert J. Hart, Ziqian Bi, Shiaofen Fang, Susan Walsh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11930
Pdf URL: https://arxiv.org/pdf/2503.11930
Copy Paste: [[2503.11930]] Generating a Biometrically Unique and Realistic Iris Database(https://arxiv.org/abs/2503.11930)
Keywords: generation
Abstract: The use of the iris as a biometric identifier has increased dramatically over the last 30 years, prompting privacy and security concerns about the use of iris images in research. It can be difficult to acquire iris image databases due to ethical concerns, and this can be a barrier for those performing biometrics research. In this paper, we describe and show how to create a database of realistic, biometrically unidentifiable colored iris images by training a diffusion model within an open-source diffusion framework. Not only were we able to verify that our model is capable of creating iris textures that are biometrically unique from the training data, but we were also able to verify that our model output creates a full distribution of realistic iris pigmentations. We highlight that the relative ease with which diffusion networks meet these criteria warrants additional research into their use within the context of iris database generation and presentation attack security.

Title: CHOrD: Generation of Collision-Free, House-Scale, and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts

Authors: Chong Su, Yingbin Fu, Zheyuan Hu, Jing Yang, Param Hanji, Shaojun Wang, Xuan Zhao, Cengiz Öztireli, Fangcheng Zhong
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11958
Pdf URL: https://arxiv.org/pdf/2503.11958
Copy Paste: [[2503.11958]] CHOrD: Generation of Collision-Free, House-Scale, and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts(https://arxiv.org/abs/2503.11958)
Keywords: generation
Abstract: We introduce CHOrD, a novel framework for scalable synthesis of 3D indoor scenes, designed to create house-scale, collision-free, and hierarchically structured indoor digital twins. In contrast to existing methods that directly synthesize the scene layout as a scene graph or object list, CHOrD incorporates a 2D image-based intermediate layout representation, enabling effective prevention of collision artifacts by successfully capturing them as out-of-distribution (OOD) scenarios during generation. Furthermore, unlike existing methods, CHOrD is capable of generating scene layouts that adhere to complex floor plans with multi-modal controls, enabling the creation of coherent, house-wide layouts robust to both geometric and semantic variations in room structures. Additionally, we propose a novel dataset with expanded coverage of household items and room configurations, as well as significantly improved data quality. CHOrD demonstrates state-of-the-art performance on both the 3D-FRONT and our proposed datasets, delivering photorealistic, spatially coherent indoor scene synthesis adaptable to arbitrary floor plan variations.

Title: DecompDreamer: Advancing Structured 3D Asset Generation with Multi-Object Decomposition and Gaussian Splatting

Authors: Utkarsh Nath, Rajeev Goel, Rahul Khurana, Kyle Min, Mark Ollila, Pavan Turaga, Varun Jampani, Tejaswi Gowda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11981
Pdf URL: https://arxiv.org/pdf/2503.11981
Copy Paste: [[2503.11981]] DecompDreamer: Advancing Structured 3D Asset Generation with Multi-Object Decomposition and Gaussian Splatting(https://arxiv.org/abs/2503.11981)
Keywords: generation
Abstract: Text-to-3D generation saw dramatic advances in recent years by leveraging Text-to-Image models. However, most existing techniques struggle with compositional prompts, which describe multiple objects and their spatial relationships. They often fail to capture fine-grained inter-object interactions. We introduce DecompDreamer, a Gaussian splatting-based training routine designed to generate high-quality 3D compositions from such complex prompts. DecompDreamer leverages Vision-Language Models (VLMs) to decompose scenes into structured components and their relationships. We propose a progressive optimization strategy that first prioritizes joint relationship modeling before gradually shifting toward targeted object refinement. Our qualitative and quantitative evaluations against state-of-the-art text-to-3D models demonstrate that DecompDreamer effectively generates intricate 3D compositions with superior object disentanglement, offering enhanced control and flexibility in 3D generation. Project page: this https URL

Title: QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution

Authors: Donglin Yang, Paul Vicol, Xiaojuan Qi, Renjie Liao, Xiaofan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12015
Pdf URL: https://arxiv.org/pdf/2503.12015
Copy Paste: [[2503.12015]] QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.12015)
Keywords: super-resolution
Abstract: Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions, represented by leaf nodes, where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at this https URL.
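How a quadtree built from the low-quality input can yield a refinement mask may be easier to see in a small sketch. The code below is an illustration only: the splitting criterion (intensity range above a threshold), the square power-of-two image, and the names `build_quadtree` and `refinement_mask` are assumptions for this example and are not taken from the QDM code.

```python
def build_quadtree(img, x0, y0, size, threshold, min_size=1):
    """Return leaf blocks (x0, y0, size) covering a square region.

    A block is split while its intensity range exceeds `threshold`
    (an assumed criterion) and it is larger than `min_size`.
    """
    values = [img[y][x]
              for y in range(y0, y0 + size)
              for x in range(x0, x0 + size)]
    if size <= min_size or max(values) - min(values) <= threshold:
        return [(x0, y0, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += build_quadtree(img, x0 + dx, y0 + dy,
                                     half, threshold, min_size)
    return leaves


def refinement_mask(img, threshold, detail_size=1):
    """Mark pixels that fall in small leaves, i.e. detail-rich regions."""
    n = len(img)
    mask = [[0] * n for _ in range(n)]
    for x0, y0, size in build_quadtree(img, 0, 0, n, threshold):
        if size <= detail_size:
            for y in range(y0, y0 + size):
                for x in range(x0, x0 + size):
                    mask[y][x] = 1
    return mask


# A flat 4x4 image with one bright pixel in the bottom-right quadrant.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]
leaves = build_quadtree(image, 0, 0, len(image), threshold=1)
mask = refinement_mask(image, threshold=1)
```

Only the quadrant containing the bright pixel is subdivided down to single pixels, so a model guided by this mask would spend its fine-detail computation there and treat the three flat quadrants as single coarse blocks.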

Title: Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art

Authors: Zhe Jin, Tat-Seng Chua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12018
Pdf URL: https://arxiv.org/pdf/2503.12018
Copy Paste: [[2503.12018]] Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art(https://arxiv.org/abs/2503.12018)
Keywords: generation
Abstract: Text-to-Image (T2I) diffusion models (DM) have garnered widespread adoption due to their capability in generating high-fidelity outputs and accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization and we propose the novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset building on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferrable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.

Title: SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

Authors: Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12024
Pdf URL: https://arxiv.org/pdf/2503.12024
Copy Paste: [[2503.12024]] SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering(https://arxiv.org/abs/2503.12024)
Keywords: generation
Abstract: Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.
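One generic way to "tilt" a sampling distribution toward higher reward at inference time is reward-weighted resampling of candidates. The sketch below shows that generic idea only; it is not the SteerX algorithm, and the function `steer_by_resampling`, the `temperature` parameter, and the toy scalar reward are all hypothetical stand-ins for the paper's geometric reward functions.

```python
import math
import random


def steer_by_resampling(candidates, reward_fn, temperature=1.0, rng=None):
    """Resample candidates with probability proportional to exp(reward / T).

    Lower temperatures concentrate the resampled set on high-reward
    candidates; the base generator itself is left untouched (zero-shot).
    """
    rng = rng or random.Random(0)
    rewards = [reward_fn(c) for c in candidates]
    best = max(rewards)
    # Subtract the maximum reward before exponentiating to avoid overflow.
    weights = [math.exp((r - best) / temperature) for r in rewards]
    return rng.choices(candidates, weights=weights, k=len(candidates))


def toy_reward(value):
    """Toy stand-in for a geometric consistency score: closer to 1.0 is better."""
    return -(value - 1.0) ** 2


samples = [0.0, 0.5, 1.0]
steered = steer_by_resampling(samples, toy_reward, temperature=1e-3)
```

In a real pipeline the candidates would be intermediate or final generations and the reward would come from a scene reconstruction model, as the abstract indicates; those components are outside the scope of this sketch.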

Title: Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System

Authors: Zhiyao Sun, Yu-Hui Wen, Matthieu Lin, Ho-Jui Fang, Sheng Ye, Tian Lv, Yong-Jin Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.12052
Pdf URL: https://arxiv.org/pdf/2503.12052
Copy Paste: [[2503.12052]] Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System(https://arxiv.org/abs/2503.12052)
Keywords: generation, generative
Abstract: Creating detailed 3D human avatars with garments typically requires specialized expertise and labor-intensive processes. Although recent advances in generative AI have enabled text-to-3D human/clothing generation, current methods fall short in offering accessible, integrated pipelines for producing ready-to-use clothed avatars. To solve this, we introduce Tailor, an integrated text-to-avatar system that generates high-fidelity, customizable 3D humans with simulation-ready garments. Our system includes a three-stage pipeline. We first employ a large language model to interpret textual descriptions into parameterized body shapes and semantically matched garment templates. Next, we develop topology-preserving deformation with novel geometric losses to adapt garments precisely to body geometries. Furthermore, an enhanced texture diffusion module with a symmetric local attention mechanism ensures both view consistency and photorealistic details. Quantitative and qualitative evaluations demonstrate that Tailor outperforms existing SoTA methods in terms of fidelity, usability, and diversity. Code will be available for academic use.

Title: Robust Dataset Distillation by Matching Adversarial Trajectories

Authors: Wei Lai, Tianyu Ding, ren dongdong, Lei Wang, Jing Huo, Yang Gao, Wenbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12069
Pdf URL: https://arxiv.org/pdf/2503.12069
Copy Paste: [[2503.12069]] Robust Dataset Distillation by Matching Adversarial Trajectories(https://arxiv.org/abs/2503.12069)
Keywords: generation
Abstract: Dataset distillation synthesizes compact datasets that enable models to achieve performance comparable to training on the original large-scale datasets. However, existing distillation methods overlook the robustness of the model, resulting in models that are vulnerable to adversarial attacks when trained on distilled data. To address this limitation, we introduce the task of "robust dataset distillation", a novel paradigm that embeds adversarial robustness into the synthetic datasets during the distillation process. We propose Matching Adversarial Trajectories (MAT), a method that integrates adversarial training into trajectory-based dataset distillation. MAT incorporates adversarial samples during trajectory generation to obtain robust training trajectories, which are then used to guide the distillation process. As experimentally demonstrated, even through natural training on our distilled dataset, models can achieve enhanced adversarial robustness while maintaining competitive accuracy compared to existing distillation methods. Our work highlights robust dataset distillation as a new and important research direction and provides a strong baseline for future research to bridge the gap between efficient training and adversarial robustness.

Title: E-SAM: Training-Free Segment Every Entity Model

Authors: Weiming Zhang, Dingwen Xiao, Lei Chen, Lin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12094
Pdf URL: https://arxiv.org/pdf/2503.12094
Copy Paste: [[2503.12094]] E-SAM: Training-Free Segment Every Entity Model(https://arxiv.org/abs/2503.12094)
Keywords: generation
Abstract: Entity Segmentation (ES) aims at identifying and segmenting distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications with adaptation to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods require either large annotated datasets or high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG) that hierarchically processes SAM's AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks. That is, it separates overlapping masks to address the redundancy issues inherent in SAM's outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized to achieve the best ES without additional training overhead. Extensive experiments show that E-SAM achieves state-of-the-art performance compared to prior ES methods, with a significant improvement of +30.1 on benchmark metrics.

Title: A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI

Authors: Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Fangxu Xing, Xiaofeng Liu, Maureen Stone, Jiachen Zhuo, Juan Rafael Orozco-Arroyave, Elmar Nöth, Jana Hutter, Jerry L. Prince, Andreas Maier, Jonghye Woo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12102
Pdf URL: https://arxiv.org/pdf/2503.12102
Copy Paste: [[2503.12102]] A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI(https://arxiv.org/abs/2503.12102)
Keywords: generation
Abstract: Understanding the relationship between vocal tract motion during speech and the resulting acoustic signal is crucial for aiding clinical assessment and developing personalized treatment and rehabilitation strategies. Toward this goal, we introduce an audio-to-video generation framework for creating Real Time/cine-Magnetic Resonance Imaging (RT-/cine-MRI) visuals of the vocal tract from speech signals. Our framework first preprocesses RT-/cine-MRI sequences and speech samples to achieve temporal alignment, ensuring synchronization between visual and audio data. We then employ a modified stable diffusion model, integrating structural and temporal blocks, to effectively capture movement characteristics and temporal dynamics in the synchronized data. This process enables the generation of MRI sequences from new speech inputs, improving the conversion of audio into visual data. We evaluated our framework on healthy controls and tongue cancer patients by analyzing and comparing the vocal tract movements in synthesized videos. Our framework demonstrated adaptability to new speech inputs and effective generalization. In addition, positive human evaluations confirmed its effectiveness, with realistic and accurate visualizations, suggesting its potential for outpatient therapy and personalized simulation of vocal tract visualizations.

Title: Z-Magic: Zero-shot Multiple Attributes Guided Image Creator

Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12124
Pdf URL: https://arxiv.org/pdf/2503.12124
Copy Paste: [[2503.12124]] Z-Magic: Zero-shot Multiple Attributes Guided Image Creator(https://arxiv.org/abs/2503.12124)
Keywords: generation
Abstract: The customization of multiple attributes has gained popularity with the rising demand for personalized content creation. Despite promising empirical results, the contextual coherence between different attributes has been largely overlooked. In this paper, we argue that subsequent attributes should follow the multivariable conditional distribution introduced by former attribute creation. In light of this, we reformulate multi-attribute creation from a conditional probability theory perspective and tackle the challenging zero-shot setting. By explicitly modeling the dependencies between attributes, we further enhance the coherence of generated images across diverse attribute combinations. Furthermore, we identify connections between multi-attribute customization and multi-task learning, effectively addressing the high computing cost encountered in multi-attribute synthesis. Extensive experiments demonstrate that Z-Magic outperforms existing models in zero-shot image generation, with broad implications for AI-driven design and creative applications.

Title: DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

Authors: Shentong Mo, Zehua Chen, Fan Bao, Jun Zhu
Subjects: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.12131
Pdf URL: https://arxiv.org/pdf/2503.12131
Copy Paste: [[2503.12131]] DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap(https://arxiv.org/abs/2503.12131)
Keywords: generation, generative
Abstract: Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noises present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, our DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.
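
The "single contrastive loss" that the abstract above attributes to CLAP- and CAVP-style models can be pictured with a CLIP-style symmetric InfoNCE objective. The Python sketch below is a minimal illustration under that assumption; the exact losses used by CLAP, CAVP, or DiffGAP are not specified here, and the function names and the `temperature` default are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # stabilize the log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(a: Sequence[Vector], b: Sequence[Vector],
                               temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch in which a[i] is paired with b[i]."""
    a_n = [_normalize(v) for v in a]
    b_n = [_normalize(v) for v in b]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b_n]
              for u in a_n]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]  # toy audio embeddings
    text = [[0.9, 0.1], [0.2, 0.8]]   # toy text embeddings, paired by index
    print(symmetric_contrastive_loss(audio, text))
```

The loss is small when each embedding is most similar to its own partner and large when partners are mismatched, which is the alignment behaviour the abstract refers to.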

Title: Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs

Authors: Milan Papež, Martin Rektoris, Václav Šmídl, Tomáš Pevný
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12162
Pdf URL: https://arxiv.org/pdf/2503.12162
Copy Paste: [[2503.12162]] Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs(https://arxiv.org/abs/2503.12162)
Keywords: generation, generative
Abstract: Deep generative models (DGMs) have recently demonstrated remarkable success in capturing complex probability distributions over graphs. Although their excellent performance is attributed to powerful and scalable deep neural networks, it is, at the same time, exactly the presence of these highly non-linear transformations that makes DGMs intractable. Indeed, despite representing probability distributions, intractable DGMs deny probabilistic foundations by their inability to answer even the most basic inference queries without approximations or design choices specific to a very narrow range of queries. To address this limitation, we propose probabilistic graph circuits (PGCs), a framework of tractable DGMs that provide exact and efficient probabilistic inference over (arbitrary parts of) graphs. Nonetheless, achieving both exactness and efficiency is challenging in the permutation-invariant setting of graphs. We design PGCs that are inherently invariant and satisfy these two requirements, yet at the cost of low expressive power. Therefore, we investigate two alternative strategies to achieve the invariance: the first sacrifices the efficiency, and the second sacrifices the exactness. We demonstrate that ignoring the permutation invariance can have severe consequences in anomaly detection, and that the latter approach is competitive with, and sometimes better than, existing intractable DGMs in the context of molecular graph generation.

Title: SEAL: Semantic Aware Image Watermarking

Authors: Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.12172
Pdf URL: https://arxiv.org/pdf/2503.12172
Copy Paste: [[2503.12172]] SEAL: Semantic Aware Image Watermarking(https://arxiv.org/abs/2503.12172)
Keywords: generative
Abstract: Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method's increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.
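
The abstract above says the key pattern can be inferred from the image's semantic embedding using locality-sensitive hashing. The Python sketch below shows one common LSH family, random-hyperplane (sign) hashing, which maps nearby embeddings to similar bit strings. Whether SEAL uses this particular family is an assumption, and the function names, dimensions, and seed are hypothetical.

```python
import random
from typing import List, Sequence


def make_hyperplanes(dim: int, n_bits: int, seed: int) -> List[List[float]]:
    """Draw n_bits random Gaussian hyperplane normals, reproducibly from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bits(embedding: Sequence[float], hyperplanes: Sequence[Sequence[float]]) -> str:
    """Hash an embedding to a bit string: one bit per hyperplane, set by the side it falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(e * p for e, p in zip(embedding, plane))
        bits.append("1" if dot >= 0.0 else "0")
    return "".join(bits)


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=16, seed=0)
    embedding = [0.5, -1.0, 2.0, 0.1]  # toy semantic embedding
    print(lsh_bits(embedding, planes))
```

Because each bit depends only on the sign of a projection, rescaling an embedding leaves the hash unchanged and small perturbations flip few bits, so a verifier holding the same seed could recompute the hash from the image content instead of searching a key database.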

Title: LAPIG: Language Guided Projector Image Generation with Surface Adaptation and Stylization

Authors: Yuchen Deng, Haibin Ling, Bingyao Huang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.12173
Pdf URL: https://arxiv.org/pdf/2503.12173
Copy Paste: [[2503.12173]] LAPIG: Language Guided Projector Image Generation with Surface Adaptation and Stylization(https://arxiv.org/abs/2503.12173)
Keywords: generation
Abstract: We propose LAPIG, a language guided projector image generation method with surface adaptation and stylization. LAPIG consists of a projector-camera system and a target textured projection surface. LAPIG takes the user text prompt as input and aims to transform the surface style using the projector. LAPIG's key challenge is that due to the projector's physical brightness limitation and the surface texture, the viewer's perceived projection may suffer from color saturation and artifacts in both dark and bright regions, such that even with the state-of-the-art projector compensation techniques, the viewer may see clear surface texture-related artifacts. Therefore, how to generate a projector image that follows the user's instruction while also displaying minimal surface artifacts is an open problem. To address this issue, we propose projection surface adaptation (PSA) that can generate compensable surface stylization. We first train two networks to simulate the projector compensation and project-and-capture processes; this allows us to find a satisfactory projector image without real project-and-capture and to utilize gradient descent for fast convergence. Then, we design content and saturation losses to guide the projector image generation, such that the generated image shows no clearly perceivable artifacts when projected. Finally, the generated image is projected for visually pleasing surface style morphing effects. The source code and video are available on the project page: this https URL.

Title: STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation

Authors: Ruyu Wang, Xuefeng Hou, Sabrina Schmedding, Marco F. Huber
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12213
Pdf URL: https://arxiv.org/pdf/2503.12213
Copy Paste: [[2503.12213]] STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation(https://arxiv.org/abs/2503.12213)
Keywords: generation
Abstract: In layout-to-image (L2I) synthesis, controlled complex scenes are generated from coarse information like bounding boxes. Such a task is appealing for many downstream applications because the input layouts offer strong guidance to the generation process while remaining easily reconfigurable by humans. In this paper, we propose STyled LAYout Diffusion (STAY Diffusion), a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation using a novel Edge-Aware Normalization (EA Norm). A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image feature for capturing the objects' relationships. These measures provide consistent guidance through the model, enabling more accurate and controllable image generation. Extensive benchmarking demonstrates that our STAY Diffusion presents high-quality images while surpassing previous state-of-the-art methods in generation diversity, accuracy, and controllability.

Title: Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment

Authors: Sharmita Dey, Sarath Ravindran Nair
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2503.12214
Pdf URL: https://arxiv.org/pdf/2503.12214
Copy Paste: [[2503.12214]] Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment(https://arxiv.org/abs/2503.12214)
Keywords: generation
Abstract: We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.

Title: Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12271
Pdf URL: https://arxiv.org/pdf/2503.12271
Copy Paste: [[2503.12271]] Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection(https://arxiv.org/abs/2503.12271)
Keywords: generation
Abstract: The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.
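
The best-of-N baseline described in the abstract above (generate several images per prompt, then let a selection model pick the best) can be sketched in a few lines of Python. The generator and scorer here are toy stand-ins introduced for illustration; a real system would call a diffusion model and a learned selection model, and none of these names come from the Reflect-DiT paper.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str, str], float],
              n: int) -> Tuple[str, float]:
    """Generate n candidates for a prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(prompt: str, seed: int) -> str:
    """Hypothetical generator: returns a string tag instead of an image."""
    rng = random.Random(seed)
    return f"{prompt}#{rng.randint(0, 999)}"


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical selection model: scores by the number after '#', scaled to [0, 1)."""
    return int(candidate.rsplit("#", 1)[1]) / 1000.0


if __name__ == "__main__":
    image, quality = best_of_n("a red cube on a blue sphere",
                               toy_generate, toy_score, n=20)
    print(image, quality)
```

Each candidate here is drawn independently, so the cost grows linearly with `n` and no information flows between samples. That is the limitation the abstract contrasts with refining generations from feedback on earlier ones.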

Title: Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs

Authors: Xiaoying Zhang, Da Peng, Yipeng Zhang, Zonghao Guo, Chengyue Wu, Chi Chen, Wei Ke, Helen Meng, Maosong Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12303
Pdf URL: https://arxiv.org/pdf/2503.12303
Copy Paste: [[2503.12303]] Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs(https://arxiv.org/abs/2503.12303)
Keywords: generation
Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) face challenges with fine-grained perception and complex reasoning. Prevalent pre-training approaches focus on enhancing perception by training on high-quality image captions due to the extremely high cost of collecting chain-of-thought (CoT) reasoning data for improving reasoning. While leveraging advanced MLLMs for caption generation enhances scalability, the outputs often lack comprehensiveness and accuracy. In this paper, we introduce Self-Improving Cognition (SIcog), a self-learning framework designed to construct next-generation foundation MLLMs by enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we propose chain-of-description, an approach that improves an MLLM's systematic perception by enabling step-by-step visual understanding, ensuring greater comprehensiveness and accuracy. Additionally, we adopt a structured CoT reasoning technique to enable MLLMs to integrate in-depth multimodal reasoning. To construct a next-generation foundation MLLM with self-improved cognition, SIcog first equips an MLLM with systematic perception and reasoning abilities using minimal external annotations. The enhanced models then generate detailed captions and CoT reasoning data, which are further curated through self-consistency. This curated data is ultimately used to refine the MLLM during multimodal pre-training, facilitating next-generation foundation MLLM construction. Extensive experiments on both low- and high-resolution MLLMs across diverse benchmarks demonstrate that, with merely 213K self-generated pre-training samples, SIcog produces next-generation foundation MLLMs with significantly improved cognition, achieving benchmark-leading performance compared to prevalent pre-training approaches.

Title: ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions

Authors: Wenqing Kuang (1), Xiongwei Zhao (2), Yehui Shen (1), Congcong Wen (3), Huimin Lu (1), Zongtan Zhou (1), Xieyuanli Chen (1) ((1) National University of Defense Technology, (2) Harbin Institute of Technology, (3) New York University Abu Dhabi)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12350
Pdf URL: https://arxiv.org/pdf/2503.12350
Copy Paste: [[2503.12350]] ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions(https://arxiv.org/abs/2503.12350)
Keywords: restoration
Abstract: LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: this https URL.

Title: Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation

Authors: Byung Hyun Lee, Sungjin Lim, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12356
Pdf URL: https://arxiv.org/pdf/2503.12356
Copy Paste: [[2503.12356]] Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation(https://arxiv.org/abs/2503.12356)
Keywords: generation
Abstract: Fine-tuning-based concept erasing has demonstrated promising results in preventing text-to-image diffusion models from generating harmful content by removing target concepts while preserving the remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it appears locally in an image, leaving other regions intact. However, prior methods often compromise the fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the region of the target concepts, even when target and remaining concepts coexist within an image. Extensive experiments demonstrate that GLoCE not only improves the image fidelity to text prompts after erasing the localized target concepts, but also outperforms prior methods in efficacy, specificity, and robustness by a large margin and can be extended to mass concept erasure.

Title: VRsketch2Gaussian: 3D VR Sketch Guided 3D Object Generation with Gaussian Splatting

Authors: Songen Gu, Haoxuan Song, Binjie Liu, Qian Yu, Sanyi Zhang, Haiyong Jiang, Jin Huang, Feng Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12383
Pdf URL: https://arxiv.org/pdf/2503.12383
Copy Paste: [[2503.12383]] VRsketch2Gaussian: 3D VR Sketch Guided 3D Object Generation with Gaussian Splatting(https://arxiv.org/abs/2503.12383)
Keywords: generation
Abstract: We propose VRsketch2Gaussian, the first VR sketch-guided, multi-modal, native 3D object generation framework that incorporates a 3D Gaussian Splatting representation. As part of our work, we introduce VRSS, the first large-scale paired dataset containing VR sketches, text, images, and 3DGS, bridging the gap in multi-modal VR sketch-based generation. Our approach features the following key innovations: 1) Sketch-CLIP feature alignment. We propose a two-stage alignment strategy that bridges the domain gap between sparse VR sketch embeddings and rich CLIP embeddings, facilitating both VR sketch-based retrieval and generation tasks. 2) Fine-grained multi-modal conditioning. We disentangle the 3D generation process by using explicit VR sketches for geometric conditioning and text descriptions for appearance control. To facilitate this, we propose a generalizable VR sketch encoder that effectively aligns different modalities. 3) Efficient and high-fidelity 3D native generation. Our method leverages a 3D-native generation approach that enables fast and texture-rich 3D object synthesis. Experiments conducted on our VRSS dataset demonstrate that our method achieves high-quality, multi-modal VR sketch-based 3D generation. We believe our VRSS dataset and VRsketch2Gaussian method will be beneficial for the 3D generation community.

Title: Pathology Image Restoration via Mixture of Prompts

Authors: Jiangdong Cai, Yan Chen, Zhenrong Shen, Haotian Jiang, Honglin Xiong, Kai Xuan, Lichi Zhang, Qian Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12399
Pdf URL: https://arxiv.org/pdf/2503.12399
Copy Paste: [[2503.12399]] Pathology Image Restoration via Mixture of Prompts(https://arxiv.org/abs/2503.12399)
Keywords: restoration
Abstract: In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and a highly efficient clinical workflow. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent image restoration techniques provide a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their strengths in preserving image fidelity and perceptual quality, respectively. In particular, we propose a novel mixture of prompts for the two-stage solution. Given an initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from a pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, implying high potential of the mixture of prompts for clinical usage. Code will be publicly available at this https URL.

Title: MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification

Authors: Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Yang Zhao, Hong Cheng, Huazhu Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12401
Pdf URL: https://arxiv.org/pdf/2503.12401
Copy Paste: [[2503.12401]] MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification(https://arxiv.org/abs/2503.12401)
Keywords: generative
Abstract: Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. MExD is validated on three widely used benchmarks (Camelyon16, TCGA-NSCLC, and BRACS), consistently achieving state-of-the-art performance in both binary and multi-class tasks.

Title: LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

Authors: Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12450
Pdf URL: https://arxiv.org/pdf/2503.12450
Copy Paste: [[2503.12450]] LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching(https://arxiv.org/abs/2503.12450)
Keywords: generation
Abstract: Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy. Token Redundancy indicates that a large portion of tokens have very similar representations in adjacent decoding steps, which allows us to first cache them in previous steps and then reuse them in later steps. Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83 times acceleration with almost no drop in generation quality. Our code will be released at this https URL.
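The two redundancies described in this abstract can be illustrated with a minimal sketch. This is not LazyMAR's implementation: the function names (cosine, lazy_step, guided_output), the cosine-similarity test on token inputs, and the 0.95 threshold are all assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lazy_step(cache, inputs, expensive_fn, threshold=0.95):
    """Token redundancy (illustrative): reuse a token's cached output when its
    input is nearly unchanged since the step at which it was cached.

    cache maps token index -> {"input": ..., "output": ...} and is updated in place.
    Returns (outputs, indices_of_recomputed_tokens).
    """
    outputs, recomputed = [], []
    for i, x in enumerate(inputs):
        entry = cache.get(i)
        if entry is not None and cosine(entry["input"], x) >= threshold:
            outputs.append(entry["output"])  # similar enough: reuse cached result
        else:
            y = expensive_fn(x)              # changed too much: recompute and re-cache
            cache[i] = {"input": x, "output": y}
            outputs.append(y)
            recomputed.append(i)
    return outputs, recomputed


def guided_output(cond, cached_delta, scale):
    """Condition redundancy (illustrative): classifier-free guidance computes
    uncond + scale * (cond - uncond). If delta = cond - uncond is reused from an
    adjacent step, uncond is approximated by cond - delta, which gives
    cond + (scale - 1) * delta without a second forward pass.
    """
    return [c + (scale - 1.0) * d for c, d in zip(cond, cached_delta)]
```

In this sketch the saving comes from skipping expensive_fn for tokens that pass the similarity test and from skipping the unconditional pass on steps where the cached difference is reused; how the actual method selects tokens and steps is not specified in the abstract.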

Title: DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement

Authors: Han Mei, Kunqian Li, Shuaixin Liu, Chengzhi Ma, Qianli Jiang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12470
Pdf URL: https://arxiv.org/pdf/2503.12470
Copy Paste: [[2503.12470]] DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement(https://arxiv.org/abs/2503.12470)
Keywords: restoration
Abstract: Due to the complex interplay of light absorption and scattering in the underwater environment, underwater images experience significant degradation. This research presents a two-stage underwater image enhancement network called the Data-Driven and Physical Parameters Fusion Network (DPF-Net), which harnesses the robustness of physical imaging models alongside the generality and efficiency of data-driven methods. We first train a physical parameter estimate module using synthetic datasets to guarantee the trustworthiness of the physical parameters, rather than solely learning the fitting relationship between raw and reference images by the application of the imaging equation, as is common in prior studies. This module is subsequently trained in conjunction with an enhancement network, where the estimated physical parameters are integrated into a data-driven model within the embedding space. To maintain the uniformity of the restoration process amid underwater imaging degradation, we propose a physics-based degradation consistency loss. Additionally, we suggest an innovative weak reference loss term utilizing the entire dataset, which alleviates our model's reliance on the quality of individual reference images. Our proposed DPF-Net demonstrates superior performance compared to other benchmark methods across multiple test sets, achieving state-of-the-art results. The source code and pre-trained models are available on the project home page: this https URL.

Title: Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification

Authors: Wenbo Dai, Lijing Lu, Zhihang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12472
Pdf URL: https://arxiv.org/pdf/2503.12472
Copy Paste: [[2503.12472]] Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2503.12472)
Keywords: generation
Abstract: The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-consuming, and costly, and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtains massive identity-preserving RGB-IR paired images by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning Stable Diffusion (SD) on modality-specific data. DiVE extends text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about $9\%$ in mAP over the baseline on the LLCM dataset. Code: this https URL

Title: BS-Mamba for Black-Soil Area Detection On the Qinghai-Tibetan Plateau

Authors: Xuan Ma, Zewen Lv, Chengcai Ma, Tao Zhang, Yuelan Xin, Kun Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12495
Pdf URL: https://arxiv.org/pdf/2503.12495
Copy Paste: [[2503.12495]] BS-Mamba for Black-Soil Area Detection On the Qinghai-Tibetan Plateau(https://arxiv.org/abs/2503.12495)
Keywords: restoration
Abstract: Extremely degraded grassland on the Qinghai-Tibetan Plateau (QTP) presents a significant environmental challenge due to overgrazing, climate change, and rodent activity, which degrade vegetation cover and soil quality. These extremely degraded grasslands on the QTP, commonly referred to as black-soil areas, require accurate assessment to guide effective restoration efforts. In this paper, we present a newly created QTP black-soil dataset, annotated under expert guidance. We introduce a novel neural network model, BS-Mamba, specifically designed for black-soil area detection using UAV remote sensing imagery. The BS-Mamba model demonstrates higher accuracy in identifying black-soil areas across two independent test datasets than state-of-the-art models. This research contributes to grassland restoration by providing an efficient method for assessing the extent of black-soil areas on the QTP.

Title: Segment Any-Quality Images with Generative Latent Space Enhancement

Authors: Guangqian Guo, Yoong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12507
Pdf URL: https://arxiv.org/pdf/2503.12507
Copy Paste: [[2503.12507]] Segment Any-Quality Images with Generative Latent Space Enhancement(https://arxiv.org/abs/2503.12507)
Keywords: generative
Abstract: Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset.

Title: EditID: Training-Free Editable ID Customization for Text-to-Image Generation

Authors: Guandong Li, Zhaobin Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12526
Pdf URL: https://arxiv.org/pdf/2503.12526
Copy Paste: [[2503.12526]] EditID: Training-Free Editable ID Customization for Text-to-Image Generation(https://arxiv.org/abs/2503.12526)
Keywords: generation
Abstract: We propose EditID, a training-free approach based on the DiT architecture, which achieves highly editable customized IDs for text-to-image generation. Existing text-to-image models for customized IDs typically focus more on ID consistency while neglecting editability. It is challenging to alter facial orientation, character attributes, and other features through prompts. EditID addresses this by deconstructing the text-to-image model for customized IDs into an image generation branch and a character feature branch. The character feature branch is further decoupled into three modules: feature extraction, feature fusion, and feature integration. By introducing a combination of mapping features and shift features, along with controlling the intensity of ID feature integration, EditID achieves semantic compression of local features across network depths, forming an editable feature space. This enables the generation of high-quality images with editable IDs while maintaining ID consistency. EditID achieves excellent results on IBench, an editability evaluation framework for customized-ID text-to-image generation, which quantitatively demonstrates its superior performance. EditID is the first text-to-image solution to propose customizable ID editability on the DiT architecture, meeting the demands of long prompts and high-quality image generation.

Title: Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks

Authors: Mehmet Kerem Turkcan, Mattia Ballo, Filippo Filicori, Zoran Kostic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12531
Pdf URL: https://arxiv.org/pdf/2503.12531
Copy Paste: [[2503.12531]] Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks(https://arxiv.org/abs/2503.12531)
Keywords: generative
Abstract: We introduce specialized diffusion-based generative models that capture the spatiotemporal dynamics of fine-grained robotic surgical sub-stitch actions through supervised learning on annotated laparoscopic surgery footage. The proposed models form a foundation for data-driven world models capable of simulating the biomechanical interactions and procedural dynamics of surgical suturing with high temporal fidelity. Annotating a dataset of $\sim2K$ clips extracted from simulation videos, we categorize surgical actions into fine-grained sub-stitch classes including ideal and non-ideal executions of needle positioning, targeting, driving, and withdrawal. We fine-tune two state-of-the-art video diffusion models, LTX-Video and HunyuanVideo, to generate high-fidelity surgical action sequences at $\ge$768x512 resolution and $\ge$49 frames. For training our models, we explore both Low-Rank Adaptation (LoRA) and full-model fine-tuning approaches. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training simulators, surgical skill assessment tools, and autonomous surgical systems. The models also display the capability to differentiate between ideal and non-ideal technique execution, providing a foundation for building surgical training and evaluation systems. We release our models for testing and as a foundation for future research. Project Page: this https URL

Title: SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs

Authors: Guibiao Liao, Qing Li, Zhenyu Bao, Guoping Qiu, Kanglin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12535
Pdf URL: https://arxiv.org/pdf/2503.12535
Copy Paste: [[2503.12535]] SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs(https://arxiv.org/abs/2503.12535)
Keywords: generation
Abstract: 3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To relieve these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from the video generation model and view-constraint Gaussian points densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints developed by SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.
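For readers less familiar with PSNR, the reported 3.06 dB gain can be restated in terms of mean squared error. Because PSNR = 10 * log10(MAX^2 / MSE), a gain of g dB at a fixed peak value corresponds to dividing the MSE by 10^(g/10). The helper below is an illustrative calculation only; its name is hypothetical, and the conversion is exact for a single image rather than for a PSNR averaged over a benchmark.

```python
def mse_reduction_factor(psnr_gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB at a fixed peak value."""
    return 10.0 ** (psnr_gain_db / 10.0)


# A 3.06 dB gain corresponds to roughly halving the per-image MSE.
factor = mse_reduction_factor(3.06)
```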

Title: Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model

Authors: Lin-Chun Huang, Ching Chieh Tsao, Fang-Yi Su, Jung-Hsien Chiang
Subjects: cs.LG, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.12536
Pdf URL: https://arxiv.org/pdf/2503.12536
Copy Paste: [[2503.12536]] Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model(https://arxiv.org/abs/2503.12536)
Keywords: generative
Abstract: Image generative models, particularly diffusion-based models, have surged in popularity due to their remarkable ability to synthesize highly realistic images. However, since these models are data-driven, they inherit biases from the training datasets, frequently leading to disproportionate group representations that exacerbate societal inequities. Traditionally, efforts to debias these models have relied on predefined sensitive attributes, classifiers trained on such attributes, or large language models to steer outputs toward fairness. However, these approaches face notable drawbacks: predefined attributes do not adequately capture complex and continuous variations among groups. To address these issues, we introduce the Debiasing Diffusion Model (DDM), which leverages an indicator to learn latent representations during training, promoting fairness through balanced representations without requiring predefined sensitive attributes. This approach not only demonstrates its effectiveness in scenarios previously addressed by conventional techniques but also enhances fairness without relying on predefined sensitive attributes as conditions. In this paper, we discuss the limitations of prior bias mitigation techniques in diffusion-based models, elaborate on the architecture of the DDM, and validate the effectiveness of our approach through experiments.

Title: Diffusion on Graph: Augmentation of Graph Structure for Node Classification

Authors: Yancheng Wang, Changyu Liu, Yingzhen Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12563
Pdf URL: https://arxiv.org/pdf/2503.12563
Copy Paste: [[2503.12563]] Diffusion on Graph: Augmentation of Graph Structure for Node Classification(https://arxiv.org/abs/2503.12563)
Keywords: generation
Abstract: Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at this https URL.

Title: GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack

Authors: Abyad Enan, Mashrur Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12567
Pdf URL: https://arxiv.org/pdf/2503.12567
Copy Paste: [[2503.12567]] GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack(https://arxiv.org/abs/2503.12567)
Keywords: generative
Abstract: Computer Vision plays a critical role in ensuring the safe navigation of autonomous vehicles (AVs). An AV perception module is responsible for capturing and interpreting the surrounding environment to facilitate safe navigation. This module enables AVs to recognize traffic signs, traffic lights, and various road users. However, the perception module is vulnerable to adversarial attacks, which can compromise its accuracy and reliability. One such attack is the adversarial patch attack (APA), a physical attack in which an adversary strategically places a specially crafted sticker on an object to deceive object classifiers. In APA, an adversarial patch is positioned on a target object, leading the classifier to misidentify it. Such an APA can cause AVs to misclassify traffic signs, leading to catastrophic incidents. To enhance the security of an AV perception system against APAs, this study develops a Generative Adversarial Network (GAN)-based single-stage defense strategy for traffic sign classification. This approach is tailored to defend against APAs on different classes of traffic signs without prior knowledge of a patch's design. This study found this approach to be effective against patches of varying sizes. Our experimental analysis demonstrates that the defense strategy presented in this paper improves the classifier's accuracy under APA conditions by up to 80.8% and enhances overall classification accuracy for all the traffic signs considered in this study by 58%, compared to a classifier without any defense mechanism. Our defense strategy is model-agnostic, making it applicable to any traffic sign classifier, regardless of the underlying classification model.

Title: Personalize Anything for Free with Diffusion Transformer

Authors: Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12590
Pdf URL: https://arxiv.org/pdf/2503.12590
Copy Paste: [[2503.12590]] Personalize Anything for Free with Diffusion Transformer(https://arxiv.org/abs/2503.12590)
Keywords: generation
Abstract: Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose Personalize Anything, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.
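
The token-replacement idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `replace_subject_tokens`, the `early_fraction` threshold, and the convention that `step` counts up from 0 at the start of denoising are all assumptions made for illustration, and the late-stage regularization the abstract mentions is not modeled (late steps simply leave the tokens untouched).

```python
import numpy as np


def replace_subject_tokens(denoise_tokens, reference_tokens, subject_mask,
                           step, num_steps, early_fraction=0.8):
    """Hypothetical sketch of timestep-dependent token replacement.

    denoise_tokens, reference_tokens: arrays of shape (num_tokens, dim).
    subject_mask: boolean array of shape (num_tokens,), True on subject tokens.
    step: current denoising step, 0 at the start (an assumed convention).
    early_fraction: assumed threshold separating early from late steps.
    """
    if denoise_tokens.shape != reference_tokens.shape:
        raise ValueError("token arrays must have the same shape")
    out = denoise_tokens.copy()  # never modify the caller's array
    if step < early_fraction * num_steps:
        # Early stage: inject the reference subject's tokens in the masked region.
        out[subject_mask] = reference_tokens[subject_mask]
    # Late stage: tokens are left as-is in this sketch.
    return out
```

Calling it with `step=0` copies the reference tokens into the masked positions, while a step past the threshold returns the denoising tokens unchanged.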

Title: SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Authors: Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Andrew Gritsevskiy, Teresa Head-Gordon
Subjects: cs.LG, physics.bio-ph
Abstract URL: https://arxiv.org/abs/2503.12602
Pdf URL: https://arxiv.org/pdf/2503.12602
Copy Paste: [[2503.12602]] SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models(https://arxiv.org/abs/2503.12602)
Keywords: generation, generative
Abstract: Generative machine learning models for small molecule drug discovery have shown immense promise, but many molecules generated by this approach are too difficult to synthesize to be worth further investigation or further development. We present a novel approach by fine-tuning Meta's Llama3 large language models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible Enamine building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data compared to other state-of-the-art methods, and offers strong performance in bottom-up synthesis, synthesizable analog generation, and hit expansion, offering medicinal chemists a valuable tool for drug discovery developments. We find that SynLlama can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data.

Title: Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Authors: Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12605
Pdf URL: https://arxiv.org/pdf/2503.12605
Copy Paste: [[2503.12605]] Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey(https://arxiv.org/abs/2503.12605)
Keywords: generation
Abstract: By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further attention for the field to keep progressing, and an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.

Title: LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

Authors: Alessio Spagnoletti, Jean Prost, Andrés Almansa, Nicolas Papadakis, Marcelo Pereyra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12615
Pdf URL: https://arxiv.org/pdf/2503.12615
Copy Paste: [[2503.12615]] LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization(https://arxiv.org/abs/2503.12615)
Keywords: generative
Abstract: Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency.

Title: UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Authors: Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12652
Pdf URL: https://arxiv.org/pdf/2503.12652
Copy Paste: [[2503.12652]] UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing(https://arxiv.org/abs/2503.12652)
Keywords: generation
Abstract: Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

Title: Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding

Authors: Imran Kabir, Md Alimoor Reza, Syed Billah
Subjects: cs.CV, cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12663
Pdf URL: https://arxiv.org/pdf/2503.12663
Copy Paste: [[2503.12663]] Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding(https://arxiv.org/abs/2503.12663)
Keywords: generation
Abstract: Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at this https URL.

Title: Can LLMs Formally Reason as Abstract Interpreters for Program Analysis?

Authors: Jacqueline L. Mitchell, Brian Hyeongseok Kim, Chenyu Zhou, Chao Wang
Subjects: cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2503.12686
Pdf URL: https://arxiv.org/pdf/2503.12686
Copy Paste: [[2503.12686]] Can LLMs Formally Reason as Abstract Interpreters for Program Analysis?(https://arxiv.org/abs/2503.12686)
Keywords: generation
Abstract: LLMs have demonstrated impressive capabilities in code generation and comprehension, but their potential in being able to perform program analysis in a formal, automatic manner remains under-explored. To that end, we systematically investigate whether LLMs can reason about programs using a program analysis framework called abstract interpretation. We prompt LLMs to follow two different strategies, denoted as Compositional and Fixed Point Equation, to formally reason in the style of abstract interpretation, which has never been done before to the best of our knowledge. We validate our approach using state-of-the-art LLMs on 22 challenging benchmark programs from the Software Verification Competition (SV-COMP) 2019 dataset, widely used in program analysis. Our results show that our strategies are able to elicit abstract interpretation-based reasoning in the tested models, but LLMs are susceptible to logical errors, especially while interpreting complex program structures, as well as general hallucinations. This highlights key areas for improvement in the formal reasoning capabilities of LLMs.
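
For readers unfamiliar with abstract interpretation, the following is a minimal sketch of the kind of reasoning the abstract refers to, using the classic interval domain. It does not reproduce the paper's prompts or benchmarks. The toy program (`x = init; while x < bound: x = x + 1`), the `Interval` class, and all function names are assumptions chosen for illustration. The loop-head invariant is the solution of the fixed-point equation `head = join([init, init], body(head))`, found here by plain iteration without widening, which terminates only because the toy loop is bounded.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Interval:
    """Abstract value: all integers between lo and hi, inclusive."""
    lo: int
    hi: int


State = Optional[Interval]  # None stands for "unreachable" (bottom)


def join(a: State, b: State) -> State:
    """Least upper bound of two abstract values."""
    if a is None:
        return b
    if b is None:
        return a
    return Interval(min(a.lo, b.lo), max(a.hi, b.hi))


def assume_less_than(a: State, bound: int) -> State:
    """Restrict to values satisfying x < bound (integer semantics)."""
    if a is None or a.lo >= bound:
        return None
    return Interval(a.lo, min(a.hi, bound - 1))


def assume_at_least(a: State, bound: int) -> State:
    """Restrict to values satisfying x >= bound."""
    if a is None or a.hi < bound:
        return None
    return Interval(max(a.lo, bound), a.hi)


def add_const(a: State, c: int) -> State:
    """Abstract transformer for x = x + c."""
    return None if a is None else Interval(a.lo + c, a.hi + c)


def analyze_counter_loop(init: int = 0, bound: int = 10,
                         max_iters: int = 1000) -> Tuple[State, State]:
    """Analyze `x = init; while x < bound: x = x + 1`.

    Returns (invariant at the loop head, abstract state at loop exit).
    """
    head: State = None
    for _ in range(max_iters):
        body_out = add_const(assume_less_than(head, bound), 1)
        new_head = join(Interval(init, init), body_out)
        if new_head == head:  # fixed point reached
            break
        head = new_head
    return head, assume_at_least(head, bound)
```

With the defaults, the iteration stabilizes at the loop-head invariant `Interval(0, 10)` and the exit state `Interval(10, 10)`. A production analyzer would add widening so that unbounded loops also converge.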

Title: MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Authors: Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, Deng Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12689
Pdf URL: https://arxiv.org/pdf/2503.12689
Copy Paste: [[2503.12689]] MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization(https://arxiv.org/abs/2503.12689)
Keywords: generation
Abstract: Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce MagicID, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.

Title: GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching

Authors: Feng Qiao, Zhexiao Xiong, Eric Xing, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12720
Pdf URL: https://arxiv.org/pdf/2503.12720
Copy Paste: [[2503.12720]] GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching(https://arxiv.org/abs/2503.12720)
Keywords: generation
Abstract: Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations: (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation, making it valuable for both real-world applications and unsupervised learning scenarios. Project page is available at this https URL.

Title: TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Authors: Philip Quirke, Clement Neo, Abir Harrasse, Dhruv Nathawani, Amir Abdullah
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.12730
Pdf URL: https://arxiv.org/pdf/2503.12730
Copy Paste: [[2503.12730]] TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research(https://arxiv.org/abs/2503.12730)
Keywords: generation
Abstract: Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including edge attribution patching and sparse autoencoders, to identify minimal circuits and components supporting SQL generation. Our analysis reveals both the potential and limitations of current interpretability methods, showing how circuits can vary even across similar queries. Lastly, we demonstrate how mechanistic interpretability can identify flawed heuristics in models and improve synthetic dataset design. Our work provides a comprehensive framework for evaluating and advancing interpretability techniques while establishing clear boundaries for their reliable application.

Title: VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

Authors: Zhifeng Wang, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12758
Pdf URL: https://arxiv.org/pdf/2503.12758
Copy Paste: [[2503.12758]] VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis(https://arxiv.org/abs/2503.12758)
Keywords: generation, generative
Abstract: Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure which is harmful to patients with health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs, by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle with maintaining continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiography for multiple modalities and anatomical regions.

Title: A Survey on Human Interaction Motion Generation

Authors: Kewei Sui, Anindita Ghosh, Inwoo Hwang, Jian Wang, Chuan Guo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12763
Pdf URL: https://arxiv.org/pdf/2503.12763
Copy Paste: [[2503.12763]] A Survey on Human Interaction Motion Generation(https://arxiv.org/abs/2503.12763)
Keywords: generation, generative
Abstract: Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.

Title: Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion

Authors: Yidi Liu, Dong Li, Yuxin Ma, Jie Huang, Wenlong Zhang, Xueyang Fu, Zheng-jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12764
Pdf URL: https://arxiv.org/pdf/2503.12764
Copy Paste: [[2503.12764]] Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion(https://arxiv.org/abs/2503.12764)
Keywords: restoration
Abstract: Ultra-high-definition (UHD) image restoration often faces computational bottlenecks and information loss due to its extremely high resolution. Existing studies based on Variational Autoencoders (VAE) improve efficiency by transferring the image restoration process from pixel space to latent space. However, because degraded components are inherently coupled with background elements in degraded images, both information loss during compression and information gain during compensation remain uncontrollable. As a result, restored images often exhibit detail loss and incomplete degradation removal. To address this issue, we propose a Controlled Differential Disentangled VAE, which utilizes Hierarchical Contrastive Disentanglement Learning and an Orthogonal Gated Projection Module to guide the VAE to actively discard easily recoverable background information while encoding more difficult-to-recover degraded information into the latent space. Additionally, we design a Complex Invertible Multiscale Fusion Network to handle background features, ensuring their consistency, and utilize a latent space restoration network to transform the degraded latent features, leading to more accurate restoration results. Extensive experimental results demonstrate that our method effectively alleviates the information loss problem in VAE models while ensuring computational efficiency, significantly improving the quality of UHD image restoration, and achieves state-of-the-art results in six UHD restoration tasks with only 1M parameters.

Title: TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image

Authors: Haoxiao Wang, Kaichen Zhou, Binrui Gu, Zhiyuan Feng, Weijie Wang, Peilin Sun, Yicheng Xiao, Jianhua Zhang, Hao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12779
Pdf URL: https://arxiv.org/pdf/2503.12779
Copy Paste: [[2503.12779]] TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image(https://arxiv.org/abs/2503.12779)
Keywords: generation
Abstract: Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop settings. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilize an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real-world benchmarks with acceptable inference time. The demo of our method can be found at this https URL.

Title: Improving Generalization of Universal Adversarial Perturbation via Dynamic Maximin Optimization

Authors: Yechao Zhang, Yingzhe Xu, Junyu Shi, Leo Yu Zhang, Shengshan Hu, Minghui Li, Yanjun Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.12793
Pdf URL: https://arxiv.org/pdf/2503.12793
Copy Paste: [[2503.12793]] Improving Generalization of Universal Adversarial Perturbation via Dynamic Maximin Optimization(https://arxiv.org/abs/2503.12793)
Keywords: generation
Abstract: Deep neural networks (DNNs) are susceptible to universal adversarial perturbations (UAPs). These perturbations are meticulously designed to fool the target model universally across all sample classes. Unlike instance-specific adversarial examples (AEs), generating UAPs is more complex because they must be generalized across a wide range of data samples and models. Our research reveals that existing universal attack methods, which optimize UAPs using DNNs with static model parameter snapshots, do not fully leverage the potential of DNNs to generate more effective UAPs. Rather than optimizing UAPs against static DNN models with a fixed training set, we suggest using dynamic model-data pairs to generate UAPs. In particular, we introduce a dynamic maximin optimization strategy, aiming to optimize the UAP across a variety of optimal model-data pairs. We term this approach DM-UAP. DM-UAP utilizes an iterative max-min-min optimization framework that refines the model-data pairs, coupled with a curriculum UAP learning algorithm to examine the combined space of model parameters and data thoroughly. Comprehensive experiments on the ImageNet dataset demonstrate that the proposed DM-UAP markedly enhances both cross-sample universality and cross-model transferability of UAPs. Using only 500 samples for UAP generation, DM-UAP outperforms the state-of-the-art approach with an average increase in fooling ratio of 12.108%.
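
The fooling ratio reported above is commonly defined as the fraction of inputs whose predicted label changes once the single universal perturbation is added. The sketch below computes that quantity with NumPy. The toy sign-of-sum classifier, the sample inputs, and the perturbation are hypothetical stand-ins for illustration, not the paper's ImageNet setup or the DM-UAP optimization, and the norm constraints and pixel clipping used in practice are omitted.

```python
import numpy as np


def fooling_ratio(predict, inputs, perturbation):
    """Fraction of samples whose predicted label changes under one shared perturbation.

    predict: callable mapping an array of shape (n, d) to n integer labels.
    inputs: array of shape (n, d).
    perturbation: array of shape (d,), broadcast over every sample.
    """
    clean_labels = predict(inputs)
    perturbed_labels = predict(inputs + perturbation)
    return float(np.mean(clean_labels != perturbed_labels))


def toy_classifier(x):
    """Hypothetical classifier: label 1 if the feature sum is positive, else 0."""
    return (x.sum(axis=1) > 0).astype(int)


samples = np.array([[1.0, 1.0], [-1.0, -1.0], [0.1, 0.1], [3.0, 3.0]])
universal_delta = np.array([-0.5, -0.5])
ratio = fooling_ratio(toy_classifier, samples, universal_delta)
```

Here only the third sample crosses the decision boundary, so `ratio` is 0.25. Cross-model transferability can be estimated the same way by passing a different model's `predict` function with the same perturbation.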

Title: A Reinforcement Learning-Driven Transformer GAN for Molecular Generation

Authors: Chen Li, Huidong Tang, Ye Zhu, Yoshihiro Yamanishi
Subjects: cs.LG, cs.CL, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2503.12796
Pdf URL: https://arxiv.org/pdf/2503.12796
Copy Paste: [[2503.12796]] A Reinforcement Learning-Driven Transformer GAN for Molecular Generation(https://arxiv.org/abs/2503.12796)
Keywords: generation, generative
Abstract: Generating molecules with desired chemical properties presents a critical challenge in fields such as chemical synthesis and drug discovery. Recent advancements in artificial intelligence (AI) and deep learning have significantly contributed to data-driven molecular generation. However, challenges persist due to the inherent sensitivity of simplified molecular input line entry system (SMILES) representations and the difficulties in applying generative adversarial networks (GANs) to discrete data. This study introduces RL-MolGAN, a novel Transformer-based discrete GAN framework designed to address these challenges. Unlike traditional Transformer architectures, RL-MolGAN utilizes a first-decoder-then-encoder structure, facilitating the generation of drug-like molecules from both de novo and scaffold-based designs. In addition, RL-MolGAN integrates reinforcement learning (RL) and Monte Carlo tree search (MCTS) techniques to enhance the stability of GAN training and optimize the chemical properties of the generated molecules. To further improve the model's performance, RL-MolWGAN, an extension of RL-MolGAN, incorporates Wasserstein distance and mini-batch discrimination, which together enhance the stability of the GAN. Experimental results on two widely used molecular datasets, QM9 and ZINC, validate the effectiveness of our models in generating high-quality molecular structures with diverse and desirable chemical properties.

Title: From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

Authors: Mingyang Song, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12821
Pdf URL: https://arxiv.org/pdf/2503.12821
Copy Paste: [[2503.12821]] From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration(https://arxiv.org/abs/2503.12821)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, LVLMs (e.g., LLaVA) and more general tasks (e.g., Visual Question Answering and Visual Reasoning) remain under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an Adaptive Data Refinement Framework (ADR), which consists of two stages: Data Rebalancing (DR) and Data Synthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 relatively by 4.36%, without increasing the training data volume.

Title: PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

Authors: Seunggwan Lee, Hwanhee Jung, Byoungsoo Koh, Qixing Huang, Sangho Yoon, Sangpil Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12834
Pdf URL: https://arxiv.org/pdf/2503.12834
Copy Paste: [[2503.12834]] PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior(https://arxiv.org/abs/2503.12834)
Keywords: generation
Abstract: A fundamental challenge in conditional 3D shape generation is to minimize the information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, the flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.

Title: DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode

Authors: Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu, Zhao Wang, Yitong Wang, Xinglong Wu, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12838
Pdf URL: https://arxiv.org/pdf/2503.12838
Copy Paste: [[2503.12838]] DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode(https://arxiv.org/abs/2503.12838)
Keywords: generation
Abstract: Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.

Title: GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Authors: Junhyeok Kim, Jaewoo Park, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12844
Pdf URL: https://arxiv.org/pdf/2503.12844
Copy Paste: [[2503.12844]] GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance(https://arxiv.org/abs/2503.12844)
Keywords: generation
Abstract: Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.

Title: UniReg: Foundation Model for Controllable Medical Image Registration

Authors: Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Dakai Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12868
Pdf URL: https://arxiv.org/pdf/2503.12868
Copy Paste: [[2503.12868]] UniReg: Foundation Model for Controllable Medical Image Registration(https://arxiv.org/abs/2503.12868)
Keywords: generation
Abstract: Learning-based medical image registration has achieved performance parity with conventional methods while demonstrating a substantial advantage in computational efficiency. However, learning-based registration approaches lack generalizability across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or organ-specific alignment. To overcome this limitation, we propose UniReg, the first interactive foundation model for medical image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified framework for diverse registration scenarios, achieved through a conditional deformation field estimation within a unified registration model. This is realized through a dynamic learning paradigm that explicitly encodes: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling the generation of scenario-optimal deformation fields. Through comprehensive experiments encompassing 90 anatomical structures at different body regions, our UniReg model demonstrates comparable performance with contemporary state-of-the-art methodologies while achieving approximately 50% reduction in required training iterations relative to the conventional learning-based paradigm. This optimization contributes to a significant reduction in computational resources, such as training time. Code and model will be available.

Title: DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

Authors: Dewei Zhou, Mingwei Li, Zongxin Yang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12885
Pdf URL: https://arxiv.org/pdf/2503.12885
Copy Paste: [[2503.12885]] DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models(https://arxiv.org/abs/2503.12885)
Keywords: generation
Abstract: Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project Page: this https URL.

Title: MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation

Authors: Huangwei Chen, Zhu Zhu, Zhenyu Yan, Yifei Chen, Mingyang Ding, Chenlei Li, Feiwei Qin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12927
Pdf URL: https://arxiv.org/pdf/2503.12927
Copy Paste: [[2503.12927]] MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation(https://arxiv.org/abs/2503.12927)
Keywords: generation
Abstract: Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, using a dual-branch architecture to independently extract visual and textual features. These features are fused via Progressive Robust Multi-Modal Fusion (PRMF) Block for stable training. Experimental results show that the MMLNB model is more accurate than the single modal model. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at this https URL.

Title: AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

Authors: Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Xiuli Shao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12929
Pdf URL: https://arxiv.org/pdf/2503.12929
Copy Paste: [[2503.12929]] AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction(https://arxiv.org/abs/2503.12929)
Keywords: generation
Abstract: Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.

Title: Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs

Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12932
Pdf URL: https://arxiv.org/pdf/2503.12932
Copy Paste: [[2503.12932]] Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs(https://arxiv.org/abs/2503.12932)
Keywords: generative
Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) or increased architectural complexity due to the use of sophisticated generative models. In this paper, we propose a generic and computationally efficient framework that can adapt a standard unconstrained RL method to ACRL through two modifications: (i) To enforce the action constraints, we leverage the classic acceptance-rejection method, where we treat the unconstrained policy as the proposal distribution and derive a modified policy with feasible actions. (ii) To improve the acceptance rate of the proposal distribution, we construct an augmented two-objective Markov decision process (MDP), which includes additional self-loop state transitions and a penalty signal for the rejected actions. This augmented MDP incentivizes the learned policy to stay close to the feasible action sets. Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods. We have made the source code publicly available to encourage further research in this direction.
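
The acceptance-rejection step named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `sample_feasible_action`, the Gaussian proposal, and the unit-disc feasible set are all hypothetical choices made for illustration. The proposal stands in for an unconstrained policy, and the count of rejected draws is the kind of quantity a penalty signal could be built from.

```python
import random


def sample_feasible_action(propose, is_feasible, max_tries=1000):
    """Draw from `propose` until `is_feasible` accepts the draw.

    Returns the accepted action and the number of rejected draws.
    Both callables are hypothetical stand-ins: `propose` plays the role
    of the unconstrained policy, `is_feasible` the action constraint.
    """
    for rejected in range(max_tries):
        action = propose()
        if is_feasible(action):
            return action, rejected
    raise RuntimeError("no feasible action found within max_tries")


# Assumed toy setting: 2-D Gaussian proposal, feasible set is the unit disc.
rng = random.Random(0)


def propose():
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def is_feasible(action):
    return action[0] ** 2 + action[1] ** 2 <= 1.0


action, n_rejected = sample_feasible_action(propose, is_feasible)
print(action, n_rejected)
```

The accepted action always satisfies the constraint by construction. The cost is the number of wasted proposals, which is why the abstract pairs rejection sampling with a mechanism that raises the acceptance rate.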

Title: L2HCount:Generalizing Crowd Counting from Low to High Crowd Density via Density Simulation

Authors: Guoliang Xu, Jianqin Yin, Ren Zhang, Yonghao Dang, Feng Zhou, Bo Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12935
Pdf URL: https://arxiv.org/pdf/2503.12935
Copy Paste: [[2503.12935]] L2HCount:Generalizing Crowd Counting from Low to High Crowd Density via Density Simulation(https://arxiv.org/abs/2503.12935)
Keywords: generation
Abstract: Since COVID-19, crowd-counting tasks have gained wide applications. While supervised methods are reliable, annotation is more challenging in high-density scenes due to small head sizes and severe occlusion, whereas it's simpler in low-density scenes. Interestingly, can we train the model in low-density scenes and generalize it to high-density scenes? Therefore, we propose a low- to high-density generalization framework (L2HCount) that learns the pattern related to high-density scenes from low-density ones, enabling it to generalize well to high-density scenes. Specifically, we first introduce a High-Density Simulation Module and a Ground-Truth Generation Module to construct fake high-density images and their corresponding ground-truth crowd annotations, respectively, using an image-shifting technique, effectively simulating high-density crowd patterns. However, the simulated images have two issues: image blurring and loss of low-density image characteristics. Therefore, second, we propose a Head Feature Enhancement Module to extract clear features in the simulated high-density scene. Third, we propose a Dual-Density Memory Encoding Module that uses two crowd memories to learn scene-specific patterns from low- and simulated high-density scenes, respectively. Extensive experiments on four challenging datasets have shown the promising performance of L2HCount.

Title: Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12953
Pdf URL: https://arxiv.org/pdf/2503.12953
Copy Paste: [[2503.12953]] Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction(https://arxiv.org/abs/2503.12953)
Keywords: generation
Abstract: Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice, TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at this https URL.

Title: Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

Authors: Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12963
Pdf URL: https://arxiv.org/pdf/2503.12963
Copy Paste: [[2503.12963]] Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait(https://arxiv.org/abs/2503.12963)
Keywords: generation, generative
Abstract: Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. Our codes are available at this https URL.

Title: Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Authors: Eliot Beyler (SIERRA), Francis Bach (SIERRA)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12966
Pdf URL: https://arxiv.org/pdf/2503.12966
Copy Paste: [[2503.12966]] Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity(https://arxiv.org/abs/2503.12966)
Keywords: generative
Abstract: Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, which we name "full-denoising", to the alternative "half-denoising" introduced by Hyvärinen (2024). We show that looking at the performances in terms of distance between distributions tells a more nuanced story, with different assumptions on the data leading to very different conclusions. We prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.
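The contrast between the two denoisers can be checked by hand on 1-D Gaussian data. The sketch below is an illustration, not the paper's analysis. It assumes (this is not stated in the abstract) that full-denoising is the posterior-mean step y + sigma^2 * score(y) and that half-denoising takes half of that step, y + (sigma^2 / 2) * score(y). For data N(0, s2) both maps are linear, so the variance of the denoised sample has a closed form. The function name `output_variances` is hypothetical.

```python
def output_variances(s2: float, sigma2: float) -> tuple[float, float]:
    """Variance after full- and half-denoising for data N(0, s2), noise variance sigma2."""
    noisy_var = s2 + sigma2                      # variance of y = x + noise
    # The score of N(0, noisy_var) is -y / noisy_var, so each denoiser is y * gain.
    full_gain = 1.0 - sigma2 / noisy_var         # assumed full step
    half_gain = 1.0 - 0.5 * sigma2 / noisy_var   # assumed half step
    return full_gain**2 * noisy_var, half_gain**2 * noisy_var


# Regular density: unit-variance Gaussian data.
full_var, half_var = output_variances(s2=1.0, sigma2=0.1)

# Singular density: a Dirac mass at 0 (data variance 0).
full_dirac, half_dirac = output_variances(s2=0.0, sigma2=0.1)
```

In this toy setting, half-denoising lands closer to the true variance for the smooth Gaussian, while full-denoising recovers the Dirac mass exactly and half-denoising leaves residual spread. That matches the direction of the claims in the abstract.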

Title: Action tube generation by person query matching for spatio-temporal action detection

Authors: Kazuki Omi, Jion Oshima, Toru Tamaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12969
Pdf URL: https://arxiv.org/pdf/2503.12969
Copy Paste: [[2503.12969]] Action tube generation by person query matching for spatio-temporal action detection(https://arxiv.org/abs/2503.12969)
Keywords: generation
Abstract: This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.
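The idea of linking the same person across frames by matching query embeddings can be sketched as follows. This is not the authors' QMM: the greedy cosine-similarity matching, the `threshold` value, and the function names are assumptions chosen only to illustrate how embeddings that metric learning has pulled together could be linked into a tube.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def link_queries(prev, curr, threshold=0.5):
    """Greedily link previous-frame queries to the most similar unused current-frame queries."""
    pairs = sorted(
        ((cosine(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    links, used_prev, used_curr = {}, set(), set()
    for sim, i, j in pairs:
        if sim < threshold:          # remaining pairs are too dissimilar to link
            break
        if i not in used_prev and j not in used_curr:
            links[i] = j
            used_prev.add(i)
            used_curr.add(j)
    return links


# Two people whose query order is swapped between consecutive frames.
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
curr_frame = [[0.1, 0.9], [0.9, 0.1]]
links = link_queries(prev_frame, curr_frame)
```

Chaining such links frame by frame yields one query sequence per person, which is the kind of variable-length input the abstract describes for action classification.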

Title: Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning

Authors: Xueying Jiang, Wenhao Li, Xiaoqin Zhang, Ling Shao, Shijian Lu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.12974
Pdf URL: https://arxiv.org/pdf/2503.12974
Copy Paste: [[2503.12974]] Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning(https://arxiv.org/abs/2503.12974)
Keywords: generation
Abstract: 3D activity reasoning and planning has attracted increasing attention in human-robot interaction and embodied AI thanks to recent advances in multimodal learning. However, most existing works share two constraints: 1) heavy reliance on explicit instructions with little reasoning on implicit user intention; 2) neglect of inter-step route planning for robot moves. To bridge the gaps, we propose 3D activity reasoning and planning, a novel 3D task that reasons about the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning under the guidance of fine-grained 3D object shapes and locations from scene segmentation. We tackle the new 3D task from two perspectives. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions and detailed annotations for multi-step task planning, inter-step route planning, and fine-grained segmentation. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps, as well as a scene graph that is updated dynamically for capturing critical objects and their spatial relations. Extensive experiments demonstrate the effectiveness of our benchmark and framework in reasoning about activities from implicit human instructions, producing accurate stepwise task plans, and seamlessly integrating route planning for multi-step moves. The dataset and code will be released.

Title: Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization

Authors: Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Qizhe Zhang, Yulin Luo, Hao Liang, Wentao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12999
Pdf URL: https://arxiv.org/pdf/2503.12999
Copy Paste: [[2503.12999]] Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization(https://arxiv.org/abs/2503.12999)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at this https URL.

Title: TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba

Authors: Jiaxu Liu, Li Li, Hubert P. H. Shum, Toby P. Breckon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13004
Pdf URL: https://arxiv.org/pdf/2503.13004
Copy Paste: [[2503.13004]] TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba(https://arxiv.org/abs/2503.13004)
Keywords: generation, generative
Abstract: Diffusion models currently demonstrate impressive performance over various generative tasks. Recent work on image diffusion highlights the strong capabilities of Mamba (state space models) due to its efficient handling of long-range dependencies and sequential data modeling. Unfortunately, joint consideration of state space models with 3D point cloud generation remains limited. To harness the powerful capabilities of the Mamba model for 3D point cloud generation, we propose a novel diffusion framework containing a dual latent Mamba block (DM-Block) and a time-variant frequency encoder (TF-Encoder). The DM-Block applies a space-filling curve to reorder points into sequences suitable for Mamba state-space modeling, while operating in a latent space to mitigate the computational overhead that arises from direct 3D data processing. Meanwhile, the TF-Encoder takes advantage of the ability of the diffusion model to refine fine details in later recovery stages by prioritizing key points within the U-Net architecture. This frequency-based mechanism ensures enhanced detail quality in the final stages of generation. Experimental results on the ShapeNet-v2 dataset demonstrate that our method achieves state-of-the-art performance (ShapeNet-v2: 0.14% on 1-NNA-Abs50 EMD and 57.90% on COV EMD) on certain metrics for specific categories while reducing computational parameters and inference time by up to 10× and 9×, respectively. Source code is available in Supplementary Materials and will be released upon acceptance.
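Reordering an unordered point cloud into a sequence with a space-filling curve can be illustrated with a Z-order (Morton) curve. The abstract does not say which curve the DM-Block uses, so the choice of Morton ordering, the 10-bit quantization, and the function names below are assumptions made for this sketch.

```python
def morton_code(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the bits of three integer grid coordinates (Z-order / Morton code)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def reorder_points(points, bits: int = 10):
    """Sort points with coordinates in [0, 1) along the Z-order curve."""
    scale = 1 << bits

    def key(p):
        # Quantize each coordinate to an integer grid cell, clamped to the grid.
        return morton_code(*(min(int(c * scale), scale - 1) for c in p), bits=bits)

    return sorted(points, key=key)


cloud = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (0.1, 0.1, 0.2)]
ordered = reorder_points(cloud)
```

Points that are close in 3D tend to end up close in the resulting sequence, which is the property a sequence model such as Mamba can exploit.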

Title: Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

Authors: Zeyi Huang, Utkarsh Ojha, Yuyang Ji, Donghyun Lee, Yong Jae Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13058
Pdf URL: https://arxiv.org/pdf/2503.13058
Copy Paste: [[2503.13058]] Do Vision Models Develop Human-Like Progressive Difficulty Understanding?(https://arxiv.org/abs/2503.13058)
Keywords: generative
Abstract: When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.
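The adaptive evaluation described above, where the result of one round decides which images are shown in the next, can be sketched as a simple staircase procedure. The rule used here (move one level harder after a passed round, one level easier after a failed one), the `pass_rate` threshold, and all names are assumptions for illustration. The abstract does not specify the actual selection rule.

```python
def adaptive_test(is_correct, items_by_level, rounds=4, batch=2, pass_rate=0.5):
    """Staircase test: go harder after a passed round, easier after a failed one.

    is_correct(item) -> bool stands in for the model under test (hypothetical).
    Returns the list of (level, accuracy) pairs that were visited.
    """
    levels = sorted(items_by_level)
    pos = len(levels) // 2                 # start at the middle difficulty
    cursor = {lv: 0 for lv in levels}      # next unused item per level
    history = []
    for _ in range(rounds):
        lv = levels[pos]
        items = items_by_level[lv][cursor[lv]:cursor[lv] + batch]
        if not items:                      # no unused items left at this level
            break
        cursor[lv] += len(items)
        acc = sum(is_correct(x) for x in items) / len(items)
        history.append((lv, acc))
        pos = min(pos + 1, len(levels) - 1) if acc >= pass_rate else max(pos - 1, 0)
    return history


# Toy model that solves difficulty levels 1 and 2 but fails level 3.
items_by_level = {lv: [(lv, k) for k in range(4)] for lv in (1, 2, 3)}
history = adaptive_test(lambda item: item[0] <= 2, items_by_level)
```

In this toy run the easy level is never queried: the test oscillates around the boundary between levels 2 and 3, which is where the model's ability actually lies.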

Title: Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13070
Pdf URL: https://arxiv.org/pdf/2503.13070
Copy Paste: [[2503.13070]] Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation(https://arxiv.org/abs/2503.13070)
Keywords: generation, generative
Abstract: Aligning generated images to complicated text prompts and human preferences is a central challenge in Artificial Intelligence-Generated Content (AIGC). With reward-enhanced diffusion distillation emerging as a promising approach that boosts controllability and fidelity of text-to-image models, we identify a fundamental paradigm shift: as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation. In contrast, the diffusion losses serve as an overly expensive form of regularization. To thoroughly validate our hypothesis, we introduce R0, a novel conditional generation approach via regularized reward maximization. Instead of relying on tricky diffusion distillation losses, R0 proposes a new perspective that treats image generation as an optimization problem in data space, which aims to search for valid images that have high compositional rewards. By innovative designs of the generator parameterization and proper regularization techniques, we train state-of-the-art few-step text-to-image generative models with R0 at scale. Our results challenge the conventional wisdom of diffusion post-training and conditional generation by demonstrating that rewards play a dominant role in scenarios with complex conditions. We hope our findings can contribute to further research into human-centric and reward-centric generation paradigms across the broader field of AIGC. Code is available at this https URL.

Title: DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model

Authors: Zhicheng Zhao, Jinquan Yan, Chenglong Li, Xiao Wang, Jin Tang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.13073
Pdf URL: https://arxiv.org/pdf/2503.13073
Copy Paste: [[2503.13073]] DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model(https://arxiv.org/abs/2503.13073)
Keywords: quality assessment
Abstract: Optical remote sensing image dehazing presents significant challenges due to its extensive spatial scale and highly non-uniform haze distribution, which traditional single-image dehazing methods struggle to address effectively. While Synthetic Aperture Radar (SAR) imagery offers inherently haze-free reference information for large-scale scenes, existing SAR-guided dehazing approaches face two critical limitations: the integration of SAR information often diminishes the quality of haze-free regions, and the instability of feature quality further exacerbates cross-modal domain shift. To overcome these challenges, we introduce DehazeMamba, a novel SAR-guided dehazing network built on a progressive haze decoupling fusion strategy. Our approach incorporates two key innovations: a Haze Perception and Decoupling Module (HPDM) that dynamically identifies haze-affected regions through optical-SAR difference analysis, and a Progressive Fusion Module (PFM) that mitigates domain shift through a two-stage fusion process based on feature quality assessment. To facilitate research in this domain, we present MRSHaze, a large-scale benchmark dataset comprising 8,000 pairs of temporally synchronized, precisely geo-registered SAR-optical images with high resolution and diverse haze conditions. Extensive experiments demonstrate that DehazeMamba significantly outperforms state-of-the-art methods, achieving a 0.73 dB improvement in PSNR and substantial enhancements in downstream tasks such as semantic segmentation. The dataset is available at this https URL.
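To put the reported 0.73 dB PSNR gain in perspective, the standard definition PSNR = 10 * log10(MAX^2 / MSE) can be inverted to get the corresponding change in mean squared error. The short calculation below uses only that textbook formula. The helper names and the 8-bit peak value of 255 are assumptions, and the baseline MSE of 100 is an arbitrary example, not a number from the paper.

```python
import math


def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value**2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(0.73)                    # roughly 0.845
gain_check = psnr(100.0 * ratio) - psnr(100.0)      # recovers the 0.73 dB gain
```

Under this definition, a 0.73 dB improvement corresponds to roughly 15% lower mean squared error, whatever the baseline.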

Title: Rethinking Image Evaluation in Super-Resolution

Authors: Shaolin Su, Josep M. Rocafort, Danna Xue, David Serrano-Lozano, Lei Sun, Javier Vazquez-Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13074
Pdf URL: https://arxiv.org/pdf/2503.13074
Copy Paste: [[2503.13074]] Rethinking Image Evaluation in Super-Resolution(https://arxiv.org/abs/2503.13074)
Keywords: super-resolution
Abstract: While recent image super-resolution (SR) techniques are continually improving the perceptual quality of their outputs, they often fail in quantitative evaluations. This inconsistency leads to a growing distrust in existing image metrics for SR evaluations. Though image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not inspect the role of GTs, as they are generally accepted as 'perfect' references. However, because the data were collected in early years and other types of distortions were not controlled, we point out that GTs in existing SR datasets can exhibit relatively poor quality, which leads to biased evaluations. Following this observation, in this paper, we are interested in the following questions: Are GT images in existing SR datasets 100% trustworthy for model evaluations? How does GT quality affect this evaluation? And how can fair evaluations be made when imperfect GTs exist? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that SR performances can be consistently affected across models by low-quality GTs, and models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, Relative Quality Index (RQI), that measures the relative quality discrepancy of image pairs, thus addressing the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.

Title: DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry

Authors: Jing Li, Yihang Fu, Falai Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13110
Pdf URL: https://arxiv.org/pdf/2503.13110
Copy Paste: [[2503.13110]] DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry(https://arxiv.org/abs/2503.13110)
Keywords: generation, generative
Abstract: Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps. Our code is publicly available at this https URL.

Title: 3D Human Interaction Generation: A Survey

Authors: Siyuan Fan, Wenke Huang, Xiantao Cai, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13120
Pdf URL: https://arxiv.org/pdf/2503.13120
Copy Paste: [[2503.13120]] 3D Human Interaction Generation: A Survey(https://arxiv.org/abs/2503.13120)
Keywords: generation, generative
Abstract: 3D human interaction generation has emerged as a key research area, focusing on producing dynamic and contextually relevant interactions between humans and various interactive entities. Recent rapid advancements in 3D model representation methods, motion capture technologies, and generative models have laid a solid foundation for the growing interest in this domain. Existing research in this field can be broadly categorized into three areas: human-scene interaction, human-object interaction, and human-human interaction. Despite the rapid advancements in this area, challenges remain due to the need for naturalness in human motion generation and the accurate interaction between humans and interactive entities. In this survey, we present a comprehensive literature review of human interaction generation, which, to the best of our knowledge, is the first of its kind. We begin by introducing the foundational technologies, including model representations, motion capture methods, and generative models. Subsequently, we introduce the approaches proposed for the three sub-tasks, along with their corresponding datasets and evaluation metrics. Finally, we discuss potential future research directions in this area and conclude the survey. Through this survey, we aim to offer a comprehensive overview of the current advancements in the field, highlight key challenges, and inspire future research works.

Title: ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation

Authors: Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13130
Pdf URL: https://arxiv.org/pdf/2503.13130
Copy Paste: [[2503.13130]] ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation(https://arxiv.org/abs/2503.13130)
Keywords: generation, generative
Abstract: We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic and semantically consistent HOIs. Code is available at this https URL.

Title: Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images

Authors: Yaxi Chen, Simin Ni, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13131
Pdf URL: https://arxiv.org/pdf/2503.13131
Copy Paste: [[2503.13131]] Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images(https://arxiv.org/abs/2503.13131)
Keywords: generative
Abstract: Classical radiomic features have been designed to describe image appearance and intensity patterns. These features are directly interpretable and readily understood by radiologists. Compared with end-to-end deep learning (DL) models, lower dimensional parametric models that use such radiomic features offer enhanced interpretability but lower comparative performance in clinical tasks. In this study, we propose an approach where a standard logistic regression model performance is substantially improved by learning to select radiomic features for individual patients, from a pool of candidate features. This approach has the potential to maintain the interpretability of such approaches while offering comparable performance to DL. We also propose to expand the feature pool by generating a patient-specific healthy persona via mask-inpainting using a denoising diffusion model trained on healthy subjects. Such a pathology-free baseline feature set allows further opportunity in novel feature discovery and improved condition classification. We demonstrate our method on multiple clinical tasks of classifying general abnormalities, anterior cruciate ligament tears, and meniscus tears. Experimental results demonstrate that our approach achieved performance comparable or even superior to state-of-the-art DL approaches while offering added interpretability by using radiomic features extracted from images and supplemented by generating healthy personas. Example clinical cases are discussed in depth to demonstrate the interpretability-enabled utilities such as human-explainable feature discovery and patient-specific location/view selection. These findings highlight the potential of combining subject-specific feature selection with generative models in augmenting radiomic analysis for more interpretable decision-making. The codes are available at: this https URL

Title: From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective

Authors: Chen Zhao, Zhizhou Chen, Yunzhe Xu, Enxuan Gu, Jian Li, Zili Yi, Qian Wang, Jian Yang, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13165
Pdf URL: https://arxiv.org/pdf/2503.13165
Copy Paste: [[2503.13165]] From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective(https://arxiv.org/abs/2503.13165)
Keywords: restoration
Abstract: Ultra-high-definition (UHD) image restoration faces significant challenges due to its high resolution, complex content, and intricate details. To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). Specifically, the ZFE integrates global priors to learn global mapping, while the LFR restores low-frequency information, emphasizing reconstruction of coarse-grained content. Finally, the HFR employs our designed frequency-windowed Kolmogorov-Arnold networks (FW-KAN) to refine textures and details, producing high-quality image restoration. Our approach significantly outperforms previous UHD methods across various tasks, with extensive ablation studies validating the effectiveness of each component. The code is available at this https URL.
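The three-way split into zero-frequency, low-frequency, and high-frequency content can be made concrete on a 1-D signal with a discrete Fourier transform. This is a fixed, hand-written decomposition for illustration only: ERR uses learned sub-networks on images, and the `cutoff` parameter and function names below are assumptions.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(X):
    """Inverse transform, returning the real part of each sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def split_bands(x, cutoff):
    """Split a real 1-D signal into zero-, low-, and high-frequency parts that sum to x."""
    n = len(x)
    X = dft(x)

    def band(keep):
        # min(k, n - k) is the frequency index of bin k, treating mirrored bins alike.
        return idft([X[k] if keep(min(k, n - k)) else 0 for k in range(n)])

    zero = band(lambda f: f == 0)            # the mean (global brightness) component
    low = band(lambda f: 0 < f <= cutoff)    # coarse structure
    high = band(lambda f: f > cutoff)        # fine detail
    return zero, low, high


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 0.0]
zero, low, high = split_bands(signal, cutoff=1)
```

The zero-frequency part is constant and equal to the signal mean, and the three parts add back to the original signal, so each stage can in principle be handled separately without losing information.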

Title: A super-resolution reconstruction method for lightweight building images based on an expanding feature modulation network

Authors: Yi Zhang, Wenye Zhou, Ruonan Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13179
Pdf URL: https://arxiv.org/pdf/2503.13179
Copy Paste: [[2503.13179]] A super-resolution reconstruction method for lightweight building images based on an expanding feature modulation network(https://arxiv.org/abs/2503.13179)
Keywords: super-resolution
Abstract: This study proposes a lightweight method for building image super-resolution using a Dilated Contextual Feature Modulation Network (DCFMN). The process includes obtaining high-resolution images, down-sampling them to low-resolution, enhancing the low-resolution images, constructing and training a lightweight network model, and generating super-resolution outputs. To address challenges such as regular textures and long-range dependencies in building images, the DCFMN integrates an expansion separable modulation unit and a local feature enhancement module. The former employs multiple expansion convolutions equivalent to a large kernel to efficiently aggregate multi-scale features while leveraging a simple attention mechanism for adaptivity. The latter encodes local features, mixes channel information, and ensures no additional computational burden during inference through reparameterization. This approach effectively resolves the limitations of existing lightweight super-resolution networks in modeling long-range dependencies, achieving accurate and efficient global feature modeling without increasing computational costs, and significantly improving both reconstruction quality and lightweight efficiency for building image super-resolution models.

Title: Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process

Authors: Yuanze Li, Shihao Yuan, Haolin Wang, Qizhang Li, Ming Liu, Chen Xu, Guangming Shi, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13184
Pdf URL: https://arxiv.org/pdf/2503.13184
Copy Paste: [[2503.13184]] Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process(https://arxiv.org/abs/2503.13184)
Keywords: generation
Abstract: Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at this https URL.

Title: MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis

Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Isabelle Ayx, Matthias Alexander Fink, Stefan O. Schoenberg, Hans-Ulrich Kauczor, Sandy Engelhardt
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13211
Pdf URL: https://arxiv.org/pdf/2503.13211
Copy Paste: [[2503.13211]] MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis(https://arxiv.org/abs/2503.13211)
Keywords: generative
Abstract: Advancements in AI for medical imaging offer significant potential. However, their applications are constrained by the limited availability of data and the reluctance of medical centers to share it due to patient privacy concerns. Generative models present a promising solution by creating synthetic data as a substitute for real patient data. However, medical images are typically high-dimensional, and current state-of-the-art methods are often impractical for computational resource-constrained healthcare environments. These models rely on data sub-sampling, raising doubts about their feasibility and real-world applicability. Furthermore, many of these models are evaluated on quantitative metrics that alone can be misleading in assessing the image quality and clinical meaningfulness of the generated images. To address this, we introduce MedLoRD, a generative diffusion model designed for computational resource-constrained environments. MedLoRD is capable of generating high-dimensional medical volumes with resolutions up to 512$\times$512$\times$256, utilizing GPUs with only 24GB VRAM, which are commonly found in standard desktop workstations. MedLoRD is evaluated across multiple modalities, including Coronary Computed Tomography Angiography and Lung Computed Tomography datasets. Extensive evaluations through radiological evaluation, relative regional volume analysis, adherence to conditional masks, and downstream tasks show that MedLoRD generates high-fidelity images closely adhering to segmentation mask conditions, surpassing the capabilities of current state-of-the-art generative models for medical image synthesis in computational resource-constrained environments.
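
To put the quoted resolution in perspective, the arithmetic below counts the voxels in a single 512 x 512 x 256 volume and its raw storage size. This is only a back-of-the-envelope sketch: the helper name `volume_footprint` and the choice of 4-byte and 2-byte voxels are illustrative assumptions, not details taken from the paper, and the figures cover one stored volume only, not the activations or weights of any model.

```python
from math import prod


def volume_footprint(shape, bytes_per_voxel):
    """Return (voxel count, size in MiB) for a dense volume.

    `shape` is the volume resolution; `bytes_per_voxel` is an assumed
    storage precision (4 for float32, 2 for float16).
    """
    voxels = prod(shape)
    mib = voxels * bytes_per_voxel / 2**20
    return voxels, mib


# Resolution quoted in the abstract; the precisions are assumptions.
voxels, mib_fp32 = volume_footprint((512, 512, 256), 4)
_, mib_fp16 = volume_footprint((512, 512, 256), 2)

print(f"voxels per volume: {voxels:,}")
print(f"float32 storage:   {mib_fp32:.0f} MiB")
print(f"float16 storage:   {mib_fp16:.0f} MiB")
```

A single volume is therefore small next to 24GB of VRAM; the memory pressure in practice comes from intermediate feature maps, which this sketch does not model.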

Title: MAME: Multidimensional Adaptive Metamer Exploration with Human Perceptual Feedback

Authors: Mina Kamao, Hayato Ono, Ayumu Yamashita, Kaoru Amano, Masataka Sawayama
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13212
Pdf URL: https://arxiv.org/pdf/2503.13212
Copy Paste: [[2503.13212]] MAME: Multidimensional Adaptive Metamer Exploration with Human Perceptual Feedback(https://arxiv.org/abs/2503.13212)
Keywords: generation
Abstract: Alignment between human brain networks and artificial models is actively studied in machine learning and neuroscience. A widely adopted approach to explore their functional alignment is to identify metamers for both humans and models. Metamers refer to input stimuli that are physically different but equivalent within a given system. If a model's metameric space completely matched the human metameric space, the model would achieve functional alignment with humans. However, conventional methods lack direct ways to search for human metamers. Instead, researchers first develop biologically inspired models and then infer about human metamers indirectly by testing whether model metamers also appear as metamers to humans. Here, we propose the Multidimensional Adaptive Metamer Exploration (MAME) framework, enabling direct high-dimensional exploration of human metameric space. MAME leverages online image generation guided by human perceptual feedback. Specifically, it modulates reference images across multiple dimensions by leveraging hierarchical responses from convolutional neural networks (CNNs). Generated images are presented to participants whose perceptual discriminability is assessed in a behavioral task. Based on participants' responses, subsequent image generation parameters are adaptively updated online. Using our MAME framework, we successfully measured a human metameric space of over fifty dimensions within a single experiment. Experimental results showed that human discrimination sensitivity was lower for metameric images based on low-level features compared to high-level features, which image contrast metrics could not explain. The finding suggests that the model computes low-level information not essential for human perception. Our framework has the potential to contribute to developing interpretable AI and understanding of brain function in neuroscience.

Title: HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures

Authors: Yongkang Cheng, Shaoli Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13229
Pdf URL: https://arxiv.org/pdf/2503.13229
Copy Paste: [[2503.13229]] HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures(https://arxiv.org/abs/2503.13229)
Keywords: generation
Abstract: Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoloGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth and providing an immersive user experience. Our code, model, and demo are available at this https URL.

Title: Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks

Authors: Amit Zalcher, Navve Wasserman, Roman Beliy, Oliver Heinimann, Michal Irani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13260
Pdf URL: https://arxiv.org/pdf/2503.13260
Copy Paste: [[2503.13260]] Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks(https://arxiv.org/abs/2503.13260)
Keywords: quality assessment
Abstract: Visual perceptual tasks aim to predict human judgment of images (e.g., emotions evoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data labeling difficult. The scarcity of such human-annotated data results in small datasets, leading to poor generalization. Typically, specialized models were designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.

Title: Graph Generative Models Evaluation with Masked Autoencoder

Authors: Chengen Wang, Murat Kantarcioglu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13271
Pdf URL: https://arxiv.org/pdf/2503.13271
Copy Paste: [[2503.13271]] Graph Generative Models Evaluation with Masked Autoencoder(https://arxiv.org/abs/2503.13271)
Keywords: generative
Abstract: In recent years, numerous graph generative models (GGMs) have been proposed. However, evaluating these models remains a considerable challenge, primarily due to the difficulty in extracting meaningful graph features that accurately represent real-world graphs. The traditional evaluation techniques, which rely on graph statistical properties like node degree distribution, clustering coefficients, or Laplacian spectrum, overlook node features and lack scalability. Newly proposed deep learning-based methods employ graph random neural networks or contrastive learning to extract graph features and demonstrate superior performance compared to traditional statistical methods, but their experimental results also show that these methods do not always work well across different metrics. Although there are overlaps among these metrics, they are generally not interchangeable, each evaluating generative models from a different perspective. In this paper, we propose a novel method that leverages graph masked autoencoders to effectively extract graph features for GGM evaluations. We conduct extensive experiments on graphs and empirically demonstrate that our method can be more reliable and effective than previously proposed methods across a number of GGM evaluation metrics, such as "Fréchet Distance (FD)" and "MMD Linear". However, no single method stands out consistently across all metrics and datasets. Therefore, this study also aims to raise awareness of the significance and challenges associated with GGM evaluation techniques, especially in light of recent advances in generative models.
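
For readers unfamiliar with the two metrics named in the abstract, their standard textbook forms are sketched below. The symbols are assumptions introduced here for illustration: $\phi(\cdot)$ is whatever feature extractor is being evaluated, $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the features of real and generated graphs, and $x_i$, $y_j$ are the real and generated samples. Whether the paper uses exactly these estimators is not stated in the abstract.

```latex
% Fréchet Distance between Gaussians fitted to real (r) and generated (g) features
\mathrm{FD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% MMD with a linear kernel k(a, b) = a^\top b (biased estimate):
% it reduces to the squared distance between the two feature means
\mathrm{MMD}^2_{\mathrm{lin}} =
  \left\lVert \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) - \frac{1}{m}\sum_{j=1}^{m} \phi(y_j) \right\rVert_2^2
```

Both quantities are zero when the two feature distributions share the same mean (and, for FD, the same covariance), which is why the quality of the feature extractor $\phi$ largely determines how informative the scores are.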

Title: Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Authors: Katja Schwarz, Norman Mueller, Peter Kontschieder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13272
Pdf URL: https://arxiv.org/pdf/2503.13272
Copy Paste: [[2503.13272]] Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors(https://arxiv.org/abs/2503.13272)
Keywords: generative
Abstract: Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) -- a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+. Project page: this https URL

Title: $ϕ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Jun Liu, Qika Lin, Zhiyong Wu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.13288
Pdf URL: https://arxiv.org/pdf/2503.13288
Copy Paste: [[2503.13288]] $ϕ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation(https://arxiv.org/abs/2503.13288)
Keywords: generation
Abstract: Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named $\phi$-Decoding. To provide a precise and expressive estimation of step value, $\phi$-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show $\phi$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at this https URL, and the open-source PyPI package is coming soon.

Title: Progressive Human Motion Generation Based on Text and Few Motion Frames

Authors: Ling-An Zeng, Gaojie Wu, Ancong Wu, Jian-Fang Hu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13300
Pdf URL: https://arxiv.org/pdf/2503.13300
Copy Paste: [[2503.13300]] Progressive Human Motion Generation Based on Text and Few Motion Frames(https://arxiv.org/abs/2503.13300)
Keywords: generation
Abstract: Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely describing diverse postures. To achieve more controllable generation, an intuitive way is to allow the user to input a few motion frames describing precise desired postures. Thus, we explore a new Text-Frame-to-Motion (TF2M) generation task that aims to generate motions from text and very few given frames. Intuitively, the closer a frame is to a given frame, the lower the uncertainty of this frame is when conditioned on this given frame. Hence, we propose a novel Progressive Motion Generation (PMG) method to progressively generate a motion from the frames with low uncertainty to those with high uncertainty in multiple stages. During each stage, new frames are generated by a Text-Frame Guided Generator conditioned on frame-aware semantics of the text, given frames, and frames generated in previous stages. Additionally, to alleviate the train-test gap caused by multi-stage accumulation of incorrectly generated frames during testing, we propose a Pseudo-frame Replacement Strategy for training. Experimental results show that our PMG outperforms existing T2M generation methods by a large margin with even one given frame, validating the effectiveness of our PMG. Code will be released.
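
The intuition in the abstract (frames closer to a given frame are less uncertain, so they are generated first) can be sketched as a simple scheduling step. This is a minimal illustration and not the authors' implementation: the function names, the use of frame-index distance as the uncertainty proxy, and the equal-sized split into stages are all assumptions made here.

```python
from typing import List, Sequence


def distance_to_nearest_given(num_frames: int, given: Sequence[int]) -> List[int]:
    """Index distance from every frame to its nearest given frame."""
    if not given:
        raise ValueError("at least one given frame is required")
    return [min(abs(t - g) for g in given) for t in range(num_frames)]


def progressive_stages(num_frames: int, given: Sequence[int], num_stages: int) -> List[List[int]]:
    """Group the frames still to be generated into stages, nearest first.

    Distance to the nearest given frame stands in for uncertainty
    (an assumption); ties are broken by frame index.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    dist = distance_to_nearest_given(num_frames, given)
    given_set = set(given)
    pending = sorted(
        (t for t in range(num_frames) if t not in given_set),
        key=lambda t: (dist[t], t),
    )
    if not pending:
        return []
    size = -(-len(pending) // num_stages)  # ceiling division
    return [pending[i:i + size] for i in range(0, len(pending), size)]


# Eight frames, the first and last are given, generation runs in two stages.
stages = progressive_stages(num_frames=8, given=[0, 7], num_stages=2)
print(stages)
```

In this toy schedule the first stage holds the frames adjacent to the given ones and the last stage holds the frames in the middle of the sequence, where conditioning is weakest.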

Title: Agents Play Thousands of 3D Video Games

Authors: Zhongwen Xu, Xianliang Wang, Siyi Li, Tao Yu, Liang Wang, Qiang Fu, Wei Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13356
Pdf URL: https://arxiv.org/pdf/2503.13356
Copy Paste: [[2503.13356]] Agents Play Thousands of 3D Video Games(https://arxiv.org/abs/2503.13356)
Keywords: generation
Abstract: We present PORTAL, a novel framework for developing artificial intelligence agents capable of playing thousands of 3D video games through language-guided policy generation. By transforming decision-making problems into language modeling tasks, our approach leverages large language models (LLMs) to generate behavior trees represented in a domain-specific language (DSL). This method eliminates the computational burden associated with traditional reinforcement learning approaches while preserving strategic depth and rapid adaptability. Our framework introduces a hybrid policy structure that combines rule-based nodes with neural network components, enabling both high-level strategic reasoning and precise low-level control. A dual-feedback mechanism incorporating quantitative game metrics and vision-language model analysis facilitates iterative policy improvement at both tactical and strategic levels. The resulting policies are instantaneously deployable, human-interpretable, and capable of generalizing across diverse gaming environments. Experimental results demonstrate PORTAL's effectiveness across thousands of first-person shooter (FPS) games, showcasing significant improvements in development efficiency, policy generalization, and behavior diversity compared to traditional approaches. PORTAL represents a significant advancement in game AI development, offering a practical solution for creating sophisticated agents that can operate across thousands of commercial video games with minimal development overhead. Experimental results on the 3D video games are best viewed on this https URL.
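
As background for the behavior trees mentioned in the abstract, the sketch below shows the generic data structure with the two classic composite nodes, Sequence and Selector. It is not PORTAL's DSL, whose syntax the abstract does not give: every class name, the `enemy_visible` state key, and the action names are hypothetical, and the neural network nodes of the hybrid policy are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, bool]


@dataclass
class Condition:
    """Leaf node: succeeds when the predicate holds for the game state."""
    check: Callable[[State], bool]

    def tick(self, state: State, log: List[str]) -> bool:
        return self.check(state)


@dataclass
class Action:
    """Leaf node: records the action name and always succeeds."""
    name: str

    def tick(self, state: State, log: List[str]) -> bool:
        log.append(self.name)
        return True


@dataclass
class Sequence:
    """Composite: succeeds only if every child succeeds, in order."""
    children: list

    def tick(self, state: State, log: List[str]) -> bool:
        return all(child.tick(state, log) for child in self.children)


@dataclass
class Selector:
    """Composite: succeeds as soon as one child succeeds."""
    children: list

    def tick(self, state: State, log: List[str]) -> bool:
        return any(child.tick(state, log) for child in self.children)


# Hypothetical policy: shoot when an enemy is visible, otherwise patrol.
tree = Selector([
    Sequence([Condition(lambda s: s["enemy_visible"]), Action("shoot")]),
    Action("patrol"),
])

seen_log: List[str] = []
tree.tick({"enemy_visible": True}, seen_log)
unseen_log: List[str] = []
tree.tick({"enemy_visible": False}, unseen_log)
print(seen_log, unseen_log)
```

Because `all` and `any` short-circuit, a failed condition stops its Sequence before the action runs, and the Selector falls through to the next branch; this explicit control flow is what makes such policies easy for a human to read.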

Title: One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Authors: Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13358
Pdf URL: https://arxiv.org/pdf/2503.13358
Copy Paste: [[2503.13358]] One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation(https://arxiv.org/abs/2503.13358)
Keywords: restoration, super-resolution
Abstract: Diffusion models for super-resolution (SR) produce high-quality visual results but incur high computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them coincides with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and less GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.

Title: MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, q-bio.CB
Abstract URL: https://arxiv.org/abs/2503.13399
Pdf URL: https://arxiv.org/pdf/2503.13399
Copy Paste: [[2503.13399]] MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research(https://arxiv.org/abs/2503.13399)
Keywords: generation
Abstract: Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at this https URL, and the project page is at this https URL.

Title: Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation

Authors: Xinyu Lian, Zichao Yu, Ruiming Liang, Yitong Wang, Li Ray Luo, Kaixu Chen, Yuanzhen Zhou, Qihong Tang, Xudong Xu, Zhaoyang Lyu, Bo Dai, Jiangmiao Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13424
Pdf URL: https://arxiv.org/pdf/2503.13424
Copy Paste: [[2503.13424]] Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation(https://arxiv.org/abs/2503.13424)
Keywords: generation, generative
Abstract: Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation-based, and are limited by the scale and quality of the training data or by the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that surpass current state-of-the-art methods and are comparable to human-annotated datasets in both physical properties and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at this https URL.

Title: BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

Authors: Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.13434
Pdf URL: https://arxiv.org/pdf/2503.13434
Copy Paste: [[2503.13434]] BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing(https://arxiv.org/abs/2503.13434)
Keywords: generation
Abstract: Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: this https URL

Title: WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

Authors: Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13435
Pdf URL: https://arxiv.org/pdf/2503.13435
Copy Paste: [[2503.13435]] WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes(https://arxiv.org/abs/2503.13435)
Keywords: generation
Abstract: With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing, and existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits the ability to achieve high-quality 4D scene reconstruction in such settings. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods. Project: this https URL

Title: Unified Autoregressive Visual Generation and Understanding with Continuous Tokens

Authors: Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, Michael Rubinstein, Michalis Raptis, Deqing Sun, Radu Soricut
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13436
Pdf URL: https://arxiv.org/pdf/2503.13436
Copy Paste: [[2503.13436]] Unified Autoregressive Visual Generation and Understanding with Continuous Tokens(https://arxiv.org/abs/2503.13436)
Keywords: generation
Abstract: We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important to achieve high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, demonstrating strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.

Title: Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Authors: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13439
Pdf URL: https://arxiv.org/pdf/2503.13439
Copy Paste: [[2503.13439]] Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images(https://arxiv.org/abs/2503.13439)
Keywords: generative
Abstract: Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.