2025-07-01

Title: Counting with Confidence: Accurate Pest Monitoring in Water Traps

Authors: Xumin Gao, Mark Stevens, Grzegorz Cielniak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22438
Pdf URL: https://arxiv.org/pdf/2506.22438
Copy Paste: [[2506.22438]] Counting with Confidence: Accurate Pest Monitoring in Water Traps(https://arxiv.org/abs/2506.22438)
Keywords: quality assessment
Abstract: Accurate pest population monitoring and tracking of population dynamics are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios where, for lack of ground truth, the reliability of the counting results is never assessed. To this end, this paper proposes a method for comprehensively evaluating pest counting confidence in the image, based on information related to counting results and external environmental conditions. First, a pest detection network is used for pest detection and counting, extracting counting result-related information. Then, the pest images undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment. The changes in image clarity caused by stirring during image acquisition are quantified by calculating the average gradient magnitude. Notably, we design a hypothesis-driven multi-factor sensitivity analysis method to select the optimal image quality assessment and image complexity assessment methods, and we propose an adaptive DBSCAN clustering algorithm for pest distribution uniformity assessment. Finally, the obtained information related to counting results and external environmental conditions is input into a regression model for prediction, resulting in the final pest counting confidence. To the best of our knowledge, this is the first study dedicated to comprehensively evaluating counting confidence in counting tasks, and to quantifying the relationship between influencing factors and counting confidence through a model. Experimental results show our method reduces MSE by 31.7% and improves R2 by 15.2% on the pest counting confidence test set, compared to the baseline built primarily on information related to counting results.
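Summary: Accurate monitoring of pest populations and tracking of their dynamics is essential for precision-agriculture decisions. A common limitation of existing vision-based automatic pest counting research is that models are evaluated on datasets with ground truth but deployed in the real world, where no ground truth exists, without any assessment of how reliable the counts are. The paper therefore proposes a method for comprehensively evaluating pest counting confidence in an image, using information related to the counting results and to external environmental conditions. A pest detection network first detects and counts pests and yields counting-result information. The images then undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment, and changes in image clarity caused by stirring during acquisition are quantified with the average gradient magnitude. A hypothesis-driven multi-factor sensitivity analysis selects the best image quality and image complexity assessment methods, and an adaptive DBSCAN clustering algorithm assesses distribution uniformity. Finally, all of this information is fed into a regression model that predicts the final counting confidence. To the authors' knowledge, this is the first study devoted to comprehensively evaluating counting confidence in counting tasks and to modeling the relationship between influencing factors and counting confidence. Compared with a baseline built mainly on counting-result information, the method reduces MSE by 31.7% and improves R2 by 15.2% on the pest counting confidence test set.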
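
The abstract above names the average gradient magnitude as its clarity measure but does not say which gradient operator is used. The following is a minimal sketch, assuming simple forward differences on a grayscale image stored as a list of rows; the function name `average_gradient_magnitude` and the toy images are hypothetical and are not taken from the paper.

```python
import math

def average_gradient_magnitude(img):
    """Mean forward-difference gradient magnitude of a 2D grayscale image.

    Assumption: forward differences are used; the paper may use another
    operator (for example Sobel).
    """
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]   # horizontal difference
            gy = img[y + 1][x] - img[y][x]   # vertical difference
            total += math.hypot(gx, gy)
            count += 1
    return total / count if count else 0.0

# Toy images: a high-contrast checkerboard, the same pattern with reduced
# contrast (a stand-in for a frame blurred by stirring), and a flat image.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(6)] for y in range(6)]
low_contrast = [[155 if (x + y) % 2 == 0 else 100 for x in range(6)] for y in range(6)]
flat = [[128] * 6 for _ in range(6)]

for name, image in [("sharp", sharp), ("low_contrast", low_contrast), ("flat", flat)]:
    print(name, round(average_gradient_magnitude(image), 2))
```

A lower value on an otherwise similar scene suggests a less clear frame, which is the kind of signal the abstract describes feeding into the confidence regression.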

Title: Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security

Authors: Saad Alqithami
Subjects: cs.LG, cs.AI, cs.CR, cs.MA
Abstract URL: https://arxiv.org/abs/2506.22445
Pdf URL: https://arxiv.org/pdf/2506.22445
Copy Paste: [[2506.22445]] Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security(https://arxiv.org/abs/2506.22445)
Keywords: generation
Abstract: Cyber-Physical Systems play a critical role in the infrastructure of various sectors, including manufacturing, energy distribution, and autonomous transportation systems. However, their increasing connectivity renders them highly vulnerable to sophisticated cyber threats, such as adaptive and zero-day attacks, against which traditional security methods like rule-based intrusion detection and single-agent reinforcement learning prove insufficient. To overcome these challenges, this paper introduces a novel Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning (HAMARL) framework. HAMARL employs a hierarchical structure consisting of local agents dedicated to subsystem security and a global coordinator that oversees and optimizes comprehensive, system-wide defense strategies. Furthermore, the framework incorporates an adversarial training loop designed to simulate and anticipate evolving cyber threats, enabling proactive defense adaptation. Extensive experimental evaluations conducted on a simulated industrial IoT testbed indicate that HAMARL substantially outperforms traditional multi-agent reinforcement learning approaches, significantly improving attack detection accuracy, reducing response times, and ensuring operational continuity. The results underscore the effectiveness of combining hierarchical multi-agent coordination with adversarially-aware training to enhance the resilience and security of next-generation CPS.
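Summary: Cyber-physical systems (CPS) are critical to infrastructure in many sectors, including manufacturing, energy distribution, and autonomous transportation. Their growing connectivity, however, leaves them highly vulnerable to sophisticated cyber threats such as adaptive and zero-day attacks, against which traditional security methods like rule-based intrusion detection and single-agent reinforcement learning prove insufficient. To overcome these challenges, the paper introduces a Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning (HAMARL) framework. HAMARL uses a hierarchy of local agents dedicated to subsystem security and a global coordinator that oversees and optimizes system-wide defense strategies. The framework also includes an adversarial training loop that simulates and anticipates evolving cyber threats, enabling proactive defense adaptation. Extensive experiments on a simulated industrial IoT testbed indicate that HAMARL substantially outperforms traditional multi-agent reinforcement learning approaches, improving attack detection accuracy, reducing response times, and ensuring operational continuity. The results underscore the value of combining hierarchical multi-agent coordination with adversarially-aware training to strengthen the resilience and security of next-generation CPS.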

Title: Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization

Authors: Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22463
Pdf URL: https://arxiv.org/pdf/2506.22463
Copy Paste: [[2506.22463]] Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization(https://arxiv.org/abs/2506.22463)
Keywords: generation, generative
Abstract: Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at this https URL.
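Summary: Diffusion models have become powerful generative models, but the high computational cost of iterative sampling remains a major bottleneck. The paper studies state-of-the-art acceleration techniques for diffusion models in depth, including caching and quantization, and reveals their limitations in computation error and generation quality. To overcome these limits, it introduces Modulated Diffusion (MoDiff), a principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff inherits the advantages of existing caching and quantization methods and also serves as a general framework for accelerating all diffusion models. Its advantages are supported by theoretical insight and analysis. Extensive experiments on CIFAR-10 and LSUN show that MoDiff reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). The code is available at this https URL.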

Title: Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models

Authors: Weiyi Zhao, Xiaoyu Tan, Liang Liu, Sijia Li, Youwei Song, Xihe Qiu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22500
Pdf URL: https://arxiv.org/pdf/2506.22500
Copy Paste: [[2506.22500]] Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models(https://arxiv.org/abs/2506.22500)
Keywords: generation
Abstract: Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset (OR-VSKC) comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs' vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at this https URL.
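Summary: Surgical risk identification is critical for patient safety and for reducing preventable medical errors. Multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, but they often exhibit visual-semantic knowledge conflicts (VS-KC): they fail to identify visual safety violations even though they understand the textual rules. To address this, the authors introduce a dataset of more than 34,000 synthetic images generated by diffusion models, depicting operating room scenes that contain entities violating established safety rules. The images were created to alleviate data scarcity and to probe MLLM vulnerabilities. The dataset also includes 214 human-annotated images that serve as a gold-standard reference for validation. Spanning diverse perspectives, stages, and configurations, the dataset is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints of those entities, but performance on untrained entity types remains poor, which highlights learning specificity and the need for comprehensive training. The main contributions are: (1) a data generation methodology tailored to rule-violation scenarios; (2) the release of the OR-VSKC dataset and its benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at this https URL.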

Title: Weakly Supervised Object Segmentation by Background Conditional Divergence

Authors: Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22505
Pdf URL: https://arxiv.org/pdf/2506.22505
Copy Paste: [[2506.22505]] Weakly Supervised Object Segmentation by Background Conditional Divergence(https://arxiv.org/abs/2506.22505)
Keywords: generative
Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds onto backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The code base for this work can be found on GitHub at this https URL.
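Summary: Automatic object segmentation remains challenging in specialized image domains that lack massive labeled data, such as synthetic aperture sonar, remote sensing, and biomedical imaging, and pixel-wise segmentation masks are expensive to obtain in any domain. The paper proposes a method for training a masking network to perform binary object segmentation under weak supervision, namely image-level labels indicating whether an object of interest is present, which carry less information but can be obtained more quickly from manual or automatic labeling. A key step is that segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between original and counterfactual backgrounds, the background-only images are first clustered, and during learning counterfactual images are created by blending objects segmented from their source backgrounds onto backgrounds chosen from a targeted cluster. One term of the training loss is the divergence between these counterfactual images and real object images whose backgrounds belong to the target cluster; the other is a supervised loss on background-only images. Although an adversarial critic could supply the divergence, the authors use sample-based divergences. In experiments on side-scan and synthetic aperture sonar, the approach succeeds compared with previous unsupervised segmentation baselines that had been tested only on natural images. To show generality, the experiments are extended to natural images, where the method obtains reasonable performance while avoiding pretrained networks, generative networks, and adversarial critics. The code base can be found on GitHub at this https URL.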

Title: Lightning the Night with Generative Artificial Intelligence

Authors: Tingting Zhou, Feng Zhang, Haoyang Fu, Baoxiang Pan, Renhe Zhang, Feng Lu, Zhixin Yang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22511
Pdf URL: https://arxiv.org/pdf/2506.22511
Copy Paste: [[2506.22511]] Lightning the Night with Generative Artificial Intelligence(https://arxiv.org/abs/2506.22511)
Keywords: generative
Abstract: The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance retrieval model, called Reflectance Diffusion (RefDiff), which enables retrieval of visible light reflectance in the 0.47 μm, 0.65 μm, and 0.825 μm bands at night. Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model's nighttime retrieval capability was validated using VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to retrieve visible light reflectance at night, with the potential to expand the application of nighttime visible light data.
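Summary: Visible light reflectance data from geostationary satellites is crucial for meteorological observation and plays an important role in weather monitoring and forecasting. Because there is no visible light at night, however, continuous all-day weather observation with visible light reflectance data is impossible. This study pioneers the use of generative diffusion models to address the limitation. Using multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, the authors develop a high-precision visible light reflectance retrieval model, Reflectance Diffusion (RefDiff), which retrieves visible light reflectance in the 0.47 μm, 0.65 μm, and 0.825 μm bands at night. Compared with classical models, RefDiff significantly improves accuracy through ensemble averaging and also provides uncertainty estimation. Its SSIM index can reach 0.90, with particularly significant improvements in areas of complex cloud structure and thick cloud. The model's nighttime retrieval capability was validated against the VIIRS nighttime product and is comparable to its daytime performance. Overall, the work makes substantial progress in retrieving visible light reflectance at night and could expand the application of nighttime visible light data.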
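
The abstract above says accuracy improves through ensemble averaging and that the same ensemble yields an uncertainty estimate. A minimal sketch of that idea follows, assuming each ensemble member is one sampled reflectance retrieval flattened to a list of pixel values; the function name `ensemble_summary` and the sample numbers are hypothetical and do not come from the paper.

```python
import statistics

def ensemble_summary(members):
    """Per-pixel mean and sample standard deviation across ensemble members.

    members: list of equally sized lists, one per sampled retrieval.
    Needs at least two members, because statistics.stdev requires it.
    """
    per_pixel = list(zip(*members))            # regroup values pixel by pixel
    mean = [statistics.fmean(p) for p in per_pixel]
    spread = [statistics.stdev(p) for p in per_pixel]
    return mean, spread

# Three hypothetical sampled retrievals of a three-pixel scene.
samples = [
    [0.30, 0.52, 0.10],
    [0.34, 0.48, 0.10],
    [0.32, 0.50, 0.10],
]
mean, spread = ensemble_summary(samples)
print(mean)    # ensemble-averaged reflectance estimate
print(spread)  # larger values mark pixels where the samples disagree
```

The per-pixel spread is only one simple choice of uncertainty measure; the paper may define its uncertainty estimate differently.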

Title: Preserve Anything: Controllable Image Synthesis with Object Preservation

Authors: Prasen Kumar Sharma, Neeraj Matiyali, Siddharth Srivastava, Gaurav Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22531
Pdf URL: https://arxiv.org/pdf/2506.22531
Copy Paste: [[2506.22531]] Preserve Anything: Controllable Image Synthesis with Object Preservation(https://arxiv.org/abs/2506.22531)
Keywords: generation
Abstract: We introduce Preserve Anything, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail to (i) preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions. Key components of our framework include object preservation and background guidance modules that enforce lighting consistency, and a high-frequency overlay module that retains fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and allows a complete evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. We also conducted a user study to demonstrate the efficacy of the proposed work on an unseen benchmark and observed a remarkable improvement of ~25%, ~19%, ~13%, and ~14% in terms of prompt alignment, photorealism, the presence of AI artifacts, and natural aesthetics over existing works.
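Summary: The paper introduces Preserve Anything, a method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail to (i) preserve multiple objects faithfully, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the method employs an N-channel ControlNet that integrates (i) object preservation that is agnostic to size and placement, retains color and detail, and eliminates artifacts, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layout and lighting conditions. Key components include object preservation and background guidance modules that enforce lighting consistency, and a high-frequency overlay module that retains fine detail while mitigating unwanted artifacts. The authors also introduce a benchmark dataset of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. The dataset addresses deficiencies of existing benchmarks and allows a complete evaluation. Empirically, the method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. A user study on an unseen benchmark shows improvements of about 25%, 19%, 13%, and 14% in prompt alignment, photorealism, presence of AI artifacts, and natural aesthetics, respectively, over existing works.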

Title: ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models

Authors: Sotirios Panagiotis Chytas, Miso Choi, Hyunwoo J. Kim, Vikas Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22636
Pdf URL: https://arxiv.org/pdf/2506.22636
Copy Paste: [[2506.22636]] ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models(https://arxiv.org/abs/2506.22636)
Keywords: generation
Abstract: Vision Language Models (VLMs) show impressive capabilities in integrating and reasoning with both visual and language data. But these models make mistakes. A common finding -- similar to LLMs -- is their tendency to hallucinate, i.e., generate plausible-sounding text which is not grounded in the visual input, or at worst, is contradictory. A growing consensus attributes this behavior to an over-reliance on language -- especially as the generation progresses, the model suffers from a "fading memory effect" with respect to the provided visual input. We study mechanisms by which this behavior can be controlled. Specifically, using ideas from geometric algebra and relational compositions, we propose the addition of a small, trainable module (named ReCo) on top of any VLM -- no other modification is needed. We show that such a lightweight module is able to mitigate the fading memory effect on three of the most widely used VLMs (InstructBLIP, LLaVA, MiniGPT4), where we see performance improvements on multiple benchmarks. Additionally, we show that our module can be combined with many of the other approaches for reducing hallucination, achieving improved results for each one.
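Summary: Vision Language Models (VLMs) show impressive capabilities in integrating and reasoning over visual and language data, but they make mistakes. A common finding, as with LLMs, is their tendency to hallucinate, that is, to generate plausible-sounding text that is not grounded in the visual input or, at worst, contradicts it. A growing consensus attributes this behavior to over-reliance on language: as generation progresses, the model suffers from a "fading memory effect" with respect to the provided visual input. The paper studies mechanisms for controlling this behavior. Using ideas from geometric algebra and relational compositions, the authors propose adding a small, trainable module (named ReCo) on top of any VLM, with no other modification required. This lightweight module mitigates the fading memory effect on three of the most widely used VLMs (InstructBLIP, LLaVA, MiniGPT4), with performance improvements on multiple benchmarks. The module can also be combined with many other hallucination-reduction approaches, improving the results of each.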

Title: 3D Shape Generation: A Survey

Authors: Nicolas Caytuiro, Ivan Sipiran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22678
Pdf URL: https://arxiv.org/pdf/2506.22678
Copy Paste: [[2506.22678]] 3D Shape Generation: A Survey(https://arxiv.org/abs/2506.22678)
Keywords: generation, generative
Abstract: Recent advances in deep learning have significantly transformed the field of 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey provides a comprehensive overview of the current state of the art in 3D shape generation, organizing the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. We begin by categorizing 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. Next, we review a wide range of generation methods, focusing on feedforward architectures. We further summarize commonly used datasets and evaluation metrics that assess fidelity, diversity, and realism of generated shapes. Finally, we identify open challenges and outline future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. This survey aims to serve as a valuable reference for researchers and practitioners seeking a structured and in-depth understanding of this rapidly evolving field.
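Summary: Recent advances in deep learning have significantly transformed 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey gives a comprehensive overview of the state of the art in 3D shape generation, organized around three core components: shape representations, generative modeling approaches, and evaluation protocols. It first categorizes 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. It then reviews a wide range of generation methods, focusing on feedforward architectures, and summarizes commonly used datasets and evaluation metrics that assess the fidelity, diversity, and realism of generated shapes. Finally, it identifies open challenges and outlines future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. The survey aims to be a useful reference for researchers and practitioners seeking a structured, in-depth understanding of this rapidly evolving field.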

Title: Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment

Authors: Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2506.22685
Pdf URL: https://arxiv.org/pdf/2506.22685
Copy Paste: [[2506.22685]] Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment(https://arxiv.org/abs/2506.22685)
Keywords: generative
Abstract: In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V^*$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V^*$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V^*$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V^*$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at this https URL.
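Summary: The paper investigates the semantic collapsing problem in generative personalization, an under-explored topic in which the learned visual concept ($V^*$) gradually drifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. The issue reduces the semantic richness of complex prompts such as "a photo of $V^*$ wearing glasses and playing guitar" to simpler, less contextually rich forms such as "a photo of $V^*$", and it leads to simplified output images that fail to capture the intended concept. The authors identify unconstrained optimisation as the root cause, since it lets the learned embedding $V^*$ drift arbitrarily in the embedding space in both direction and magnitude. To address this, they propose a simple yet effective training-free method that adjusts the magnitude and direction of the pre-trained embedding at inference time, effectively mitigating semantic collapse. The method applies broadly across personalization methods and significantly improves text-image alignment in diverse use cases. The code is anonymously published at this https URL.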
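
The abstract above states only that the method adjusts the magnitude and direction of the learned embedding at inference time; it does not give the formula. One way such an adjustment could look is sketched below, assuming the learned embedding is rescaled to the norm of a reference embedding (for example the embedding of the concept's class word) and its direction is moved toward the reference by spherical interpolation. The function `adjust_embedding`, the parameter `alpha`, and the choice of spherical interpolation are assumptions for illustration, not the paper's stated procedure.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def adjust_embedding(learned, reference, alpha=0.5):
    """Rescale `learned` to the norm of `reference` and rotate its direction
    toward `reference` by a fraction `alpha` of the angle between them.

    Assumes both vectors are non-zero. alpha=0 keeps the learned direction,
    alpha=1 adopts the reference direction.
    """
    n_l, n_r = norm(learned), norm(reference)
    u = [x / n_l for x in learned]      # unit direction of the learned embedding
    r = [x / n_r for x in reference]    # unit direction of the reference
    cos_t = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, r))))
    theta = math.acos(cos_t)
    s = math.sin(theta)
    if s < 1e-8:
        # Directions are (anti)parallel: interpolation is ill-defined, keep u.
        direction = u
    else:
        w_u = math.sin((1 - alpha) * theta) / s
        w_r = math.sin(alpha * theta) / s
        direction = [w_u * a + w_r * b for a, b in zip(u, r)]
    return [n_r * d for d in direction]

# A learned embedding that has drifted to a large norm, and a unit-norm reference.
adjusted = adjust_embedding([10.0, 0.0], [0.0, 1.0], alpha=0.5)
print(adjusted, norm(adjusted))
```

Because the adjustment touches only the embedding vector, it needs no retraining, which matches the training-free framing in the abstract.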

Title: LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning

Authors: Jiang Yuan, JI Ma, Bo Wang, Guanzhou Ke, Weiming Hu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22710
Pdf URL: https://arxiv.org/pdf/2506.22710
Copy Paste: [[2506.22710]] LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning(https://arxiv.org/abs/2506.22710)
Keywords: restoration, super-resolution
Abstract: Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve effect, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inferencing. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks. Our code is accessible at: this https URL.
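Summary: Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide the restoration of HR detail. Although IDE-BSR has shown potential in handling noise interference and complex degradations, existing methods overlook the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve results, which significantly increases model parameters and computation. The paper focuses on optimizing IDR discriminability and proposes a powerful, lightweight BSR model called LightBSR. It uses a knowledge distillation-based learning framework. In the teacher stage, a degradation-prior-constrained contrastive learning technique makes the model focus on distinguishing different degradation types. A feature alignment technique then transfers the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. LightBSR achieves outstanding performance with minimal complexity across a range of blind SR tasks. The code is accessible at: this https URL.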

Title: UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments

Authors: Dayong Su, Yafei Zhang, Huafeng Li, Jinxing Li, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22736
Pdf URL: https://arxiv.org/pdf/2506.22736
Copy Paste: [[2506.22736]] UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments(https://arxiv.org/abs/2506.22736)
Keywords: restoration
Abstract: Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration & Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN's adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method's effectiveness and significant advantages over existing approaches.
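Summary: Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness depends heavily on these conditions and often deteriorates on misaligned or degraded medical images. To address this, the authors propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse integrates multi-directional information from the input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. The authors also design an Omni Unified Feature Representation scheme, which uses Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion in an All-in-One configuration, they propose a Universal Feature Restoration & Fusion module that incorporates the Adaptive LoRA Synergistic Network (ALSN), based on LoRA principles. Using ALSN's adaptive feature representation together with degradation-type guidance, joint restoration and fusion are performed within a single-stage framework. Compared with staged approaches, UniFuse unifies alignment, restoration, and fusion in a single framework. Experimental results on multiple datasets demonstrate the method's effectiveness and significant advantages over existing approaches.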

Title: Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Authors: Jianing Zhang, Jiayi Zhu, Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22753
Pdf URL: https://arxiv.org/pdf/2506.22753
Copy Paste: [[2506.22753]] Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography(https://arxiv.org/abs/2506.22753)
Keywords: restoration, generation
Abstract: Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside \textit{pseudo} data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: this https URL.
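Summary: Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and difficult computational restoration. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial to obtain for real-world imaging systems. In addition, a lack of control over the inference process often produces undesirable hallucinated artifacts. The paper introduces Degradation-Modeled Multipath Diffusion for tunable metalens photography, which leverages powerful natural image priors from pretrained models instead of large datasets. The framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. A spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, the authors design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that the approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: this https URL.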

Title: VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Authors: Dinh Phu Tran, Dao Duy Hung, Daeyoung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22762
Pdf URL: https://arxiv.org/pdf/2506.22762
Copy Paste: [[2506.22762]] VSRM: A Robust Mamba-Based Framework for Video Super-Resolution(https://arxiv.org/abs/2506.22762)
Keywords: super-resolution
Abstract: Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
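The abstract does not give the exact form of the Frequency Charbonnier-like loss. A minimal sketch, assuming it applies the standard Charbonnier penalty sqrt(d^2 + eps^2) to the difference between Fourier coefficients of the reconstruction and the ground truth, might look like the following. The function names, the 1-D toy signals, and the value of `eps` are illustrative assumptions, and a naive DFT stands in for a real FFT.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1-D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frequency_charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(|d|^2 + eps^2) over DFT coefficient differences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    diffs = [p - q for p, q in zip(dft(pred), dft(target))]
    return sum(math.sqrt(abs(d) ** 2 + eps ** 2) for d in diffs) / len(diffs)


# Toy 1-D "frames": the blurry one has lost half of its high-frequency amplitude.
target = [0.0, 1.0, 0.0, -1.0]
blurry = [0.0, 0.5, 0.0, -0.5]
print(frequency_charbonnier_loss(target, target))
print(frequency_charbonnier_loss(blurry, target))
```

For identical inputs the loss bottoms out at `eps` rather than zero, which is the usual behavior of a Charbonnier penalty and keeps the gradient well defined at the optimum.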

Title: Multimodal Atmospheric Super-Resolution With Deep Generative Models

Authors: Dibyajyoti Chakraborty, Haiwen Guan, Jason Stock, Troy Arcomano, Guido Cervone, Romit Maulik
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2506.22780
Pdf URL: https://arxiv.org/pdf/2506.22780
Copy Paste: [[2506.22780]] Multimodal Atmospheric Super-Resolution With Deep Generative Models(https://arxiv.org/abs/2506.22780)
Keywords: super-resolution, generative
Abstract: Score-based diffusion models are generative machine learning algorithms that can be used to sample from complex distributions. They achieve this by learning a score function, i.e., the gradient of the log-probability density of the data, and using it to reverse a noising process. Once trained, score-based diffusion models not only generate new samples but also enable zero-shot conditioning of the generated samples on observed data. This promises a novel paradigm for data and model fusion, wherein the implicitly learned distributions of pretrained score-based diffusion models can be updated given the availability of online data in a Bayesian formulation. In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements from multimodal data. Additional analysis on how score-based sampling can be used for uncertainty estimates is also provided. Our experiments are performed for a super-resolution task that generates the ERA5 atmospheric dataset given sparse observations from a coarse-grained representation of the same and/or from unstructured experimental observations of the IGRA radiosonde dataset. We demonstrate accurate recovery of the high-dimensional state given multiple sources of low-fidelity measurements. We also discover that the generative model can balance the influence of multiple dataset modalities during spatiotemporal reconstructions.
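The Bayesian conditioning idea can be illustrated in one dimension. In score form, Bayes' rule says the posterior score is the prior score plus the gradient of the log-likelihood of the observation. The sketch below assumes a standard normal prior and a Gaussian observation model, both chosen only so the answer can be checked analytically. It also uses plain unadjusted Langevin sampling rather than the reverse-time diffusion process the abstract refers to. All function names and parameters are hypothetical.

```python
import math
import random


def prior_score(x, mu=0.0, sigma=1.0):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2


def likelihood_score(x, y, noise_std=1.0):
    """d/dx log p(y | x) for an observation y = x + N(0, noise_std^2)."""
    return (y - x) / noise_std ** 2


def posterior_score(x, y):
    """Bayes' rule in score form: prior score plus likelihood score."""
    return prior_score(x) + likelihood_score(x, y)


def langevin_mean(score_fn, steps=5000, step_size=0.1, burn_in=500, seed=0):
    """Unadjusted Langevin dynamics; returns the post-burn-in sample mean."""
    rng = random.Random(seed)
    x, total, count = 0.0, 0.0, 0
    for i in range(steps):
        x += step_size * score_fn(x) + math.sqrt(2 * step_size) * rng.gauss(0.0, 1.0)
        if i >= burn_in:
            total += x
            count += 1
    return total / count


observation = 2.0
estimate = langevin_mean(lambda x: posterior_score(x, observation))
print(estimate)  # should land near the analytic posterior mean of 1.0
```

With a prior of N(0, 1) and unit observation noise, the exact posterior given y = 2 is N(1, 0.5), so the posterior score vanishes at x = 1 and the sampler's mean should settle close to that value.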

Title: PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Authors: Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22783
Pdf URL: https://arxiv.org/pdf/2506.22783
Copy Paste: [[2506.22783]] PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection(https://arxiv.org/abs/2506.22783)
Keywords: generative
Abstract: Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and an open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.

Title: RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Authors: Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22800
Pdf URL: https://arxiv.org/pdf/2506.22800
Copy Paste: [[2506.22800]] RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors(https://arxiv.org/abs/2506.22800)
Keywords: generation
Abstract: A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source code will be made publicly available at this https URL. (Camera-ready version incorporating reviewer suggestions will be updated soon.)

Title: Riemannian-Geometric Fingerprints of Generative Models

Authors: Hae Jin Song, Laurent Itti
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.22802
Pdf URL: https://arxiv.org/pdf/2506.22802
Copy Paste: [[2506.22802]] Riemannian-Geometric Fingerprints of Generative Models(https://arxiv.org/abs/2506.22802)
Keywords: generative
Abstract: Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.

Title: Listener-Rewarded Thinking in VLMs for Image Preferences

Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22832
Pdf URL: https://arxiv.org/pdf/2506.22832
Copy Paste: [[2506.22832]] Listener-Rewarded Thinking in VLMs for Image Preferences(https://arxiv.org/abs/2506.22832)
Keywords: generative
Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: this https URL.
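GRPO scores each sampled response relative to the other responses drawn for the same prompt, so any extra reward term changes which traces are pushed up or down. The sketch below is a rough illustration of that mechanism, not the paper's reward scheme: the additive form of `shaped_reward`, its `weight`, and the toy confidence values are assumptions made for the example.

```python
import statistics


def shaped_reward(is_correct, listener_confidence, weight=0.5):
    """Hypothetical shaping: correctness plus a weighted listener confidence in [0, 1]."""
    return float(is_correct) + weight * listener_confidence


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled reasoning traces for one prompt: (answer correct?, listener confidence).
group = [(True, 0.9), (True, 0.2), (False, 0.6), (False, 0.1)]
rewards = [shaped_reward(correct, conf) for correct, conf in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

In this toy group the first two traces both answer correctly, but the one whose reasoning convinces the listener receives the larger advantage, which is the kind of pressure toward persuasive explanations that the abstract describes.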

Title: SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds

Authors: Shashikant Verma, Shanmuganathan Raman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22833
Pdf URL: https://arxiv.org/pdf/2506.22833
Copy Paste: [[2506.22833]] SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds(https://arxiv.org/abs/2506.22833)
Keywords: generative
Abstract: Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that streamlines the appearance and geometric editing process by generating semantic fields on generative radiance manifolds. Utilizing latent codes, our method effectively disentangles the geometry and appearance associated with different facial semantics within the generated image. In contrast to existing methods that can change the appearance of the entire radiance field, our method enables the precise editing of particular facial semantics while preserving the integrity of other regions. Our network comprises two key modules: the Geometry module, which generates semantic radiance and occupancy fields, and the Appearance module, which is responsible for predicting RGB radiance. We jointly train both modules in adversarial settings to learn semantic-aware geometry and appearance descriptors. The appearance descriptors are then conditioned on their respective semantic latent codes by the Appearance Module, facilitating disentanglement and enhanced control. Our experiments highlight SemFaceEdit's superior performance in semantic field-based editing, particularly in achieving improved radiance field disentanglement.

Title: Neural Cellular Automata: From Cells to Pixels

Authors: Ehsan Pajouheshgar, Yitao Xu, Ali Abbasi, Alexander Mordvintsev, Wenzel Jakob, Sabine Süsstrunk
Subjects: cs.CV, cs.GR, cs.LG, cs.MA, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22899
Pdf URL: https://arxiv.org/pdf/2506.22899
Copy Paste: [[2506.22899]] Neural Cellular Automata: From Cells to Pixels(https://arxiv.org/abs/2506.22899)
Keywords: generation
Abstract: Neural Cellular Automata (NCAs) are bio-inspired systems in which identical cells self-organize to form complex and coherent patterns by repeatedly applying simple local rules. NCAs display striking emergent behaviors including self-regeneration, generalization and robustness to unseen situations, and spontaneous motion. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution grids. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information which impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing NCA with a tiny, shared implicit decoder, inspired by recent advances in implicit neural representations. Following NCA evolution on a coarse grid, a lightweight decoder renders output images at arbitrary resolution. We also propose novel loss functions for both morphogenesis and texture synthesis tasks, specifically tailored for high-resolution output with minimal memory and computation overhead. Combining our proposed architecture and loss functions brings substantial improvement in quality, efficiency, and performance. NCAs equipped with our implicit decoder can generate full-HD outputs in real time while preserving their self-organizing, emergent properties. Moreover, because each MLP processes cell states independently, inference remains highly parallelizable and efficient. We demonstrate the applicability of our approach across multiple NCA variants (on 2D, 3D grids, and 3D meshes) and multiple tasks, including texture generation and morphogenesis (growing patterns from a seed), showing that with our proposed framework, NCAs seamlessly scale to high-resolution outputs with minimal computational overhead.

Title: MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

Authors: Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22900
Pdf URL: https://arxiv.org/pdf/2506.22900
Copy Paste: [[2506.22900]] MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering(https://arxiv.org/abs/2506.22900)
Keywords: generation
Abstract: Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at this https URL.
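The abstract does not say how the optimal transport problem is solved or which features enter the cost. As one possible reading, re-ranking by optimal transport can be sketched with entropic regularization and Sinkhorn scaling, a common solver: each retrieved document gets a transport cost against the query, and documents are sorted by that cost. The cost matrices, the uniform masses, and the names `sinkhorn` and `ot_distance` below are hypothetical.

```python
import math


def sinkhorn(cost, row_mass, col_mass, reg=0.5, iters=200):
    """Entropic optimal transport via Sinkhorn scaling; returns the transport plan."""
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    n, m = len(row_mass), len(col_mass)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_distance(cost, plan):
    """Transport cost <plan, cost>; a lower value means a closer match."""
    return sum(c * p for crow, prow in zip(cost, plan) for c, p in zip(crow, prow))


# Hypothetical pairwise costs between 2 query features and 2 features of each document.
cost_doc_a = [[0.1, 0.9], [0.8, 0.2]]  # well aligned with the query
cost_doc_b = [[0.7, 0.9], [0.8, 0.6]]  # poorly aligned
uniform = [0.5, 0.5]
scores = {
    name: ot_distance(cost, sinkhorn(cost, uniform, uniform))
    for name, cost in (("doc_a", cost_doc_a), ("doc_b", cost_doc_b))
}
ranking = sorted(scores, key=scores.get)
print(ranking)
```

In a multimodal setting the cost entries would presumably come from distances between textual and visual embeddings, which is where grounded captions could enter; that part is left out here.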

Title: Point Cloud Compression and Objective Quality Assessment: A Survey

Authors: Yiling Xu, Yujie Zhang, Shuting Xia, Kaifa Yang, He Huang, Ziyu Shan, Wenjie Huang, Qi Yang, Le Yang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22902
Pdf URL: https://arxiv.org/pdf/2506.22902
Copy Paste: [[2506.22902]] Point Cloud Compression and Objective Quality Assessment: A Survey(https://arxiv.org/abs/2506.22902)
Keywords: quality assessment
Abstract: The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has led to a critical demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA), emphasizing their significance for real-time and perceptually relevant applications. We analyze a wide range of handcrafted and learning-based PCC algorithms, along with objective PCQA metrics. By benchmarking representative methods on emerging datasets, we offer detailed comparisons and practical insights into their strengths and limitations. Despite notable progress, challenges such as enhancing visual fidelity, reducing latency, and supporting multimodal data remain. This survey outlines future directions, including hybrid compression frameworks and advanced feature extraction strategies, to enable more efficient, immersive, and intelligent 3D applications.

Title: Towards Time Series Generation Conditioned on Unstructured Natural Language

Authors: Jaeyun Woo, Jiseok Lee, Brian Kenji Iwana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22927
Pdf URL: https://arxiv.org/pdf/2506.22927
Copy Paste: [[2506.22927]] Towards Time Series Generation Conditioned on Unstructured Natural Language(https://arxiv.org/abs/2506.22927)
Keywords: generation, generative
Abstract: Generative Artificial Intelligence (AI) has rapidly become a powerful tool, capable of generating various types of data, such as images and text. However, despite the significant advancement of generative AI, time series generative AI remains underdeveloped, even though the application of time series is essential in finance, climate, and numerous fields. In this research, we propose a novel method of generating time series conditioned on unstructured natural language descriptions. We use a diffusion model combined with a language model to generate time series from the text. Through the proposed method, we demonstrate that time series generation based on natural language is possible. The proposed method can provide various applications such as custom forecasting, time series manipulation, data augmentation, and transfer learning. Furthermore, we construct and propose a new public dataset for time series generation, consisting of 63,010 time series-description pairs.

Title: Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models

Authors: Liangyu Wang, Huanyi Xie, Xinhai Wang, Tianjin Huang, Mengdi Li, Di Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22950
Pdf URL: https://arxiv.org/pdf/2506.22950
Copy Paste: [[2506.22950]] Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models(https://arxiv.org/abs/2506.22950)
Keywords: generation
Abstract: Group-based reinforcement learning algorithms such as Group Reward Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sampling, a framework that enables efficient and stable GRPO training by decoupling group size from GPU memory usage. It consists of: (1) micro sampling groups that decompose large groups into memory-feasible rounds; (2) continuous sampling that interleaves generation across groups to improve utilization; and (3) a length-aware scheduler combining token-conditioned sequence length prediction with a two-stage plan: global grouping via FPTAS and runtime refill via SJF. Experiments show that our Micro Sampling Groups reduce peak memory usage by over 50% compared to full-group decoding (e.g., from 21.55 GB to 10.64 GB on Qwen3-1.7B). Building on this, Infinite Sampling improves throughput by over 25% compared to the naive micro sampling group method, reducing decoding steps while maintaining full-length completions and memory usage. Our hybrid scheduling ensures efficient and stable GRPO training with larger groups under realistic GPU memory constraints.
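Two of the ingredients named in the abstract are simple enough to sketch: splitting one prompt's sample group into memory-feasible rounds, and ordering pending samples shortest-job-first (SJF) by predicted length. The helper names and the numbers below are illustrative assumptions. The FPTAS-based global grouping, the length predictor, and the interleaving across groups are not modeled.

```python
def micro_groups(group_size, micro_size):
    """Split one prompt's G samples into rounds of at most micro_size samples."""
    if group_size <= 0 or micro_size <= 0:
        raise ValueError("sizes must be positive")
    full, rest = divmod(group_size, micro_size)
    return [micro_size] * full + ([rest] if rest else [])


def sjf_order(predicted_lengths):
    """Shortest-job-first: indices of pending samples, shortest predicted length first."""
    return sorted(range(len(predicted_lengths)), key=lambda i: predicted_lengths[i])


rounds = micro_groups(group_size=32, micro_size=8)
print(rounds)
print(sjf_order([120, 35, 300, 80]))
```

Under the assumption that decoding memory grows with the number of sequences decoded concurrently, peak usage follows the largest round (8 here) rather than the full group size (32), which is the decoupling the abstract describes.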

Title: Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images

Authors: Shreyas Dixit, Ashhar Aziz, Shashwat Bajpai, Vasu Sharma, Aman Chadha, Vinija Jain, Amitava Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22960
Pdf URL: https://arxiv.org/pdf/2506.22960
Copy Paste: [[2506.22960]] Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images(https://arxiv.org/abs/2506.22960)
Keywords: generative
Abstract: A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that "Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality." In response, California's Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.

Title: A Reinforcement Learning Approach for Optimal Control in Microgrids

Authors: Davide Salaorni, Federico Bianchi, Francesco Trovò, Marcello Restelli
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.22995
Pdf URL: https://arxiv.org/pdf/2506.22995
Copy Paste: [[2506.22995]] A Reinforcement Learning Approach for Optimal Control in Microgrids(https://arxiv.org/abs/2506.22995)
Keywords: generation
Abstract: The increasing integration of renewable energy sources (RESs) is transforming traditional power grid networks, which require new approaches for managing decentralized energy production and consumption. Microgrids (MGs) provide a promising solution by enabling localized control over energy generation, storage, and distribution. This paper presents a novel reinforcement learning (RL)-based methodology for optimizing microgrid energy management. Specifically, we propose an RL agent that learns optimal energy trading and storage policies by leveraging historical data on energy production, consumption, and market prices. A digital twin (DT) is used to simulate the energy storage system dynamics, incorporating degradation factors to ensure a realistic emulation of the analysed setting. Our approach is validated through an experimental campaign using real-world data from a power grid located in the Italian territory. The results indicate that the proposed RL-based strategy outperforms rule-based methods and existing RL benchmarks, offering a robust solution for intelligent microgrid management.

Title: Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation

Authors: Xinrong Hu, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23038
Pdf URL: https://arxiv.org/pdf/2506.23038
Copy Paste: [[2506.23038]] Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2506.23038)
Keywords: generation
Abstract: Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability to generate high-quality in-domain images with low overhead, and adapts the sampling process for the inpainting task without the need for retraining. Specifically, given an image and label mask pair, we crop the area labeled as foreground and condition on it during the reverse denoising process at every noise level. The masked background area is gradually filled in, and all generated images are paired with the label mask. This approach ensures an accurate match between synthetic images and label masks, setting it apart from existing dataset generation methods. The generated images serve as valuable supervision for training downstream segmentation models, effectively addressing the challenge of limited annotations. We conducted extensive evaluations of our data augmentation method on four public medical image segmentation datasets, including CT, MRI, and skin imaging. Results across all datasets demonstrate that AugPaint outperforms state-of-the-art label-efficient methodologies, significantly improving segmentation performance.
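Conditioning on the labeled foreground at every noise level can be sketched as follows. After each reverse step, the foreground pixels are overwritten with the known image noised to the current level, so only the background is actually generated. This is a toy illustration in pixel space with a made-up `toy_denoise_step` standing in for a pretrained latent diffusion model; the function names, the noise schedule, and the 4-pixel "image" are all assumptions.

```python
import random


def toy_denoise_step(x_t, shrink=0.5):
    """Stand-in for a diffusion model's reverse step (hypothetical): pull pixels toward 0.5."""
    return [0.5 + shrink * (x - 0.5) for x in x_t]


def inpaint(known_image, foreground_mask, noise_levels, seed=0):
    """Generate background while re-imposing the labeled foreground at every noise level."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in known_image]  # start from pure noise
    for noise_std in noise_levels:  # decreasing, ending at 0.0
        x = toy_denoise_step(x)
        x = [
            k + noise_std * rng.gauss(0.0, 1.0) if is_fg else pixel
            for pixel, k, is_fg in zip(x, known_image, foreground_mask)
        ]
    return x


image = [0.9, 0.8, 0.1, 0.2]
mask = [True, True, False, False]  # first two pixels carry the foreground label
sample = inpaint(image, mask, noise_levels=[1.0, 0.5, 0.1, 0.0])
print(sample)
```

Because the final noise level is zero, the foreground pixels of the output equal the original image exactly, so the original label mask remains valid for the synthetic sample.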

Title: Ovis-U1 Technical Report

Authors: Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23044
Pdf URL: https://arxiv.org/pdf/2506.23044
Copy Paste: [[2506.23044]] Ovis-U1 Technical Report(https://arxiv.org/abs/2506.23044)
Keywords: generation
Abstract: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

Title: Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction

Authors: Hanlin Dong, Arian Prabowo, Hao Xue, Flora D. Salim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23053
Pdf URL: https://arxiv.org/pdf/2506.23053
Copy Paste: [[2506.23053]] Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction(https://arxiv.org/abs/2506.23053)
Keywords: restoration, generative
Abstract: Air quality prediction is a challenging forecasting task due to its spatio-temporal complexity and its inherent dynamics and uncertainty. Most current models handle these two challenges by applying Graph Neural Networks or known physics principles, and by quantifying stochasticity through probabilistic networks such as diffusion models. Nevertheless, finding the right balance between certainty and uncertainty remains an open question. Therefore, we propose Double-Diffusion, a novel diffusion probabilistic model that harnesses the power of known physics to guide air quality forecasting with stochasticity. To the best of our knowledge, while there are precedents for using conditional diffusion models to predict air pollution, this is the first attempt to use physics as a conditional generative approach for air quality prediction. Along with a sampling strategy adopted from image restoration and a new denoiser architecture, Double-Diffusion ranks first in most evaluation scenarios across two real-life datasets compared with other probabilistic models. It also cuts inference time by 50% to 30% while enjoying an increase between 3-12% in Continuous Ranked Probabilistic Score (CRPS).
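For readers unfamiliar with CRPS, the following is a minimal sketch of the standard sample-based estimator, CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are independent draws from the forecast and y is the observation. It is a generic illustration, not code from the paper; the function name `crps_ensemble` is hypothetical. CRPS is negatively oriented: lower values indicate a better probabilistic forecast.

```python
def crps_ensemble(samples, observation):
    """Empirical CRPS of an ensemble forecast against one observed value."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    # Mean absolute error between ensemble members and the observation.
    term_obs = sum(abs(x - observation) for x in samples) / n
    # Half the mean absolute difference between all pairs of members.
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term_obs - term_spread


if __name__ == "__main__":
    print(crps_ensemble([0.0, 2.0], 1.0))  # 0.5
    print(crps_ensemble([3.0], 1.0))       # 2.0, reduces to absolute error
```

With a single ensemble member the spread term vanishes and the score reduces to the absolute error, which makes CRPS directly comparable with deterministic forecasts.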

Title: Learning Counterfactually Decoupled Attention for Open-World Model Attribution

Authors: Yu Zheng, Boyang Gong, Fanye Kong, Yueqi Duan, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jiwen Lu, Jie Zhou
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23074
Pdf URL: https://arxiv.org/pdf/2506.23074
Copy Paste: [[2506.23074]] Learning Counterfactually Decoupled Attention for Open-World Model Attribution(https://arxiv.org/abs/2506.23074)
Keywords: generation
Abstract: In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: this https URL.

Title: Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval

Authors: Sophie Zhou, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23132
Pdf URL: https://arxiv.org/pdf/2506.23132
Copy Paste: [[2506.23132]] Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval(https://arxiv.org/abs/2506.23132)
Keywords: generative
Abstract: Art plagiarism detection plays a crucial role in protecting artists' copyrights and intellectual property, yet it remains a challenging problem in forensic analysis. In this paper, we address the task of recognizing plagiarized paintings and explaining the detected plagiarism by retrieving visually similar authentic artworks. To support this study, we construct a dataset by collecting painting photos and synthesizing plagiarized versions using generative AI, tailored to specific artists' styles. We first establish a baseline approach using off-the-shelf features from the visual foundation model DINOv2 to retrieve the most similar images in the database and classify plagiarism based on a similarity threshold. Surprisingly, this non-learned method achieves a high recognition accuracy of 97.2% but suffers from low retrieval precision (29.0% average precision, AP). To improve retrieval quality, we finetune DINOv2 with a metric learning loss using positive and negative sample pairs sampled from the database. The finetuned model greatly improves retrieval performance, by 12% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7%). We conclude with insightful discussions and outline directions for future research.
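The baseline described here (retrieve the most similar database images, then flag plagiarism with a similarity threshold) can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors stand in for DINOv2 embeddings, cosine similarity is assumed as the similarity measure, and the names `cosine`, `retrieve_and_classify`, `threshold`, and `top_k` are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_and_classify(query, database, threshold=0.8, top_k=3):
    """Rank database items by similarity and flag the query if the best
    match reaches the threshold.

    database maps an artwork name to its (precomputed) feature vector.
    Returns (is_plagiarized, [(name, similarity), ...]).
    """
    scored = sorted(
        ((cosine(query, feat), name) for name, feat in database.items()),
        reverse=True,
    )
    top = scored[:top_k]
    is_plagiarized = bool(top) and top[0][0] >= threshold
    return is_plagiarized, [(name, score) for score, name in top]


if __name__ == "__main__":
    db = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    flagged, ranked = retrieve_and_classify([1.0, 0.1], db)
    print(flagged, [name for name, _ in ranked])
```

The returned ranking doubles as the explanation: the retrieved authentic artworks are the evidence shown for a flagged query.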

Title: RoboScape: Physics-informed Embodied World Model

Authors: Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.23135
Pdf URL: https://arxiv.org/pdf/2506.23135
Copy Paste: [[2506.23135]] RoboScape: Physics-informed Embodied World Model(https://arxiv.org/abs/2506.23135)
Keywords: generation
Abstract: World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: this https URL.

Title: VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

Authors: Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23138
Pdf URL: https://arxiv.org/pdf/2506.23138
Copy Paste: [[2506.23138]] VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis(https://arxiv.org/abs/2506.23138)
Keywords: generative
Abstract: Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. In particular, VisualPrompter utilizes an automatic self-reflection module to identify the missing concepts in generated images and a target-specific prompt optimization mechanism to revise the prompts in a fine-grained manner. Extensive experiments demonstrate the effectiveness of our VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.

Title: AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation

Authors: Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23150
Pdf URL: https://arxiv.org/pdf/2506.23150
Copy Paste: [[2506.23150]] AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation(https://arxiv.org/abs/2506.23150)
Keywords: generation
Abstract: Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.

Title: Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data

Authors: Chen Gong, Bo Liang, Wei Gao, Chenren Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23174
Pdf URL: https://arxiv.org/pdf/2506.23174
Copy Paste: [[2506.23174]] Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data(https://arxiv.org/abs/2506.23174)
Keywords: generative
Abstract: Generative models have gained significant attention for their ability to produce realistic synthetic data that supplements the quantity of real-world datasets. While recent studies show performance improvements in wireless sensing tasks by incorporating all synthetic data into training sets, the quality of synthetic data remains unpredictable and the resulting performance gains are not guaranteed. To address this gap, we propose tractable and generalizable metrics to quantify quality attributes of synthetic data - affinity and diversity. Our assessment reveals prevalent affinity limitation in current wireless synthetic data, leading to mislabeled data and degraded task performance. We attribute the quality limitation to generative models' lack of awareness of untrained conditions and domain-specific processing. To mitigate these issues, we introduce SynCheck, a quality-guided synthetic data utilization scheme that refines synthetic data quality during task model training. Our evaluation demonstrates that SynCheck consistently outperforms quality-oblivious utilization of synthetic data, and achieves 4.3% performance improvement even when the previous utilization degrades performance by 13.4%.

Title: Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data

Authors: Robert Frank, Michael Widrich, Rahmad Akbar, Günter Klambauer, Geir Kjetil Sandve, Philippe A. Robert, Victor Greiff
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23182
Pdf URL: https://arxiv.org/pdf/2506.23182
Copy Paste: [[2506.23182]] Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data(https://arxiv.org/abs/2506.23182)
Keywords: generative
Abstract: Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.
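GAMA is described as building on Integrated Gradients. As background, here is a minimal sketch of plain Integrated Gradients using a midpoint Riemann sum along the straight path from a baseline to the input. It is not GAMA itself; the function name `integrated_gradients`, the user-supplied `grad_fn`, and the toy function in the usage example are assumptions made for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions for input x.

    grad_fn(point) must return the gradient of the model output with
    respect to each input feature at `point` (a list of floats).
    """
    n = len(x)
    grad_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            grad_sum[i] += grad[i]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]


if __name__ == "__main__":
    # Toy model f(x) = x0^2 + 3*x1 with analytic gradient [2*x0, 3].
    f = lambda p: p[0] ** 2 + 3.0 * p[1]
    grad_f = lambda p: [2.0 * p[0], 3.0]
    attributions = integrated_gradients(grad_f, [2.0, 1.0], [0.0, 0.0])
    print(attributions, sum(attributions), f([2.0, 1.0]) - f([0.0, 0.0]))
```

The usage example checks the completeness property: the attributions sum (up to numerical error) to the difference between the model output at the input and at the baseline.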

Title: BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion

Authors: Dequan Kong, Zhe Zhu, Honghua Chen, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23205
Pdf URL: https://arxiv.org/pdf/2506.23205
Copy Paste: [[2506.23205]] BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion(https://arxiv.org/abs/2506.23205)
Keywords: generation
Abstract: Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via latent diffusion Schrödinger bridge. The key innovations lie in two aspects: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient and high-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.

Title: Single Image Inpainting and Super-Resolution with Simultaneous Uncertainty Guarantees by Universal Reproducing Kernels

Authors: Bálint Horváth, Balázs Csanád Csáji
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23221
Pdf URL: https://arxiv.org/pdf/2506.23221
Copy Paste: [[2506.23221]] Single Image Inpainting and Super-Resolution with Simultaneous Uncertainty Guarantees by Universal Reproducing Kernels(https://arxiv.org/abs/2506.23221)
Keywords: super-resolution
Abstract: The paper proposes a statistical learning approach to the problem of estimating missing pixels of images, crucial for image inpainting and super-resolution problems. One of the main novelties of the method is that it also provides uncertainty quantifications together with the estimated values. Our core assumption is that the underlying data-generating function comes from a Reproducing Kernel Hilbert Space (RKHS). A special emphasis is put on band-limited functions, central to signal processing, which form Paley-Wiener type RKHSs. The proposed method, which we call Simultaneously Guaranteed Kernel Interpolation (SGKI), is an extension and refinement of a recently developed kernel method. An advantage of SGKI is that it not only estimates the missing pixels, but also builds non-asymptotic confidence bands for the unobserved values, which are simultaneously guaranteed for all missing pixels. We also show how to compute these bands efficiently using Schur complements, we discuss a generalization to vector-valued functions, and we present a series of numerical experiments on various datasets containing synthetically generated and benchmark images, as well.

Title: High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation

Authors: Lunhao Duan, Shanshan Zhao, Xingxing Weng, Jing Zhang, Gui-Song Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23227
Pdf URL: https://arxiv.org/pdf/2506.23227
Copy Paste: [[2506.23227]] High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation(https://arxiv.org/abs/2506.23227)
Keywords: generation
Abstract: This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene-level annotation demonstrate its effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach's individual components. The code is available at this https URL.
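The region-voting step mentioned in this abstract can be illustrated with a simple majority vote. This is a sketch under assumptions, not the paper's module: it assumes each point already has a region id (for example from an over-segmentation) and a predicted label, and that the regional semantic is the most frequent point label in the region. The name `region_vote` is hypothetical.

```python
from collections import Counter


def region_vote(point_labels, region_ids):
    """Replace each point label with the majority label of its region."""
    votes = {}
    for label, region in zip(point_labels, region_ids):
        votes.setdefault(region, Counter())[label] += 1
    # Most frequent label per region (ties resolved by first occurrence).
    region_label = {r: c.most_common(1)[0][0] for r, c in votes.items()}
    return [region_label[r] for r in region_ids]


if __name__ == "__main__":
    labels = ["wall", "wall", "chair", "floor", "floor"]
    regions = [0, 0, 0, 1, 1]
    print(region_vote(labels, regions))
    # ['wall', 'wall', 'wall', 'floor', 'floor']
```

In the example, the single "chair" prediction inside region 0 is overruled by the two "wall" votes, which is the kind of correction the regional semantics are meant to provide to point-level predictions.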

Title: PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution

Authors: Aradhana Mishra, Bumshik Lee
Subjects: cs.CV, cs.AI, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2506.23254
Pdf URL: https://arxiv.org/pdf/2506.23254
Copy Paste: [[2506.23254]] PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution(https://arxiv.org/abs/2506.23254)
Keywords: super-resolution, generation
Abstract: Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference time is reduced by decreasing the number of sampling steps, resulting in less realistic, hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly in texture and edge definition. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), as well as visual quality. To assess edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.
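The abstract does not specify the sigmoidal noise sequencing method. As a loose illustration of what a sigmoid-shaped noise schedule can look like in a diffusion model, the sketch below maps evenly spaced points through a logistic function to obtain per-step noise variances. Everything here is an assumption: the function name `sigmoid_beta_schedule`, the parameters `beta_start`, `beta_end`, and `span`, and their default values are hypothetical and not taken from PixelBoost.

```python
import math


def sigmoid_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2, span=6.0):
    """Noise variances that follow a logistic curve from beta_start to beta_end.

    Steps are mapped to evenly spaced points in [-span, span]; a larger span
    gives flatter ends and a sharper transition in the middle.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be positive")
    betas = []
    for i in range(num_steps):
        u = -span + 2.0 * span * i / (num_steps - 1) if num_steps > 1 else 0.0
        s = 1.0 / (1.0 + math.exp(-u))  # logistic function in (0, 1)
        betas.append(beta_start + (beta_end - beta_start) * s)
    return betas


if __name__ == "__main__":
    for beta in sigmoid_beta_schedule(5):
        print(f"{beta:.6f}")
```

Compared with a linear schedule, a curve of this shape adds noise slowly at the start and end and more quickly in the middle of the sequence.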

Title: Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Authors: Lei-lei Li, Jianwu Fang, Junbin Xiao, Shanmin Pang, Hongkai Yu, Chen Lv, Jianru Xue, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23263
Pdf URL: https://arxiv.org/pdf/2506.23263
Copy Paste: [[2506.23263]] Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis(https://arxiv.org/abs/2506.23263)
Keywords: generation
Abstract: Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate the capability test to respond to unaffordable accidents in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.

Title: Why Settle for One? Text-to-ImageSet Generation and Evaluation

Authors: Chengyou Jia, Xin Shen, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W. Tsang, Minnan Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23275
Pdf URL: https://arxiv.org/pdf/2506.23275
Copy Paste: [[2506.23275]] Why Settle for One? Text-to-ImageSet Generation and Evaluation(https://arxiv.org/abs/2506.23275)
Keywords: generation
Abstract: Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project at this https URL.

Title: Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

Authors: Hanwen Zhang, Congqi Cao, Qinyi Lv, Lingtong Min, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23282
Pdf URL: https://arxiv.org/pdf/2506.23282
Copy Paste: [[2506.23282]] Autoregressive Denoising Score Matching is a Good Video Anomaly Detector(https://arxiv.org/abs/2506.23282)
Keywords: generative
Abstract: Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is attracting growing interest, as it can model the distribution of normal data and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to anomalies located in local modes near the learned distribution. To handle these "unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. By autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference and aggregate it with the score function for enhanced appearance perception and to accumulate the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.
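The abstract builds on denoising score matching. As a rough illustration of that general technique only (not the paper's noise-conditioned transformer), the toy sketch below fits a one-parameter linear score model on 1-D Gaussian data. The function name `fit_linear_score`, the linear model, and the data distribution are all assumptions made for this example. For data drawn from N(0, 1) and noise level sigma, the noisy data follow N(0, 1 + sigma^2), so the true score slope is -1 / (1 + sigma^2), which the fit should approximately recover.

```python
import random


def fit_linear_score(data, sigma, rng):
    """Fit s(x_noisy) = a * x_noisy by minimising the denoising score matching loss.

    For Gaussian perturbation x_noisy = x + sigma * eps, the regression target
    is the score of the noise kernel: -(x_noisy - x) / sigma**2.
    With a linear model, the least-squares minimiser has a closed form.
    """
    num = 0.0
    den = 0.0
    for x in data:
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)
        target = -(x_noisy - x) / sigma ** 2
        num += x_noisy * target
        den += x_noisy * x_noisy
    return num / den


# Toy setup (hypothetical): clean data from a standard normal distribution.
rng = random.Random(0)
sigma = 0.5
data = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

a_hat = fit_linear_score(data, sigma, rng)
a_true = -1.0 / (1.0 + sigma ** 2)  # analytic score slope of N(0, 1 + sigma^2)
print(f"estimated slope {a_hat:.3f}, analytic slope {a_true:.3f}")
```

The point of the sketch is that the model never sees the true score; regressing onto the noise-kernel score is enough to recover the score of the perturbed data distribution.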

Title: Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis

Authors: Zelin Zang, WenZhe Li, Fei Chen, Yongjie Xu, Chang Yu, Zhen Lei, Stan Z. Li
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23287
Pdf URL: https://arxiv.org/pdf/2506.23287
Copy Paste: [[2506.23287]] Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis(https://arxiv.org/abs/2506.23287)
Keywords: generation, generative
Abstract: In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding complex biological processes. Key to this is the modeling and generation of hierarchical data that represents the intrinsic structure within datasets. Traditional methods face limitations in terms of computational cost, performance, generative capacity, and stability. Recent VAE-based approaches have made strides in addressing these challenges but still require specialized network modules for each tree branch, limiting their stability and ability to capture deep hierarchical relationships. To overcome these challenges, we introduce a diffusion-based approach called HDTree. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and quantized diffusion processes to model tree node transitions. This method improves stability by eliminating branch-specific modules and enhances generative capacity through gradual hierarchical changes simulated by the diffusion process. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in terms of accuracy and performance. These contributions provide a new tool for hierarchical lineage analysis, enabling more accurate and efficient modeling of cellular differentiation paths and offering insights for downstream biological tasks. The code of HDTree is available via an anonymous link at this https URL.

Title: DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios

Authors: Changtao Miao, Yi Zhang, Weize Gao, Man Luo, Weiwei Feng, Zhiya Tan, Jianshu Li, Ajian Liu, Yunfeng Diao, Qi Chu, Tao Gong, Zhe Li, Weibin Yao, Joey Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23292
Pdf URL: https://arxiv.org/pdf/2506.23292
Copy Paste: [[2506.23292]] DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios(https://arxiv.org/abs/2506.23292)
Keywords: generation
Abstract: Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. In critical domains such as law, interpretability is crucial for enhancing the credibility and authority of decisions. Recent studies attempt to improve the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, the practical effectiveness of these methods remains suboptimal due to limitations of the forgery data. Most current deepfake datasets offer only binary labels, and just a few include localization annotations. Even those few suffer from restricted forgery scenarios, limited diversity in deepfake types, and insufficient data scale, making them inadequate for complex real-world scenarios. To address this predicament, we construct a novel large-scale deepfake detection and localization ($\textbf{DDL}$) dataset containing over $\textbf{1.8M}$ forged samples and encompassing up to $\textbf{75}$ distinct deepfake methods. The DDL design incorporates four key innovations: (1) $\textbf{Diverse Forgery Scenarios}$, (2) $\textbf{Comprehensive Deepfake Methods}$, (3) $\textbf{Varied Manipulation Modes}$, and (4) $\textbf{Fine-grained Forgery Annotations}$. Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries, but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods. The DDL dataset project page is at this https URL.

Title: DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On

Authors: Xiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23295
Pdf URL: https://arxiv.org/pdf/2506.23295
Copy Paste: [[2506.23295]] DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On(https://arxiv.org/abs/2506.23295)
Keywords: generation
Abstract: Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. While recent advances in latent diffusion models have substantially improved visual quality, existing approaches still struggle with preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing to diverse poses and clothing styles. To address these challenges, we propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on. DiffFit adopts a progressive generation strategy: the first stage performs geometry-aware garment warping, aligning the garment with the target body through fine-grained deformation and pose adaptation. The second stage refines texture fidelity via a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image for high-quality rendering. By decoupling geometric alignment and appearance refinement, DiffFit effectively reduces task complexity and enhances both generation stability and visual realism. It excels in preserving garment-specific attributes such as textures, wrinkles, and lighting, while ensuring accurate alignment with the human body. Extensive experiments on large-scale VTON benchmarks demonstrate that DiffFit achieves superior performance over existing state-of-the-art methods in both quantitative metrics and perceptual evaluations.

Title: Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting

Authors: Yiming Huang, Long Bai, Beilei Cui, Yanheng Li, Tong Chen, Jie Wang, Jinlin Wu, Zhen Lei, Hongbin Liu, Hongliang Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23308
Pdf URL: https://arxiv.org/pdf/2506.23308
Copy Paste: [[2506.23308]] Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting(https://arxiv.org/abs/2506.23308)
Keywords: restoration
Abstract: Accurate reconstruction of soft tissue is crucial for advancing automation in image-guided robotic surgery. The recent 3D Gaussian Splatting (3DGS) techniques and their variants, 4DGS, achieve high-quality renderings of dynamic surgical scenes in real-time. However, 3D-GS-based methods still struggle in scenarios with varying illumination, such as low light and over-exposure. Training 3D-GS in such extreme light conditions leads to severe optimization problems and devastating rendering quality. To address these challenges, we present Endo-4DGX, a novel reconstruction method with illumination-adaptive Gaussian Splatting designed specifically for endoscopic scenes with uneven lighting. By incorporating illumination embeddings, our method effectively models view-dependent brightness variations. We introduce a region-aware enhancement module to model the sub-area lightness at the Gaussian level and a spatial-aware adjustment module to learn the view-consistent brightness adjustment. With the illumination adaptive design, Endo-4DGX achieves superior rendering performance under both low-light and over-exposure conditions while maintaining geometric accuracy. Additionally, we employ an exposure control loss to restore the appearance from adverse exposure to the normal level for illumination-adaptive optimization. Experimental results demonstrate that Endo-4DGX significantly outperforms combinations of state-of-the-art reconstruction and restoration methods in challenging lighting environments, underscoring its potential to advance robot-assisted surgical applications. Our code is available at this https URL.

Title: IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23329
Pdf URL: https://arxiv.org/pdf/2506.23329
Copy Paste: [[2506.23329]] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering(https://arxiv.org/abs/2506.23329)
Keywords: generative
Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.

Title: VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design

Authors: Malikussaid, Hilal Hudan Nuha
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23339
Pdf URL: https://arxiv.org/pdf/2506.23339
Copy Paste: [[2506.23339]] VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design(https://arxiv.org/abs/2506.23339)
Keywords: generation
Abstract: Large Language Models (LLMs) demonstrate remarkable potential for scientific discovery, but their application in domains requiring factual accuracy and domain-specific constraints remains challenging. In molecular design for drug discovery, LLMs can suggest creative molecular modifications but often produce chemically invalid or impractical structures. We present VALID-Mol, a systematic framework for integrating chemical validation with LLM-driven molecular design that increases the rate of generating valid chemical structures from 3% to 83%. Our approach combines methodical prompt engineering, automated chemical validation, and a fine-tuned domain-adapted LLM to ensure reliable generation of synthesizable molecules with improved properties. Beyond the specific implementation, we contribute a generalizable methodology for scientifically-constrained LLM applications, with quantifiable reliability improvements. Computational predictions suggest our framework can generate promising candidates for synthesis with up to 17-fold computationally predicted improvements in target affinity while maintaining synthetic accessibility. We provide a detailed analysis of our prompt engineering process, validation architecture, and fine-tuning approach, offering a reproducible blueprint for applying LLMs to other scientific domains where domain-specific validation is essential.

Title: CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation

Authors: Yi Liu, Shengqian Li, Zuzeng Lin, Feng Wang, Si Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23347
Pdf URL: https://arxiv.org/pdf/2506.23347
Copy Paste: [[2506.23347]] CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation(https://arxiv.org/abs/2506.23347)
Keywords: generation
Abstract: The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation, enabling iterative refinement across scales, and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
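To make the contrast between hard codebook lookup and a softmax-based mixture concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the softmax logits are computed, so negative squared distances and a `temperature` parameter are assumed here, and the names `hard_quantize` and `soft_quantize` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hard_quantize(z, codebook):
    """Standard VQ lookup: return the nearest code (argmin blocks gradients)."""
    dists = [squared_distance(z, c) for c in codebook]
    return codebook[dists.index(min(dists))]


def soft_quantize(z, codebook, temperature=1.0):
    """Relaxed lookup: probability-weighted mixture of all codes.

    Assumption: logits are negative squared distances divided by a temperature.
    The output is a smooth function of z, so gradients can pass through it.
    """
    logits = [-squared_distance(z, c) / temperature for c in codebook]
    probs = softmax(logits)
    mixed = [sum(p * c[i] for p, c in zip(probs, codebook)) for i in range(len(z))]
    return mixed, probs


# Hypothetical 2-D codebook and encoder output.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.1]

hard = hard_quantize(z, codebook)
soft, probs = soft_quantize(z, codebook, temperature=0.05)
print(hard, soft, probs)
```

With a low temperature the mixture approaches the hard lookup, while a higher temperature spreads probability mass across several codes.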

Title: Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment

Authors: Pawel Renc, Michal K. Grzeszczyk, Linglong Qian, Nassim Oufattole, Jeff Rasley, Arkadiusz Sitek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23358
Pdf URL: https://arxiv.org/pdf/2506.23358
Copy Paste: [[2506.23358]] Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment(https://arxiv.org/abs/2506.23358)
Keywords: generative
Abstract: We present Federated Timeline Synthesis (FTS), a novel framework for training generative foundation models across distributed time-series data applied to electronic health records (EHR). At its core, FTS represents patient history as tokenized Patient Health Timelines (PHTs), language-agnostic sequences encoding temporal, categorical, and continuous clinical information. Each institution trains an autoregressive transformer on its local PHTs and transmits only model weights to a central server. The server uses the generators to synthesize a large corpus of trajectories and train a Global Generator (GG), enabling zero-shot inference via Monte Carlo simulation of future PHTs. We evaluate FTS on five clinically meaningful prediction tasks using MIMIC-IV data, showing that models trained on synthetic data generated by GG perform comparably to those trained on real data. FTS offers strong privacy guarantees, scalability across institutions, and extensibility to diverse prediction and simulation tasks, especially in healthcare, including counterfactual inference, early warning detection, and synthetic trial design.

Title: OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23361
Pdf URL: https://arxiv.org/pdf/2506.23361
Copy Paste: [[2506.23361]] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions(https://arxiv.org/abs/2506.23361)
Keywords: generation
Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: this https URL. Our code will be released at this https URL.

Title: Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models

Authors: Parham Rezaei, Arash Marioriyad, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23418
Pdf URL: https://arxiv.org/pdf/2506.23418
Copy Paste: [[2506.23418]] Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models(https://arxiv.org/abs/2506.23418)
Keywords: generation
Abstract: Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts. To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
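Probability of Superiority is a standard statistic: the probability that a draw from one distribution exceeds a draw from another. The sketch below estimates it from two samples, counting ties as half. How the paper derives the two distributions from an image or from attention maps is not stated in the abstract, so the pixel x-coordinate samples and the function name `probability_of_superiority` are hypothetical choices for illustration.

```python
from itertools import product


def probability_of_superiority(xs, ys):
    """Estimate P(X > Y) from two samples, counting ties as one half."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for x, y in product(xs, ys):
        if x > y:
            wins += 1.0
        elif x == y:
            wins += 0.5
    return wins / (len(xs) * len(ys))


# Hypothetical data: x-coordinates sampled from the regions of two objects.
left_obj_x = [10, 12, 15, 18]
right_obj_x = [14, 20, 22, 25]

# Score for the relation "second object is to the right of the first".
pos_right = probability_of_superiority(right_obj_x, left_obj_x)
pos_left = probability_of_superiority(left_obj_x, right_obj_x)
print(pos_right, pos_left)
```

Unlike comparing two box centers, this score degrades gradually as the objects overlap, which is one plausible reading of why the abstract contrasts PSE with center-based metrics.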

Title: PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions

Authors: Mahesh Bhosale, Abdul Wasi, Yuanhao Zhai, Yunjie Tian, Samuel Border, Nan Xi, Pinaki Sarder, Junsong Yuan, David Doermann, Xuan Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23440
Pdf URL: https://arxiv.org/pdf/2506.23440
Copy Paste: [[2506.23440]] PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions(https://arxiv.org/abs/2506.23440)
Keywords: generation, generative
Abstract: Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods.

Title: Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation

Authors: Dewen Zeng, Xinrong Hu, Yu-Jen Chen, Yawen Wu, Xiaowei Xu, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23460
Pdf URL: https://arxiv.org/pdf/2506.23460
Copy Paste: [[2506.23460]] Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation(https://arxiv.org/abs/2506.23460)
Keywords: generation
Abstract: Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the related objects can be highlighted in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate the problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from the CDM's external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.

Title: Time-variant Image Inpainting via Interactive Distribution Transition Estimation

Authors: Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Di Lin, Ivor Tsang, Lei Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23461
Pdf URL: https://arxiv.org/pdf/2506.23461
Copy Paste: [[2506.23461]] Time-variant Image Inpainting via Interactive Distribution Transition Estimation(https://arxiv.org/abs/2506.23461)
Keywords: restoration
Abstract: In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images capture the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under the TAMP setup differs significantly in content from the target image and may itself be damaged. Such a scenario arises frequently in daily life, when a damaged image is restored by referring to another image whose source and quality cannot be guaranteed. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with adaptive semantics, thus facilitating the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for the TAMP task, we newly assembled a dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street dataset under two different time-variant image inpainting settings, which show that our method consistently outperforms SOTA reference-guided image inpainting methods for solving TAMP.

Title: MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23482
Pdf URL: https://arxiv.org/pdf/2506.23482
Copy Paste: [[2506.23482]] MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting(https://arxiv.org/abs/2506.23482)
Keywords: generative
Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
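
Illustrative note (not from the paper): the abstract mentions a style-consistency loss built on a pre-trained VGG network and the Gram matrix. The sketch below shows only the generic Gram-matrix comparison, assuming feature maps have already been extracted and flattened to one list per channel. The function names, the normalization by `C * N`, and the toy numbers are assumptions for illustration; the paper's exact loss may differ.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels of N flat activations.

    Normalizing by C * N is one common convention, assumed here.
    """
    c = len(features)
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) / (c * n)
         for j in range(c)]
        for i in range(c)
    ]


def style_loss(feat_a, feat_b):
    """Mean squared difference between the Gram matrices of two feature maps."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    c = len(ga)
    total = sum((ga[i][j] - gb[i][j]) ** 2 for i in range(c) for j in range(c))
    return total / (c * c)


# Toy 2-channel feature maps with 4 spatial positions each (made-up values).
inpainted = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
shifted = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]  # same texture, moved
uniform = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]  # different statistics

print(style_loss(inpainted, shifted))
print(style_loss(inpainted, uniform))
```

Because the Gram matrix sums over spatial positions, the spatially shifted map yields zero loss while the map with different channel correlations does not, which is why this kind of loss targets style rather than layout.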

Title: ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models

Authors: Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23513
Pdf URL: https://arxiv.org/pdf/2506.23513
Copy Paste: [[2506.23513]] ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models(https://arxiv.org/abs/2506.23513)
Keywords: generation
Abstract: Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.

Title: WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23518
Pdf URL: https://arxiv.org/pdf/2506.23518
Copy Paste: [[2506.23518]] WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image(https://arxiv.org/abs/2506.23518)
Keywords: generation
Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

Title: Pyramidal Patchification Flow for Visual Generation

Authors: Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23543
Pdf URL: https://arxiv.org/pdf/2506.23543
Copy Paste: [[2506.23543]] Pyramidal Patchification Flow for Visual Generation(https://arxiv.org/abs/2506.23543)
Keywords: generation
Abstract: Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: large patch sizes are used for high-noise timesteps and small patch sizes for low-noise timesteps; linear projections are learned for each patch size; and Unpatchify is modified accordingly. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speedup over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with little training time. The code and checkpoint are at this https URL.
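
Illustrative note (not from the paper): the idea of using larger patches at high-noise timesteps can be sketched as a lookup from noise level to patch size, together with the resulting token count. The boundary values, the patch sizes, the 32x32 latent, and the convention that `t = 1` means pure noise are all assumptions made for this example.

```python
def patch_size_for_timestep(t, boundaries=(0.5,), patch_sizes=(2, 4)):
    """Pick a patch size for noise level t in [0, 1], where 1 is pure noise.

    `boundaries` and `patch_sizes` are hypothetical. `patch_sizes` is ordered
    from the low-noise level (small patches) to the high-noise level (large).
    """
    level = sum(t >= b for b in boundaries)
    return patch_sizes[level]


def num_tokens(latent_h, latent_w, patch):
    """Tokens produced by non-overlapping patch x patch patchification."""
    assert latent_h % patch == 0 and latent_w % patch == 0
    return (latent_h // patch) * (latent_w // patch)


for t in (0.9, 0.1):
    p = patch_size_for_timestep(t)
    print(f"t={t}: patch={p}, tokens={num_tokens(32, 32, p)}")
```

Doubling the patch size cuts the token count by four, so the high-noise steps are the cheap ones in this sketch; a 3-level variant only needs two boundaries and three patch sizes.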

Title: JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.23552
Pdf URL: https://arxiv.org/pdf/2506.23552
Copy Paste: [[2506.23552]] JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching(https://arxiv.org/abs/2506.23552)
Keywords: generation, generative
Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: this https URL

Title: Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution

Authors: Luigi Sigillo, Renato Giamba, Danilo Comminiello
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23566
Pdf URL: https://arxiv.org/pdf/2506.23566
Copy Paste: [[2506.23566]] Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution(https://arxiv.org/abs/2506.23566)
Keywords: super-resolution
Abstract: The acquisition of high-resolution satellite imagery is often constrained by the spatial and temporal limitations of satellite sensors, as well as the high costs associated with frequent observations. These challenges hinder applications such as environmental monitoring, disaster response, and agricultural management, which require fine-grained and high-resolution data. In this paper, we propose MWT-Diff, an innovative framework for satellite image super-resolution (SR) that combines latent diffusion models with wavelet transforms to address these challenges. At the core of the framework is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder), which generates embeddings that capture metadata attributes, multi-scale frequency information, and temporal relationships. The embedded feature representations steer the hierarchical diffusion dynamics, through which the model progressively reconstructs high-resolution satellite imagery from low-resolution inputs. This process preserves critical spatial characteristics including textural patterns, boundary discontinuities, and high-frequency spectral components essential for detailed remote sensing analysis. The comparative analysis of MWT-Diff across multiple datasets demonstrated favorable performance compared to recent approaches, as measured by standard perceptual quality metrics including FID and LPIPS.
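
Illustrative note (not from the paper): the abstract does not say which wavelet MWT-Diff uses. As a minimal sketch of how a wavelet transform exposes multi-scale frequency information, the code below applies one level of a 2D Haar transform with averaging normalization. The choice of Haar, the normalization, and the sub-band naming are assumptions; naming conventions for `LH` and `HL` vary between libraries.

```python
def haar_1d(signal):
    """One Haar level: pairwise averages (low-pass) and half-differences (high-pass)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high


def _along_columns(rows):
    """Apply haar_1d to every column and return (low, high) as lists of rows."""
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    low = [list(r) for r in zip(*(c[0] for c in cols))]
    high = [list(r) for r in zip(*(c[1] for c in cols))]
    return low, high


def haar_2d(image):
    """One 2D Haar level on a list-of-rows image with even height and width."""
    row_low, row_high = [], []
    for row in image:
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = _along_columns(row_low)   # low-pass along rows
    hl, hh = _along_columns(row_high)  # high-pass along rows
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


bands = haar_2d([[1, 3], [5, 7]])
print(bands)
```

The `LL` band is a half-resolution average of the input, while the other three bands hold the detail that a super-resolution model has to restore.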

Title: Transition Matching: Scalable and Flexible Generative Modeling

Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23589
Pdf URL: https://arxiv.org/pdf/2506.23589
Copy Paste: [[2506.23589]] Transition Matching: Scalable and Flexible Generative Modeling(https://arxiv.org/abs/2506.23589)
Keywords: generation, generative
Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Title: CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23590
Pdf URL: https://arxiv.org/pdf/2506.23590
Copy Paste: [[2506.23590]] CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models(https://arxiv.org/abs/2506.23590)
Keywords: generative
Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries than when answering non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigation performance with only minimal additional inference cost.

Title: AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval

Authors: Suyash Maniyar, Vishvesh Trivedi, Ajoy Mondal, Anand Mishra, C.V. Jawahar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23605
Pdf URL: https://arxiv.org/pdf/2506.23605
Copy Paste: [[2506.23605]] AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval(https://arxiv.org/abs/2506.23605)
Keywords: generation
Abstract: Lecture slide element detection and retrieval are key problems in slide understanding. Training effective models for these tasks often depends on extensive manual annotation. However, annotating large volumes of lecture slides for supervised training is labor intensive and requires domain expertise. To address this, we propose a large language model (LLM)-guided synthetic lecture slide generation pipeline, SynLecSlideGen, which produces high-quality, coherent and realistic slides. We also create an evaluation benchmark, namely RealSlide by manually annotating 1,050 real lecture slides. To assess the utility of our synthetic slides, we perform few-shot transfer learning on real data using models pre-trained on them. Experimental results show that few-shot transfer learning with pretraining on synthetic slides significantly improves performance compared to training only on real data. This demonstrates that synthetic data can effectively compensate for limited labeled lecture slides. The code and resources of our work are publicly available on our project website: this https URL.

Title: SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

Authors: Zhengkang Xiang, Zizhao Li, Amir Khodabandeh, Kourosh Khoshelham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23606
Pdf URL: https://arxiv.org/pdf/2506.23606
Copy Paste: [[2506.23606]] SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion(https://arxiv.org/abs/2506.23606)
Keywords: generation, generative
Abstract: Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.

Title: TurboVSR: Fantastic Video Upscalers and Where to Find Them

Authors: Zhongdao Wang, Guodongfang Zhao, Jingjing Ren, Bailan Feng, Shifeng Zhang, Wenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23618
Pdf URL: https://arxiv.org/pdf/2506.23618
Copy Paste: [[2506.23618]] TurboVSR: Fantastic Video Upscalers and Where to Find Them(https://arxiv.org/abs/2506.23618)
Keywords: super-resolution, generation, generative
Abstract: Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32$\times$32$\times$8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods while being more than 100 times faster, taking only 7 seconds to process a 2-second 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes SR beyond 1080p possible: results on 4K (3648$\times$2048) image SR show surprisingly fine details.
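
Illustrative note (not from the paper): a back-of-the-envelope count shows why a 32x32x8 compression ratio reduces the token load so sharply. The sketch assumes one token per latent position, reads the ratio as (height, width, time), and uses a 24 fps frame rate; the 8x8x4 comparison ratio is a hypothetical baseline chosen for this example, not a figure from the abstract.

```python
import math


def latent_tokens(frames, height, width, ratio):
    """Latent positions for a (height, width, time) compression ratio.

    Assumes one token per latent position and ceiling division for sizes
    that are not exact multiples of the ratio.
    """
    rh, rw, rt = ratio
    return math.ceil(frames / rt) * math.ceil(height / rh) * math.ceil(width / rw)


frames = 2 * 24  # 2-second clip at an assumed 24 fps
high_compression = latent_tokens(frames, 1080, 1920, (32, 32, 8))
low_compression = latent_tokens(frames, 1080, 1920, (8, 8, 4))  # hypothetical baseline

print(high_compression, low_compression, low_compression / high_compression)
```

Under these assumptions the high-compression latent has roughly 32 times fewer positions than the hypothetical baseline, and self-attention cost grows faster than linearly with that count.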

Title: Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

Authors: Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23623
Pdf URL: https://arxiv.org/pdf/2506.23623
Copy Paste: [[2506.23623]] Revisiting Audio-Visual Segmentation with Vision-Centric Transformer(https://arxiv.org/abs/2506.23623)
Keywords: generation
Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at this https URL.

Title: Blending Concepts with Text-to-Image Diffusion Models

Authors: Lorenzo Olearo, Giorgio Longari, Alessandro Raganato, Rafael Peñaloza, Simone Melzi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23630
Pdf URL: https://arxiv.org/pdf/2506.23630
Copy Paste: [[2506.23630]] Blending Concepts with Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.23630)
Keywords: generation
Abstract: Diffusion models have dramatically advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease. In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework. Specifically, concept blending merges the key attributes of multiple concepts (expressed as textual prompts) into a single, novel image that captures the essence of each concept. We investigate four blending methods, each exploiting different aspects of the diffusion pipeline (e.g., prompt scheduling, embedding interpolation, or layer-wise conditioning). Through systematic experimentation across diverse concept categories, such as merging concrete concepts, synthesizing compound words, transferring artistic styles, and blending architectural landmarks, we show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning. Our extensive user study, involving 100 participants, reveals that no single approach dominates in all scenarios: each blending technique excels under certain conditions, with factors like prompt ordering, conceptual distance, and random seed affecting the outcome. These findings highlight the remarkable compositional potential of diffusion models while exposing their sensitivity to seemingly minor input variations.
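
Illustrative note (not from the paper): two of the blending strategies named in the abstract, embedding interpolation and prompt scheduling, can be sketched without any diffusion library. The helper names, the 3-dimensional toy embeddings, and the switch step are assumptions; a real pipeline would use text-encoder outputs and pass the chosen conditioning to the denoiser at each step.

```python
def interpolate_embeddings(emb_a, emb_b, alpha=0.5):
    """Linear blend of two prompt embeddings; alpha=0 gives emb_a, alpha=1 gives emb_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


def prompt_schedule(prompt_a, prompt_b, num_steps, switch_at):
    """Prompt used at each denoising step: prompt_a before switch_at, prompt_b after."""
    return [prompt_a if step < switch_at else prompt_b for step in range(num_steps)]


# Toy 3-dimensional "embeddings" (made-up values standing in for encoder output).
lion = [1.0, 0.0, 2.0]
cat = [0.0, 1.0, 4.0]

print(interpolate_embeddings(lion, cat, alpha=0.5))
print(prompt_schedule("lion", "cat", num_steps=6, switch_at=2))
```

In the scheduling variant the order of the two prompts matters, since the first prompt shapes the early steps, which is consistent with the abstract's remark that prompt ordering affects the outcome.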

Title: VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation

Authors: Peng Huang, Junhu Fu, Bowen Guo, Zeju Li, Yuanyuan Wang, Yi Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23641
Pdf URL: https://arxiv.org/pdf/2506.23641
Copy Paste: [[2506.23641]] VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation(https://arxiv.org/abs/2506.23641)
Keywords: generation, generative
Abstract: As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of a skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thoughts for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are utilized during training and stored across different categories. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combinations of descriptions at test time, we propose a Prototype Condition Mechanism that restricts test embeddings to be similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.
摘要：由于医学图像的出现受多种潜在因素的影响，因此生成模型需要丰富的属性信息以外的标签以产生现实和多样化的图像。例如，生成具有特定模式的皮肤病变的图像需要的描述超出了诊断，例如形状，大小，纹理和颜色。但是，这种详细的描述并不总是可以访问。为了解决这个问题，我们探索一个框架，称为视觉属性提示（VAP） - 扩散，以利用预训练的多模式大型语言模型（MLLM）的外部知识来提高医学图像生成的质量和多样性。首先，为了从不幻觉的MLLM中得出描述，我们设计了一系列提示，这些提示是针对常见的医学成像任务的一系列提示，包括皮肤病学，结直肠和胸部X射线图像。在培训期间使用生成的描述，并跨不同类别存储。在测试过程中，从相应类别中随机检索描述以进行推理。此外，为了使发电机在测试时间内稳健以看不见描述的组合，我们提出了一种原型条件机制，该原型机制限制了测试嵌入与训练中的嵌入相似。在四个数据集中进行三种常见的医学成像类型的实验验证了VAP扩散的有效性。

Title: A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement

Authors: Gaozheng Pei, Ke Ma, Dongpeng Zhang, Chengzhi Sun, Qianqian Xu, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23676
Pdf URL: https://arxiv.org/pdf/2506.23676
Copy Paste: [[2506.23676]] A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement(https://arxiv.org/abs/2506.23676)
Keywords: generation
Abstract: Due to their powerful image generation capabilities, diffusion-based adversarial example generation methods through image editing are rapidly gaining popularity. However, due to reliance on the discriminative capability of the diffusion model, these diffusion-based methods often struggle to generalize beyond conventional image classification tasks, such as in Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the "1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media" competition at ACM MM25, which validates the effectiveness of our approach.
摘要：由于其强大的图像生成功能，通过图像编辑的基于扩散的对抗示例生成方法迅速越来越流行。但是，由于依赖扩散模型的歧视能力，这些基于扩散的方法通常很难概括传统的图像分类任务，例如在DeepFake检测中。此外，增强对抗性示例可转移性的传统策略在适应这些方法方面具有挑战性。为了应对这些挑战，我们提出了一个统一的框架，该框架将传统的可传递性增强策略无缝融合到基于扩散模型的对抗性示例中，通过图像编辑，使其在更广泛的下游任务中实现其应用。我们的方法在ACM MM25上的AI生成媒体竞争时期赢得了“对Deepfake探测器的第一次对抗攻击：在AI生成的媒体竞争时期的挑战，该挑战验证了我们方法的有效性。

Title: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23690
Pdf URL: https://arxiv.org/pdf/2506.23690
Copy Paste: [[2506.23690]] SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation(https://arxiv.org/abs/2506.23690)
Keywords: generation, generative
Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subject transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy that alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. Project page: this https URL
摘要：基于扩散的视频运动自定义有助于从一些视频样本中获取人类运动表示，同时通过精确的文本条件实现任意主题转移。现有的方法通常依赖于语义级别的对齐方式，期望该模型学习新的运动概念并将其与其他实体（例如'猫'或“'狗”）相结合，以产生视觉上吸引人的结果。但是，视频数据涉及复杂的时空模式，仅关注语义上的语义使该模型忽略了运动的视觉复杂性。相反，仅调整视觉表示会导致语义混乱在表示预期的动作时。为了解决这些局限性，我们提出了Synmotion，这是一种新的运动视频生成模型，共同利用语义指导和视觉适应。在语义层面上，我们介绍了双重插入语义理解机制，该机制删除了主体和运动表示形式，从而使模型可以学习自定义运动功能，同时保留其生成能力用于不同主体。在视觉级别，我们将参数有效的运动适配器集成到预训练的视频生成模型中，以增强运动保真度和时间连贯性。此外，我们引入了一种新的嵌入式训练策略，该培训\ textbf {替代优化}主题和运动嵌入，并由手动构造的主题先验视频（SPV）培训数据集支持。该策略促进了运动特异性，同时保留了各种受试者的概括。最后，我们介绍了Motionbench，这是一种具有不同运动模式的新策划的基准测试。 T2V和I2V设置的实验结果表明，\方法的表现优于现有基准。项目页面：此HTTPS URL

Title: Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

Authors: Haoyang Chen, Dongfang Sun, Caoyuan Ma, Shiqin Wang, Kewei Zhang, Zheng Wang, Zhixiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23711
Pdf URL: https://arxiv.org/pdf/2506.23711
Copy Paste: [[2506.23711]] Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion(https://arxiv.org/abs/2506.23711)
Keywords: generation
Abstract: We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes the dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as a prior, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) a large modality gap between planar sketches and 3D priors in diffusion, and (3) performance degradation that is sensitive to sketch quality. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectations in a training-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.
摘要：我们提出了主观相机，这是一种人类成像设备范式，它通过协同使用口头描述和渐进的粗略草图来重建现实世界的场景。这种方法通过将用户的绘图序列视为先验，从而克服了语言歧义的双重限制和素描抽象，从而有效地将主观的感知期望转化为逼真的图像。现有方法面临三个基本障碍：（1）用户特定的主观输入偏见，（2）平面草图和3D先验之间的巨大模态差距，以及（3）草图质量敏感的性能降级。当前的解决方案要求资源密集型模型适应或对草图精确施加不切实际的要求。我们的框架通过概念序列的一代解决了这些挑战。（1）我们通过文本回报优化建立了强大的外观先验，然后实现序列意识的发电生成，以素描顺序处理概念；这些步骤以无火车的方式适应特定于用户的主观期望。（2）我们采用潜在优化，有效地弥合了在扩散中平面草图和3D先验之间的模态差距。（3）我们的分层奖励指导框架可以使用粗略的草图而无需进行艺术专业知识。跨不同数据集的全面评估表明，我们的方法在保持语义和空间连贯性方面取得了最新的性能。

Title: System-Embedded Diffusion Bridge Models

Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23726
Pdf URL: https://arxiv.org/pdf/2506.23726
Copy Paste: [[2506.23726]] System-Embedded Diffusion Bridge Models(https://arxiv.org/abs/2506.23726)
Keywords: generative
Abstract: Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.
摘要：解决反问题 - 从不完整或嘈杂的测量中恢复信号 - 在科学和工程中是基础。基于得分的生成模型（SGM）最近成为了此任务的强大框架。已经形成了两个主要的范式：无监督的方法，这些方法适应了验证的生成模型，以及监督的桥梁方法，这些方法训练以配对的清洁和损坏的数据为条件。尽管前者通常假设对测量模型的了解，但后者在很大程度上忽略了这些结构信息。我们介绍了嵌入式扩散桥模型（SDB），这是一种新的监督桥方法，将已知的线性测量系统明确嵌入到矩阵值值的SDE的系数中。这种原则的集成在培训和部署之间的系统错误指定下证明了各种线性逆问题的一致改进，并为现实世界应用提供了有希望的解决方案。

Title: Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models

Authors: Michel Meintz, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23731
Pdf URL: https://arxiv.org/pdf/2506.23731
Copy Paste: [[2506.23731]] Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models(https://arxiv.org/abs/2506.23731)
Keywords: generation, generative
Abstract: Image generative models have become increasingly popular, but training them requires large datasets that are costly to collect and curate. To circumvent these costs, some parties may exploit existing models by using the generated images as training data for their own models. In general, watermarking is a valuable tool for detecting unauthorized use of generated images. However, when these images are used to train a new model, watermarking can only enable detection if the watermark persists through training and remains identifiable in the outputs of the newly trained model - a property known as radioactivity. We analyze the radioactivity of watermarks in images generated by diffusion models (DMs) and image autoregressive models (IARs). We find that existing watermarking methods for DMs fail to retain radioactivity, as watermarks are either erased during encoding into the latent space or lost in the noising-denoising process (during the training in the latent space). Meanwhile, despite IARs having recently surpassed DMs in image generation quality and efficiency, no radioactive watermarking methods have been proposed for them. To overcome this limitation, we propose the first watermarking method tailored for IARs and with radioactivity in mind - drawing inspiration from techniques in large language models (LLMs), which share IARs' autoregressive paradigm. Our extensive experimental evaluation highlights our method's effectiveness in preserving radioactivity within IARs, enabling robust provenance tracking, and preventing unauthorized use of their generated images.
摘要：图像生成模型已经变得越来越受欢迎，但是训练它们需要大量的数据集，这些数据集的收集和策划费用很高。为了规避这些成本，一些当事方可以通过使用生成的图像作为自己模型的培训数据来利用现有模型。通常，水印是用于检测未经授权使用生成图像的宝贵工具。但是，当这些图像用于训练新模型时，只有在训练中保持水印并在新训练的模型的输出中仍然可以识别时，水印才能检测到检测 - 一种称为放射性的属性。我们分析了通过扩散模型（DMS）和图像自回归模型（IARS）产生的图像中水印的放射性。我们发现，现有的DMS水印方法无法保留放射性，因为在编码潜在空间时要么删除水印，要么在Noisising-Noising-denoising过程中丢失（在潜在空间进行训练期间）。同时，尽管IAR最近在图像产生质量和效率方面超过了DMS，但未提出放射性水印方法。为了克服这一限制，我们提出了针对IARS量身定制的第一种水印方法，并考虑到了放射性 - 从大语言模型（LLMS）中汲取灵感，这些技术（LLMS）共享IARS的自回旋范式。我们广泛的实验评估突出了我们方法在保存IAR内放射性方面的有效性，实现了强大的出处跟踪，并防止未经授权使用其生成的图像。

Title: Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors

Authors: Ce Wang, Wanjie Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23801
Pdf URL: https://arxiv.org/pdf/2506.23801
Copy Paste: [[2506.23801]] Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors(https://arxiv.org/abs/2506.23801)
Keywords: super-resolution, generation, generative
Abstract: Super-resolution (SR) techniques can enhance the spatial resolution of remote sensing images by utilizing low-resolution (LR) images to reconstruct high-resolution (HR) images, enabling more efficient large-scale earth observation applications. While single-image super-resolution (SISR) methods have shown progress, reference-based super-resolution (RefSR) offers superior performance by incorporating historical HR images alongside current LR observations. However, existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gaps and significant land cover changes, often leading to under-generation or over-reliance on the reference image. To address these challenges, we propose CRefDiff, a novel controllable reference-based diffusion model for real-world remote sensing image SR. To address the under-generation problem, CRefDiff is built upon the pretrained Stable Diffusion model, leveraging its powerful generative prior to produce accurate structures and textures. To mitigate over-reliance on the reference, we introduce a dual-branch fusion mechanism that adaptively integrates both local and global information from the reference image. Moreover, this dual-branch design enables reference strength control during inference, enhancing the interactivity and flexibility of the model. Finally, a strategy named Better Start is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce Real-RefRSSRD, a new real-world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel-2 image pairs with diverse land cover changes and significant temporal gaps. Extensive experiments on Real-RefRSSRD show that CRefDiff achieves state-of-the-art performance across various metrics and improves downstream tasks such as scene classification and semantic segmentation.
摘要：超分辨率（SR）技术可以通过利用低分辨率（LR）图像来重建高分辨率（HR）图像，从而增强遥感图像的空间分辨率，从而实现更有效的大型地球观测应用。虽然单片图像超分辨率（SISR）方法已经显示出进度，但基于参考的超分辨率（REFSR）通过将历史HR图像与当前LR观测纳入当前的LR观测值相结合，从而提供了出色的性能。但是，现有的REFSR方法在现实世界中的复杂性（例如跨传感器的分辨率差距和巨大的土地覆盖变化）而遇到的困难，通常会导致对参考图像的发电不足或过度依赖。为了应对这些挑战，我们提出了Crefdiff，这是一种可用于现实世界遥感图像SR的新型基于可控的参考扩散模型。为了解决不足的问题，CREFDIFF建立在验证的稳定扩散模型的基础上，在产生准确的结构和纹理之前利用其强大的生成性。为了减轻对参考的过度依赖，我们引入了一种双分支融合机制，该机制可以适应从参考图像中整合本地和全局信息。此外，这种新颖的双分支设计可以在推理过程中实现参考强度控制，从而增强模型的交互性和灵活性。最后，提出了一种名为“更好的开始”的策略，以大大减少脱索步骤的数量，从而加速推理过程。为了支持进一步的研究，我们介绍了Real-Refrssrd，这是一种新的现实世界REFSR数据集，用于遥感图像，由HR NAIP和LR Sentinel-2图像对组成，具有不同的土地覆盖率变化和明显的时间间隙。对Real-Refrssrd的广泛实验表明，Crefdiff在各种指标上实现了最先进的性能，并改善了下游任务，例如场景分类和语义分段。

Title: Refine Any Object in Any Scene

Authors: Ziwei Chen, Ziling Liu, Zitong Huang, Mingqi Gao, Feng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23835
Pdf URL: https://arxiv.org/pdf/2506.23835
Copy Paste: [[2506.23835]] Refine Any Object in Any Scene(https://arxiv.org/abs/2506.23835)
Keywords: generative
Abstract: Missing object viewpoints are common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring detailed object understanding and appearance modeling. In this paper, we introduce Refine Any object In any ScenE (RAISE), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. RAISE first substitutes degraded objects with proxies produced by a 3D generative model with strong 3D understanding. It then progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DOF pose, followed by correcting spatial and appearance inconsistencies via registration-constrained enhancement. This two-stage refinement ensures high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Extensive experiments on challenging benchmarks show that RAISE significantly outperforms state-of-the-art methods in both novel view synthesis and geometry completion tasks. RAISE is made publicly available at this https URL.
摘要：在场景重建中，对象缺少对象很常见，因为相机路径通常优先考虑捕获整体场景结构而不是单个对象。这使得在保持准确的场景级表示的同时，实现高保真对象级建模是高度挑战。解决此问题对于推进需要详细的对象理解和外观建模的下游任务至关重要。在本文中，我们在任何场景中介绍了完善的任何对象（Rise），这是一个新颖的3D增强框架，该框架利用3D生成先验来恢复细颗粒的对象几何形状和缺少视图下的外观。从通过具有强烈3D理解的3D生成模型替换降解的对象开始，通过将每个代理与其以7-DOF姿势保持降解的对应物来逐步提高几何形状和纹理，然后通过校正空间和外观，并通过注册结构增强来纠正空间和外观。这种两阶段的改进可确保在看不见的视图中，在空间定位，观察到的几何形状和外观方面保持一致性，可确保原始对象的高保真几何形状和外观。关于具有挑战性的基准测试的广泛实验表明，在新型视图合成和几何完成任务中，提高了最先进的方法。在此HTTPS URL上公开提供加薪。

Title: RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment

Authors: Jianing Jin, Jiangyong Ying, Huiyu Duan, Liu Yang, Sijing Wu, Yunhao Li, Yushuo Zheng, Xiongkuo Min, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23852
Pdf URL: https://arxiv.org/pdf/2506.23852
Copy Paste: [[2506.23852]] RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment(https://arxiv.org/abs/2506.23852)
Keywords: quality assessment
Abstract: As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We propose the concept of Robotic-Generated Content (RGC) to describe these videos, which are generated from the egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is then conducted to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: this https URL.
摘要：随着配备相机的机器人平台越来越多地整合到日常生活中，机器人生成的视频已经开始出现在流媒体平台上，使我们能够设想一个人类和机器人共存的未来。我们对机器人生成的内容（RGC）的概念进行了创新，以将这些视频从机器人的以自我为中心的角度产生。 RGC视频的感知质量在人类机器人互动场景中至关重要，RGC视频表现出独特的扭曲和视觉要求，与专业生成内容（PGC）视频和用户生成的内容（UGC）视频明显不同。但是，关于RGC视频质量评估的专门研究仍然缺乏。为了解决这一差距并支持更广泛的机器人应用程序，我们建立了第一个机器人生成的内容数据库（RGCD），该数据库包含来自三个机器人类别的2100个视频，并来自不同的平台。随后进行了主观VQA实验，以评估人类对机器人生成视频的视觉感知。最后，我们进行了基准实验，以评估数据库中11个最先进的VQA模型的性能。实验结果显示，当应用于复杂的，机器人生成的内容时，现有VQA模型的显着局限性，突出了对RGC特异性VQA模型的关键需求。我们的RGCD可公开，网址为：此HTTPS URL。

Title: VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Authors: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23858
Pdf URL: https://arxiv.org/pdf/2506.23858
Copy Paste: [[2506.23858]] VMoBA: Mixture-of-Block Attention for Video Diffusion Models(https://arxiv.org/abs/2506.23858)
Keywords: generation
Abstract: The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.
摘要：全部注意机制的二次复杂性为视频扩散模型（VDM）带来了重要的瓶颈，旨在生成长期的高分辨率视频。尽管已经提出了各种稀疏注意方法，但许多方法被设计为无训练的推理加速器，或者在本地训练时，视频数据固有的唯一时空特征固有的唯一时空特征。本文介绍了阻止注意力（VMOBA）的视频混合物，这是一种专门针对VDM的新型稀疏注意机制。通过对预训练的视频变压器内的注意力模式进行深入分析的激励，该视频变压器揭示了强大的时空位置，各种查询的重要性和特定于头部特定的浓度水平，VMOBA通过三个关键修改增强了原始MOBA框架：（1）层上的层次范围分配方案（1D-2D-2D-3D-3D）的动态性改善，并改善了多样性的调整，以改进多样化的改善效果。（2）全局块选择优先考虑整个注意力头的最显着查询块相互作用；（3）基于阈值的块选择，以基于其累积相似性动态确定所在块的数量。广泛的实验表明，VMOBA显着加速了VDM在更长的序列上的训练，达到2.92倍失败和1.48倍的延迟速度，同时获得了可比甚至较高的发电质量，以达到全部关注。此外，VMOBA在无训练推理方面表现出竞争性能，可为高分辨率视频发行提供2.40倍的失败和1.35倍的延迟速度。
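
The threshold-based block selection described in the VMoBA abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_blocks_by_threshold`, the softmax normalization of the query-block similarities, and the threshold value `tau` are all assumptions made for illustration. The idea shown is only that the number of attended blocks is not fixed in advance but grows until the selected blocks account for a target share of the cumulative similarity.

```python
import math


def select_blocks_by_threshold(scores, tau=0.9):
    """Pick the highest-scoring key blocks until their normalized
    similarity mass reaches tau (hypothetical helper, not from the paper).

    scores: list of query-block similarity scores, one per key block.
    Returns the selected block indices in ascending order.
    """
    # Softmax normalization (an assumption) turns raw scores into a
    # distribution whose cumulative mass can be thresholded.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit blocks from most to least similar.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)

    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= tau:
            break
    return sorted(chosen)


if __name__ == "__main__":
    # A sharply peaked head needs one block; a flat head needs several.
    print(select_blocks_by_threshold([0.0, 0.0, 10.0, 0.0], tau=0.9))
    print(select_blocks_by_threshold([1.0, 1.0, 1.0, 1.0], tau=0.5))
```

A concentrated attention head ends up attending to few blocks, while a diffuse head keeps more, which is the head-specific behaviour the abstract motivates.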

Title: Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages

Authors: Ikechukwu Ogbonna, Lesley Davidson, Soumya Banerjee, Abhishek Dasgupta, Laurence Kenney, Vikranth Harthikote Nagaraja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23958
Pdf URL: https://arxiv.org/pdf/2506.23958
Copy Paste: [[2506.23958]] Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages(https://arxiv.org/abs/2506.23958)
Keywords: generation, generative
Abstract: Millions of people in African countries face barriers to accessing healthcare due to language and literacy gaps. This research tackles this challenge by transforming complex medical documents -- in this case, prosthetic device user manuals -- into accessible formats for underserved populations. This case study in cross-cultural translation is particularly relevant for communities that receive donated prosthetic devices but may not receive the accompanying user documentation, or for whom the documentation, if available online, exists only in formats (e.g., language and readability) that are inaccessible to local populations (e.g., English-language, high resource settings/cultural context). The approach is demonstrated using the widely spoken Pidgin dialect, but our open-source framework has been designed to enable rapid and easy extension to other languages/dialects. This work presents an AI-powered framework designed to process and translate complex medical documents, e.g., user manuals for prosthetic devices, into marginalised languages. The system enables users -- such as healthcare workers or patients -- to upload English-language medical equipment manuals, pose questions in their native language, and receive accurate, localised answers in real time. Technically, the system integrates a Retrieval-Augmented Generation (RAG) pipeline for processing and semantic understanding of the uploaded manuals. It then employs advanced Natural Language Processing (NLP) models for generative question-answering and multilingual translation. Beyond simple translation, it ensures accessibility to device instructions, treatment protocols, and safety information, empowering patients and clinicians to make informed healthcare decisions.
摘要：由于语言和识字差距，非洲国家数百万的人面临获得医疗保健的障碍。这项研究通过将复杂的医疗文件（在这种情况下，假肢用户手册）转换为服务不足人群的可访问格式来应对这一挑战。跨文化翻译中的案例研究与接受捐赠的假肢设备但可能无法收到随附的用户文档的社区特别相关/相关。或者，如果在线使用，则只能以当地人群无法访问的格式（例如语言和可读性）（例如，英语，高资源设置/文化环境）无法获得。使用广泛口语的Pidgin方言来证明该方法，但是我们的开源框架旨在使快速易于扩展到其他语言/方言。这项工作为AI驱动的框架提供了旨在处理和翻译复杂的医疗文档（例如，假体设备的用户手册）为边缘化语言。该系统使用户（例如医疗保健工人或患者）可以上传英语医疗设备手册，用母语提出问题，并实时获得准确的本地化答案。从技术上讲，系统集成了一个检索功能的生成（RAG）管道，以处理和语义对上载手册的理解。然后，它采用先进的自然语言处理（NLP）模型来进行发电和多语言翻译。除了简单的翻译之外，它还可以确保对设备说明，治疗方案和安全信息的可访问性，使患者和临床医生能够做出明智的医疗保健决定。
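
The retrieval step of a RAG pipeline like the one described for the prosthetic-manual system can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' system: it replaces learned embeddings with bag-of-words cosine similarity, and the names `tokenize`, `retrieve`, and `build_prompt`, the prompt wording, and the example manual chunks are all hypothetical. The generative and translation models that would consume the prompt are out of scope here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k manual chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks, k=1):
    """Assemble a prompt that grounds the answer in retrieved manual text."""
    context = "\n".join(retrieve(question, chunks, k))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer in the user's language."
    )


if __name__ == "__main__":
    manual = [
        "Clean the socket liner daily with mild soap.",
        "Charge the battery for four hours before first use.",
    ]
    print(build_prompt("How long should I charge the battery?", manual))
```

In a real deployment the similarity function would normally be replaced by dense multilingual embeddings, since a question posed in Pidgin shares few surface tokens with an English manual.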

Title: Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

Authors: Deng Li, Aming Wu, Yang Li, Yaowei Wang, Yahong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24063
Pdf URL: https://arxiv.org/pdf/2506.24063
Copy Paste: [[2506.24063]] Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios(https://arxiv.org/abs/2506.24063)
Keywords: generation
Abstract: In practice, environments constantly change over time and space, posing significant challenges for object detectors trained under a closed-set assumption, i.e., that training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, when only a small number of test images are available, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process into specific-parameter generation. In particular, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter's parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness of the proposed method. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.
摘要：在实践中，环境不断随着时间和空间的形式变化，对基于封闭设置的假设训练的对象探测器提出了重大挑战，即培训和测试数据共享相同的分布。为此，持续的测试时间适应引起了很多关注，旨在通过微调一些特定参数，例如批处理层来改善检测器的概括。但是，基于少量测试图像，对某些参数进行微调可能会影响其他固定参数的表示能力，从而导致性能降解。取而代之的是，我们探索了一种新机制，即将微调过程转换为特定参数的生成。特别是，我们首先设计了一个基于双路径的域域感知适配器，该适配器将特征分解为域中不变和特定于域的组件，从而有效适应。此外，提出了基于条件扩散的参数生成机制，以基于当前环境综合适配器的参数，以防止优化陷入本地Optima。最后，我们提出了一种以班级为中心的最佳运输对准方法来减轻灾难性遗忘。对各种连续域自适应对象检测任务进行的广泛实验证明了有效性。同时，可视化结果表明，生成参数提取的表示形式可以捕获更多与对象相关的信息并增强概括能力。

Title: Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

Authors: Wonwoong Cho, Yanxia Zhang, Yan-Ying Chen, David I. Inouye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24085
Pdf URL: https://arxiv.org/pdf/2506.24085
Copy Paste: [[2506.24085]] Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention(https://arxiv.org/abs/2506.24085)
Keywords: generative
Abstract: Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experiment results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on the new application of image generative models to augment human creativity.
摘要：将视觉和文本概念融合到一个新的视觉概念中是人类的独特而有力的特征，可以推动创造力。但是，实际上，人类的跨模式概念融合很容易出现认知偏见，例如设计固定，这会导致设计空间中的本地最小值。在本文中，我们提出了一个T2I扩散适配器“ IT-Blender”，该适配器可以自动化混合过程以增强人类创造力。与跨模式概念混合有关的先前作品在编码真实图像的情况下受到限制，而不会丢失细节或解散图像和文本输入。为了解决这些差距，IT-Blender利用了预处理的扩散模型（SD和Flux）将干净的参考图像的潜在表示与嘈杂生成的图像的潜在表示。结合我们的小说混合注意力，IT-Blender编码了真实的参考图像，而不会丢失细节，并将视觉概念与文本指定的对象融合在一起。我们的实验结果表明，IT-Blender在混合视觉和文本概念中的大幅度优于基准，从而阐明了图像生成模型在增强人类创造力方面的新应用。

Title: MotionGPT3: Human Motion as a Second Modality

Authors: Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.24086
Pdf URL: https://arxiv.org/pdf/2506.24086
Copy Paste: [[2506.24086]] MotionGPT3: Human Motion as a Second Modality(https://arxiv.org/abs/2506.24086)
Keywords: generation
Abstract: Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between the two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework in an autoregressive manner.
摘要：尽管多模型模型的最新进展表现出了统一的理解和产生的强大能力和机会，但统一运动语言模型的发展仍然没有得到充实。为了使这种具有高保真性人类运动的模型，必须解决两个核心挑战。第一个是以自回归方式的连续运动方式和离散表示之间的重建差距，第二个是在统一培训期间语言智能的退化。受专家的混合，我们提出了MotionGpt3，这是一种双峰运动语言模型，将人类运动视为第二种方式，通过单独的模型参数解耦运动模型，并既可以启用有效的跨模式相互作用和有效的多模态缩放训练。为了保留语言智能，文本分支保留了预审前的语言模型的原始结构和参数，而新的运动分支是通过共享注意机制集成的，从而使双向信息流在两个模态之间。我们首先采用运动变异自动编码器（VAE）将原始的人类运动编码为潜在表示。基于这个连续的潜在空间，运动分支使用扩散头直接从中间隐藏状态预测运动潜在，从而绕过离散的令牌化。广泛的实验表明，我们的方法在保持强大的语言能力的同时，可以在运动理解和发电任务上实现竞争性能，并以自动回归方式建立统一的双峰运动扩散框架。

Title: WaRA: Wavelet Low Rank Adaptation

Authors: Moein Heidari, Yasamin Medghalchi, Mahdi Khoursha, Reza Rezaeian, Ilker Hacihaliloglu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.24092
Pdf URL: https://arxiv.org/pdf/2506.24092
Copy Paste: [[2506.24092]] WaRA: Wavelet Low Rank Adaptation(https://arxiv.org/abs/2506.24092)
Keywords: generation
Abstract: Parameter-efficient fine-tuning (PEFT) has gained widespread adoption across various applications. Among PEFT techniques, Low-Rank Adaptation (LoRA) and its extensions have emerged as particularly effective, allowing efficient model adaptation while significantly reducing computational overhead. However, existing approaches typically rely on global low-rank factorizations, which overlook local or multi-scale structure and thus fail to capture complex patterns in the weight updates. To address this, we propose WaRA, a novel PEFT method that leverages wavelet transforms to decompose the weight update matrix into a multi-resolution representation. By performing low-rank factorization in the wavelet domain and reconstructing updates through an inverse transform, WaRA obtains compressed adaptation parameters that harness multi-resolution analysis, enabling it to capture both coarse and fine-grained features while providing greater flexibility and sparser representations than standard LoRA. Through comprehensive experiments and analysis, we demonstrate that WaRA achieves superior performance on diverse vision tasks, including image generation, classification, and semantic segmentation, significantly enhancing generated image quality while reducing computational complexity. Although WaRA was primarily designed for vision tasks, we further showcase its effectiveness in language tasks, highlighting its broader applicability and generalizability. The code is publicly available at this https URL.
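The WaRA abstract describes factorizing the weight update in the wavelet domain and mapping it back through an inverse transform. A minimal NumPy sketch of that pipeline follows, under loud assumptions: a single-level orthonormal Haar transform stands in for whatever wavelet the paper uses, and the names `haar_matrix`, `wavelet_lowrank_update`, `B`, and `A` are hypothetical, not taken from the released code.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal single-level Haar analysis matrix for even n (assumed wavelet)."""
    assert n % 2 == 0, "single-level Haar needs an even dimension"
    h = np.zeros((n, n))
    s = 1.0 / np.sqrt(2.0)
    for i in range(n // 2):
        h[i, 2 * i] = s                 # averaging (coarse) rows
        h[i, 2 * i + 1] = s
        h[n // 2 + i, 2 * i] = s        # differencing (detail) rows
        h[n // 2 + i, 2 * i + 1] = -s
    return h

def wavelet_lowrank_update(B, A):
    """Treat B @ A as wavelet coefficients and invert the 2D Haar transform.

    Forward transform: C = Hm @ W @ Hn.T, so the inverse is W = Hm.T @ C @ Hn.
    """
    m, n = B.shape[0], A.shape[1]
    Hm, Hn = haar_matrix(m), haar_matrix(n)
    coeffs = B @ A                      # low-rank factorization in the wavelet domain
    return Hm.T @ coeffs @ Hn           # reconstructed weight update

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2                       # toy sizes, chosen only for illustration
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
delta_w = wavelet_lowrank_update(B, A)
print(delta_w.shape, "trainable params:", r * (m + n), "vs dense:", m * n)
```

Because this single-level Haar transform is orthogonal, the reconstructed update has the same rank as `B @ A`; the sketch only illustrates the transform, factorize, and invert steps and does not reproduce any benefit reported in the paper.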

Title: DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

Authors: Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24102
Pdf URL: https://arxiv.org/pdf/2506.24102
Copy Paste: [[2506.24102]] DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World(https://arxiv.org/abs/2506.24102)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack grounded locations and relations for visual entities. Existing grounded caption datasets, in turn, lack detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates the object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLMs: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.

Title: Epona: Autoregressive Diffusion World Model for Autonomous Driving

Authors: Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24113
Pdf URL: https://arxiv.org/pdf/2506.24113
Copy Paste: [[2506.24113]] Epona: Autoregressive Diffusion World Model for Autonomous Driving(https://arxiv.org/abs/2506.24113)
Keywords: generation
Abstract: Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at this https URL.

Title: TextMesh4D: High-Quality Text-to-4D Mesh Generation

Authors: Sisi Dai, Xinxin Su, Boyan Wan, Ruizhen Hu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24121
Pdf URL: https://arxiv.org/pdf/2506.24121
Copy Paste: [[2506.24121]] TextMesh4D: High-Quality Text-to-4D Mesh Generation(https://arxiv.org/abs/2506.24121)
Keywords: generation, generative
Abstract: Recent advancements in diffusion generative models have significantly advanced image, video, and 3D content creation from user-provided text prompts. However, the challenging problem of dynamic 3D content generation (text-to-4D) with diffusion guidance remains largely unexplored. In this paper, we introduce TextMesh4D, a novel framework for high-quality text-to-4D generation. Our approach leverages per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. We further propose a flexibility-rigidity regularization term to stabilize Jacobian optimization under video diffusion priors, ensuring robust geometric performance. Experiments demonstrate that TextMesh4D achieves state-of-the-art results in terms of temporal consistency, structural fidelity, and visual realism. Moreover, TextMesh4D operates with low GPU memory overhead, requiring only a single 24GB GPU, and offers a cost-effective yet high-quality solution for text-driven 4D mesh generation. The code will be released to facilitate future research in text-to-4D generation.

Title: Calligrapher: Freestyle Text Image Customization

Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24123
Pdf URL: https://arxiv.org/pdf/2506.24123
Copy Paste: [[2506.24123]] Calligrapher: Freestyle Text Image Customization(https://arxiv.org/abs/2506.24123)
Keywords: generation, generative
Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. Third, an in-context generation mechanism is employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.

Title: FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation

Authors: Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24125
Pdf URL: https://arxiv.org/pdf/2506.24125
Copy Paste: [[2506.24125]] FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation(https://arxiv.org/abs/2506.24125)
Keywords: generation
Abstract: Residual connections have been extensively studied and widely applied at the model architecture level. However, their potential in the more challenging data-centric approaches remains unexplored. In this work, we introduce the concept of Data Residual Matching for the first time, leveraging data-level skip connections to facilitate data generation and mitigate data information vanishing. This approach maintains a balance between newly acquired knowledge through pixel space optimization and existing core local information identification within raw data modalities, specifically for the dataset distillation task. Furthermore, by incorporating optimization-level refinements, our method significantly improves computational efficiency, achieving superior performance while reducing training time and peak GPU memory usage by 50%. Consequently, the proposed method, Fast and Accurate Data Residual Matching for Dataset Distillation (FADRM), establishes a new state of the art, demonstrating substantial improvements over existing methods across multiple dataset benchmarks in both efficiency and effectiveness. For instance, with ResNet-18 as the student model and a 0.8% compression ratio on ImageNet-1K, the method achieves 47.7% test accuracy in single-model dataset distillation and 50.0% in multi-model dataset distillation, surpassing RDED by +5.7% and outperforming state-of-the-art multi-model approaches, EDC and CV-DD, by +1.4% and +4.0%. Code is available at: this https URL.
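The FADRM abstract mentions data-level skip connections that balance pixel-space optimization against information in the raw data. One plausible reading can be sketched as a periodic blend of the optimized sample with its raw source; this is an illustrative assumption, not the paper's algorithm. The function name, the blend weight `alpha`, the merge schedule `merge_every`, and the toy quadratic objective are all hypothetical.

```python
import numpy as np

def distill_with_data_residual(x_raw, grad_fn, steps=30, lr=0.1,
                               merge_every=10, alpha=0.5):
    """Gradient descent on a synthetic sample with periodic data-level skips.

    Every `merge_every` steps the current sample is blended with the raw
    sample, so raw-data information is re-injected (assumed blend form).
    """
    x = x_raw.copy()
    for t in range(1, steps + 1):
        x = x - lr * grad_fn(x)                    # pixel-space optimization step
        if t % merge_every == 0:
            x = alpha * x + (1.0 - alpha) * x_raw  # data-level skip connection
    return x

# Toy objective: pull the sample toward an all-zero target.
x_raw = np.ones((4, 4))
grad_fn = lambda x: x                              # gradient of 0.5 * ||x||^2

x_res = distill_with_data_residual(x_raw, grad_fn)
# merge_every > steps disables the skip connection for comparison.
x_plain = distill_with_data_residual(x_raw, grad_fn, merge_every=10**9)
print(float(x_res.mean()), float(x_plain.mean()))
```

In this toy setting the blended sample stays closer to the raw data than the purely optimized one, which is the qualitative effect the abstract attributes to mitigating "data information vanishing".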