2025-07-09

Title: Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)

Authors: Nicholas Merchant, Haitz Sáez de Ocáriz Borde, Andrei Cristian Popescu, Carlos Garcia Jurado Suarez
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05300
Pdf URL: https://arxiv.org/pdf/2507.05300
Copy Paste: [[2507.05300]] Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)(https://arxiv.org/abs/2507.05300)
Keywords: generative
Abstract: We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-$\Sigma$ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that structured versions consistently yield higher text-image alignment scores using visual question answering (VQA) models. The dataset is publicly available at this https URL.
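
To make the structured-versus-shuffled comparison concrete, here is a minimal sketch. It assumes that "randomly shuffled" means permuting the four template parts. The field names, helper functions, and example caption are hypothetical and do not come from the dataset.

```python
import random

# Field order follows the four-part template named in the abstract.
TEMPLATE_FIELDS = ("subject", "setting", "aesthetics", "camera")


def structured_caption(parts: dict) -> str:
    """Join the four caption parts in the fixed template order."""
    return " ".join(parts[field].strip() for field in TEMPLATE_FIELDS)


def shuffled_caption(parts: dict, seed: int) -> str:
    """Join the same four parts in a random order (assumed shuffled baseline)."""
    fields = list(TEMPLATE_FIELDS)
    random.Random(seed).shuffle(fields)
    return " ".join(parts[field].strip() for field in fields)


# Hypothetical example caption, written for illustration only.
example = {
    "subject": "A red fox sitting on a log.",
    "setting": "A snowy forest at dawn.",
    "aesthetics": "Soft light, muted colors.",
    "camera": "Shallow depth of field, eye-level shot.",
}
```

Both variants carry the same information; only the ordering differs, which is the variable the fine-tuning comparison isolates.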

Title: CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection

Authors: Binjia Zhou, Hengrui Lou, Lizhe Chen, Haoyuan Li, Dawei Luo, Shuai Chen, Jie Lei, Zunlei Feng, Yijun Bei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05302
Pdf URL: https://arxiv.org/pdf/2507.05302
Copy Paste: [[2507.05302]] CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection(https://arxiv.org/abs/2507.05302)
Keywords: generation
Abstract: With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucination. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model's discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias [...]. Results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.

Title: Enhancing Underwater Images Using Deep Learning with Subjective Image Quality Integration

Authors: Jose M. Montero, Jose-Luis Lisani
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.05393
Pdf URL: https://arxiv.org/pdf/2507.05393
Copy Paste: [[2507.05393]] Enhancing Underwater Images Using Deep Learning with Subjective Image Quality Integration(https://arxiv.org/abs/2507.05393)
Keywords: generative
Abstract: Recent advances in deep learning, particularly neural networks, have significantly impacted a wide range of fields, including the automatic enhancement of underwater images. This paper presents a deep learning-based approach to improving underwater image quality by integrating human subjective assessments into the training process. To this end, we utilize publicly available datasets containing underwater images labeled by experts as either high or low quality. Our method involves first training a classifier network to distinguish between high- and low-quality images. Subsequently, generative adversarial networks (GANs) are trained using various enhancement criteria to refine the low-quality images. The performance of the GAN models is evaluated using quantitative metrics such as PSNR, SSIM, and UIQM, as well as through qualitative analysis. Results demonstrate that the proposed model -- particularly when incorporating criteria such as color fidelity and image sharpness -- achieves substantial improvements in both perceived and measured image quality.
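
Of the quantitative metrics listed, PSNR is the simplest to state. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE). The flat-list image representation and the function name are assumptions for illustration, not the paper's evaluation code.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized images.

    Images are given as flat lists of pixel intensities.
    """
    if len(reference) != len(enhanced) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the enhanced image is closer to the reference. PSNR needs a reference image, whereas UIQM is a no-reference metric.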

Title: Neural-Driven Image Editing

Authors: Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05397
Pdf URL: https://arxiv.org/pdf/2507.05397
Copy Paste: [[2507.05397]] Neural-Driven Image Editing(https://arxiv.org/abs/2507.05397)
Keywords: generative
Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.

Title: Motion Generation: A Survey of Generative Approaches and Benchmarks

Authors: Aliasghar Khani, Arianna Rampini, Bruno Roy, Larasika Nadela, Noa Kaplan, Evan Atherton, Derek Cheung, Jacky Bibliowicz
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05419
Pdf URL: https://arxiv.org/pdf/2507.05419
Copy Paste: [[2507.05419]] Motion Generation: A Survey of Generative Approaches and Benchmarks(https://arxiv.org/abs/2507.05419)
Keywords: generation, generative
Abstract: Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual agents to human-robot interaction. As the field has rapidly progressed with the introduction of diverse modeling paradigms including GANs, autoencoders, autoregressive models, and diffusion-based techniques, each approach brings its own advantages and limitations. This growing diversity has created a need for a comprehensive and structured review that specifically examines recent developments from the perspective of the generative approach employed. In this survey, we provide an in-depth categorization of motion generation methods based on their underlying generative strategies. Our main focus is on papers published in top-tier venues since 2023, reflecting the most recent advancements in the field. In addition, we analyze architectural principles, conditioning mechanisms, and generation settings, and compile a detailed overview of the evaluation metrics and datasets used across the literature. Our objective is to enable clearer comparisons and identify open challenges, thereby offering a timely and foundational reference for researchers and practitioners navigating the rapidly evolving landscape of motion generation.

Title: Navigating Sparse Molecular Data with Stein Diffusion Guidance

Authors: Van Khoa Nguyen, Lionel Blondé, Alexandros Kalousis
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.05482
Pdf URL: https://arxiv.org/pdf/2507.05482
Copy Paste: [[2507.05482]] Navigating Sparse Molecular Data with Stein Diffusion Guidance(https://arxiv.org/abs/2507.05482)
Keywords: generation
Abstract: Stochastic optimal control (SOC) has recently emerged as a principled framework for fine-tuning diffusion models. However, its dependence on computationally intensive simulations makes it impractical for fast sampling. In parallel, a class of training-free approaches has been developed that guides diffusion models using off-the-shelf classifiers on predicted clean samples, bypassing the need to train classifiers on noisy data. These methods can be interpreted as approximate SOC schemes, using Tweedie's formula to estimate diffusion posteriors. In practice, however, such direct approximations can introduce significant errors, leading to unreliable guidance. In this work, we unify the strengths of both paradigms by proposing a novel training-free diffusion guidance framework based on a surrogate stochastic optimal control objective. We derive a new theoretical bound on the value function that reveals the necessity of correcting the approximate posteriors to remain faithful to the true diffusion posterior. To this end, we connect the problem with Stein variational inference, which seeks the steepest descent direction that minimizes the Kullback-Leibler discrepancy between the two posteriors. Our method, which we refer to as Stein Diffusion Guidance (SDG), introduces a principled correction mechanism and incorporates a novel running cost functional to enable effective guidance in low-density regions. Experiments on challenging molecular generation tasks demonstrate that SDG significantly outperforms standard training-free guidance methods, highlighting its potential for broader applications.
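
For reference, Tweedie's formula as it is commonly used in training-free guidance can be written as follows. The notation is an assumption, since the abstract does not fix it: a Gaussian forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, marginal density $p_t$, and a guidance target $y$.

```latex
% Posterior mean of the clean sample given the noisy one (Tweedie's formula)
\hat{x}_0(x_t) = \mathbb{E}[x_0 \mid x_t]
  = \frac{x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t)}{\alpha_t}

% Direct approximation used by classifier-on-clean-sample guidance
p(y \mid x_t) \approx p\bigl(y \mid \hat{x}_0(x_t)\bigr)
```

The second line replaces the full posterior over $x_0$ with a point estimate at its mean. That is the kind of direct approximation the abstract says can introduce significant errors.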

Title: Cloud Diffusion Part 1: Theory and Motivation

Authors: Andrew Randono
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05496
Pdf URL: https://arxiv.org/pdf/2507.05496
Copy Paste: [[2507.05496]] Cloud Diffusion Part 1: Theory and Motivation(https://arxiv.org/abs/2507.05496)
Keywords: generation
Abstract: Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise -- that is, noise based on independent normal distributions at each point whose mean and variance are independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a "Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
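
The contrast between white noise and power-law noise can be illustrated in one dimension. The sketch below uses plain spectral synthesis (summing sinusoids with random Gaussian coefficients), which is a generic technique and not the paper's construction. The function names and the exponent parameter `alpha` are assumptions.

```python
import math
import random


def power_law_noise(n, alpha, seed=0):
    """Real 1D noise of length n whose expected power spectrum falls off as k**(-alpha).

    alpha = 0 gives approximately white noise; larger alpha puts more energy
    into low frequencies, i.e. large-scale correlations.
    """
    rng = random.Random(seed)
    signal = [0.0] * n
    for k in range(1, n // 2 + 1):
        amplitude = k ** (-alpha / 2.0)  # amplitude is the square root of power
        a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        for t in range(n):
            phase = 2.0 * math.pi * k * t / n
            signal[t] += amplitude * (a * math.cos(phase) + b * math.sin(phase))
    return signal


def lag1_autocorrelation(signal):
    """Correlation between neighbouring samples; near 0 for white noise."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    num = sum(x * y for x, y in zip(centered, centered[1:]))
    den = sum(x * x for x in centered)
    return num / den
```

Comparing `lag1_autocorrelation(power_law_noise(256, 2.0))` with the same quantity for `alpha = 0.0` shows the difference. Neighbouring samples are strongly correlated in the power-law case and nearly uncorrelated in the white case.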

Title: LoomNet: Enhancing Multi-View Image Generation via Latent Space Weaving

Authors: Giulio Federico, Fabio Carrara, Claudio Gennaro, Giuseppe Amato, Marco Di Benedetto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05499
Pdf URL: https://arxiv.org/pdf/2507.05499
Copy Paste: [[2507.05499]] LoomNet: Enhancing Multi-View Image Generation via Latent Space Weaving(https://arxiv.org/abs/2507.05499)
Keywords: generation
Abstract: Generating consistent multi-view images from a single image remains challenging. Lack of spatial consistency often degrades 3D mesh quality in surface reconstruction. To address this, we propose LoomNet, a novel multi-view diffusion architecture that produces coherent images by applying the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Each viewpoint-specific inference generates an encoding representing its own hypothesis of the novel view from a given camera pose, which is projected onto three orthogonal planes. For each plane, encodings from all views are fused into a single aggregated plane. These aggregated planes are then processed to propagate information and interpolate missing regions, combining the hypotheses into a unified, coherent interpretation. The final latent space is then used to render consistent multi-view images. LoomNet generates 16 high-quality and coherent views in just 15 seconds. In our experiments, LoomNet outperforms state-of-the-art methods on both image quality and reconstruction metrics, also showing creativity by producing diverse, plausible novel views from the same input.

Title: Simulating Refractive Distortions and Weather-Induced Artifacts for Resource-Constrained Autonomous Perception

Authors: Moseli Mots'oehli, Feimei Chen, Hok Wai Chan, Itumeleng Tlali, Thulani Babeli, Kyungim Baek, Huaijin Chen
Subjects: cs.CV, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05536
Pdf URL: https://arxiv.org/pdf/2507.05536
Copy Paste: [[2507.05536]] Simulating Refractive Distortions and Weather-Induced Artifacts for Resource-Constrained Autonomous Perception(https://arxiv.org/abs/2507.05536)
Keywords: restoration
Abstract: The scarcity of autonomous vehicle datasets from developing regions, particularly across Africa's diverse urban, rural, and unpaved roads, remains a key obstacle to robust perception in low-resource settings. We present a procedural augmentation pipeline that enhances low-cost monocular dashcam footage with realistic refractive distortions and weather-induced artifacts tailored to challenging African driving scenarios. Our refractive module simulates optical effects from low-quality lenses and air turbulence, including lens distortion, Perlin noise, Thin-Plate Spline (TPS), and divergence-free (incompressible) warps. The weather module adds homogeneous fog, heterogeneous fog, and lens flare. To establish a benchmark, we provide baseline performance using three image restoration models. To support perception research in underrepresented African contexts, without costly data collection, labeling, or simulation, we release our distortion toolkit, augmented dataset splits, and benchmark results.
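
The abstract does not say which fog model the weather module uses. One common choice for homogeneous fog is the standard atmospheric scattering model, sketched below for a single pixel. The per-pixel depth input, the `beta` density coefficient, and the airlight colour are all hypothetical parameters.

```python
import math


def add_homogeneous_fog(pixel, depth_m, beta=0.05, airlight=(200, 200, 200)):
    """Blend one RGB pixel toward the airlight colour according to its depth.

    Uses the standard atmospheric scattering model
        I = J * t + A * (1 - t),  t = exp(-beta * depth)
    where J is the clear-scene colour and A is the airlight.
    """
    transmission = math.exp(-beta * depth_m)
    return tuple(
        round(j * transmission + a * (1.0 - transmission))
        for j, a in zip(pixel, airlight)
    )
```

With a constant `beta` the fog is homogeneous. A heterogeneous variant would let `beta` vary across the image, for example by modulating it with a noise field.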

Title: ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models

Authors: Jiaxu Tian, Xuehui Yu, Yaoxing Wang, Pan Wang, Guangqian Guo, Shan Gao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05568
Pdf URL: https://arxiv.org/pdf/2507.05568
Copy Paste: [[2507.05568]] ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models(https://arxiv.org/abs/2507.05568)
Keywords: generation
Abstract: Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to structural and diverse problems in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by fundamentally originating from design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler addresses uniformity issues in generation that arise from data bias in the prototype distribution balance process. Extensive experimental results verify that ReLayout outperforms baselines and can generate structural and diverse layouts that are more aligned with human aesthetics and more explainable.

Title: Model-free Optical Processors using In Situ Reinforcement Learning with Proximal Policy Optimization

Authors: Yuhang Li, Shiqi Chen, Tingyu Gong, Aydogan Ozcan
Subjects: cs.LG, cs.NE, physics.app-ph, physics.optics
Abstract URL: https://arxiv.org/abs/2507.05583
Pdf URL: https://arxiv.org/pdf/2507.05583
Copy Paste: [[2507.05583]] Model-free Optical Processors using In Situ Reinforcement Learning with Proximal Policy Optimization(https://arxiv.org/abs/2507.05583)
Keywords: generation
Abstract: Optical computing holds promise for high-speed, energy-efficient information processing, with diffractive optical networks emerging as a flexible platform for implementing task-specific transformations. A challenge, however, is the effective optimization and alignment of the diffractive layers, which is hindered by the difficulty of accurately modeling physical systems with their inherent hardware imperfections, noise, and misalignments. While existing in situ optimization methods offer the advantage of direct training on the physical system without explicit system modeling, they are often limited by slow convergence and unstable performance due to inefficient use of limited measurement data. Here, we introduce a model-free reinforcement learning approach utilizing Proximal Policy Optimization (PPO) for the in situ training of diffractive optical processors. PPO efficiently reuses in situ measurement data and constrains policy updates to ensure more stable and faster convergence. We experimentally validated our method across a range of in situ learning tasks, including targeted energy focusing through a random diffuser, holographic image generation, aberration correction, and optical image classification, demonstrating in each task better convergence and performance. Our strategy operates directly on the physical system and naturally accounts for unknown real-world imperfections, eliminating the need for prior system knowledge or modeling. By enabling faster and more accurate training under realistic experimental constraints, this in situ reinforcement learning approach could offer a scalable framework for various optical and physical systems governed by complex, feedback-driven dynamics.
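
The phrase "constrains policy updates" refers to PPO's clipped surrogate objective, sketched below in its standard form. How the paper turns optical measurements into rewards and advantages is not stated in the abstract, so the inputs here are generic and the function name is an assumption.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective from PPO (to be maximised).

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = estimated advantage of action a_i
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

Because the objective depends on old samples only through the probability ratio, the same batch of measurements can be reused for several gradient steps. That is the data-reuse property the abstract highlights.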

Title: Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering

Authors: Shuai Li, Shihan Chen, Wanru Geng, Zhaohua Xu, Xiaolu Liu, Can Dong, Zhen Tian, Changlin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05588
Pdf URL: https://arxiv.org/pdf/2507.05588
Copy Paste: [[2507.05588]] Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering(https://arxiv.org/abs/2507.05588)
Keywords: generation
Abstract: In the realm of industrial quality inspection, defect detection stands as a critical component, particularly in high-precision, safety-critical sectors such as automotive components, aerospace, and medical devices. Traditional methods, reliant on manual inspection or early image processing algorithms, suffer from inefficiencies, high costs, and limited robustness. This paper introduces a semi-supervised defect detection framework based on conditional diffusion (DSYM), leveraging a two-stage collaborative training mechanism and a staged joint optimization strategy. The framework utilizes labeled data for initial training and subsequently incorporates unlabeled data through the generation of pseudo-labels. A conditional diffusion model synthesizes multi-scale pseudo-defect samples, while a CLIP cross-modal feature-based noise filtering mechanism mitigates label contamination. Experimental results on the NEU-DET dataset demonstrate a 78.4% mAP@0.5 with the same amount of labeled data as traditional supervised methods, and 75.1% mAP@0.5 with only 40% of the labeled data required by the original supervised model, showcasing significant advantages in data efficiency. This research provides a high-precision, low-labeling-dependent solution for defect detection in industrial quality inspection scenarios. The work of this article has been open-sourced at this https URL.
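
One plausible reading of the CLIP-based noise filter is a similarity threshold between each sample's image embedding and the text embedding of its pseudo-label. The abstract does not give the actual rule, so the sketch below is an assumption. It takes embeddings as precomputed vectors, and the threshold value and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pseudo_labels(samples, class_text_embeddings, threshold=0.25):
    """Keep pseudo-labelled samples whose image embedding agrees with the
    text embedding of their assigned class.

    samples: list of (image_embedding, pseudo_label) pairs.
    class_text_embeddings: dict mapping label -> text embedding.
    """
    kept = []
    for image_embedding, label in samples:
        score = cosine_similarity(image_embedding, class_text_embeddings[label])
        if score >= threshold:
            kept.append((image_embedding, label))
    return kept
```

Samples whose pseudo-label disagrees with the cross-modal evidence are dropped before training, which limits label contamination.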

Title: Rethinking Layered Graphic Design Generation with a Top-Down Approach

Authors: Jingye Chen, Zhaowen Wang, Nanxuan Zhao, Li Zhang, Difan Liu, Jimei Yang, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05601
Pdf URL: https://arxiv.org/pdf/2507.05601
Copy Paste: [[2507.05601]] Rethinking Layered Graphic Design Generation with a Top-Down Approach(https://arxiv.org/abs/2507.05601)
Keywords: generation
Abstract: Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.

Title: Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration

Authors: Yuyang Hu, Kangfu Mei, Mojtaba Sahraee-Ardakan, Ulugbek S. Kamilov, Peyman Milanfar, Mauricio Delbracio
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.05604
Pdf URL: https://arxiv.org/pdf/2507.05604
Copy Paste: [[2507.05604]] Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration(https://arxiv.org/abs/2507.05604)
Keywords: restoration, super-resolution
Abstract: Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an $N$-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as "collective wisdom", steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
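
The mode-seeking idea can be sketched with a plain mean-shift update. For a Gaussian kernel, the mean-shift vector points along the gradient of the kernel density estimate. This is a simplified illustration on static vectors, not the KDS algorithm itself, which applies patch-wise steering inside the diffusion sampler. The function name, bandwidth, and step size are assumptions.

```python
import math


def mean_shift_step(particles, bandwidth=1.0, step_size=1.0):
    """Move every particle one step toward higher ensemble density.

    particles: list of equal-length vectors (e.g. one flattened patch per
    diffusion sample). Each particle moves toward the kernel-weighted mean
    of the whole ensemble.
    """
    updated = []
    for x in particles:
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * bandwidth ** 2))
            for y in particles
        ]
        total = sum(weights)  # always >= 1 because each particle weights itself
        weighted_mean = [
            sum(w * y[d] for w, y in zip(weights, particles)) / total
            for d in range(len(x))
        ]
        updated.append([a + step_size * (m - a) for a, m in zip(x, weighted_mean)])
    return updated
```

Particles that agree with each other are pulled together toward a shared mode. An isolated outlier receives almost no pull from the rest of the ensemble.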

Title: Generative Head-Mounted Camera Captures for Photorealistic Avatars

Authors: Shaojie Bai, Seunghyeon Seo, Yida Wang, Chenghui Li, Owen Wang, Te-Li Wang, Tianyang Ma, Jason Saragih, Shih-En Wei, Nojun Kwak, Hyung Jun Kim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05620
Pdf URL: https://arxiv.org/pdf/2507.05620
Copy Paste: [[2507.05620]] Generative Head-Mounted Camera Captures for Photorealistic Avatars(https://arxiv.org/abs/2507.05620)
Keywords: generative
Abstract: Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining the ground-truth state of faces. It is physically impossible to obtain synchronized images from head-mounted cameras (HMC) sensing input, which has partial observations in infrared (IR), and an array of outside-in dome cameras, which have full observations that match avatars' appearance. Prior works relying on analysis-by-synthesis methods could generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance on extensive paired captures (HMC and dome) for the same subject makes it operationally expensive to collect large-scale datasets, which cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method is able to properly disentangle the input conditioning signal that specifies facial expression and viewpoint from facial appearance, leading to more accurate ground truth. Furthermore, our method can generalize to unseen identities, removing the reliance on the paired captures. We demonstrate these breakthroughs by evaluating both synthetic HMC images and universal face encoders trained from these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.
摘要：在虚拟现实（VR/AR）中启用光真逼真的头像动画，由于难以获得面部的地面真实状态，因此具有挑战性。从物理上不可能从头部安装的摄像头（HMC）传感输入中获得同步图像，后者在红外（IR）中具有部分观察结果，以及一系列外部圆顶摄像头，具有与Avatars外观相匹配的完整观察结果。依靠分析方法的先前工作可能会产生准确的地面真理，但在其个性化培训中表达和风格之间的不完善。对同一主题的广泛配对捕获（HMC和DOME）的依赖使收集大规模数据集的操作昂贵，这在不同的HMC观点和照明方面无法重复使用。在这项工作中，我们提出了一种新颖的生成方法，即生成HMC（GENHMC），该方法利用了大型未配对的HMC捕获，这些捕获更容易收集，以直接从圆顶捕获的任何条件化的头像状态下直接生成高质量的合成HMC图像。我们表明，我们的方法能够正确解开指定面部表达和观点的输入条件信号，从面部外观，从而导致更准确的地面真相。此外，我们的方法可以推广到看不见的身份，消除对配对捕获的依赖。我们通过评估合成HMC图像和通过这些新的HMC-Avatar对应训练的通用面部编码来证明这些突破，从而实现了更好的数据效率和最先进的准确性。

Title: AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework

Authors: Suoxiang Zhang, Xiaxi Li, Hongrui Chang, Zhuoyan Hou, Guoxin Wu, Ronghua Ji
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.05621
Pdf URL: https://arxiv.org/pdf/2507.05621
Copy Paste: [[2507.05621]] AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework(https://arxiv.org/abs/2507.05621)
Keywords: generation
Abstract: Domain-specific image generation aims to produce high-quality visual content for specialized fields while ensuring semantic accuracy and detail fidelity. However, existing methods exhibit two critical limitations: First, current approaches address prompt engineering and model adaptation separately, overlooking the inherent dependence between semantic understanding and visual representation in specialized domains. Second, these techniques inadequately incorporate domain-specific semantic constraints during content synthesis, resulting in generation outcomes that exhibit hallucinations and semantic deviations. To tackle these issues, we propose AdaptaGen, a hierarchical semantic optimization framework that integrates matrix-based prompt optimization with multi-perspective understanding, capturing comprehensive semantic relationships from both global and local perspectives. To mitigate hallucinations in specialized domains, we design a cross-modal adaptation mechanism, which, when combined with intelligent content synthesis, enables preserving core thematic elements while incorporating diverse details across images. Additionally, we introduce a two-phase caption semantic transformation during the generation phase. This approach maintains semantic coherence while enhancing visual diversity, ensuring the generated images adhere to domain-specific constraints. Experimental results confirm our approach's effectiveness, with our framework achieving superior performance across 40 categories from diverse datasets using only 16 images per category, demonstrating significant improvements in image quality, diversity, and semantic consistency.
摘要：特定于域的图像生成旨在为专业领域生产高质量的视觉内容，同时确保语义准确性和细节保真度。但是，现有方法表现出两个关键局限性：首先，当前方法分别解决了提示工程和模型适应，从而忽略了专用域中语义理解和视觉表示之间的固有依赖性。其次，这些技术在内容合成过程中不足地纳入了特定领域的语义约束，从而产生了表现出幻觉和语义偏差的产生结果。为了解决这些问题，我们提出了Adaptagen，这是一个层次的语义优化框架，将基于矩阵的及时优化与多观点的理解集成在一起，从全球和本地观点捕获全面的语义关系。为了减轻专用域中的幻觉，我们设计了一种跨模式适应机制，当与智能内容合成结合使用时，可以保留核心主题元素，同时在图像跨图像中结合各种细节。此外，我们在生成阶段引入了两相字幕语义转换。这种方法在增强视觉多样性的同时保持语义连贯性，以确保生成的图像遵守特定于领域的约束。实验结果证实了我们的方法的有效性，我们的框架可以使用每类仅16个图像来实现40个类别的卓越性能，从而证明了图像质量，多样性和语义一致性的显着改善。

Title: Graph Learning

Authors: Feng Xia, Ciyuan Peng, Jing Ren, Falih Gozi Febrinanto, Renqiang Luo, Vidya Saikrishna, Shuo Yu, Xiangjie Kong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05636
Pdf URL: https://arxiv.org/pdf/2507.05636
Copy Paste: [[2507.05636]] Graph Learning(https://arxiv.org/abs/2507.05636)
Keywords: generative
Abstract: Graph learning has rapidly evolved into a critical subfield of machine learning and artificial intelligence (AI). Its development began with early graph-theoretic methods, gaining significant momentum with the advent of graph neural networks (GNNs). Over the past decade, progress in scalable architectures, dynamic graph modeling, multimodal learning, generative AI, explainable AI (XAI), and responsible AI has broadened the applicability of graph learning to various challenging environments. Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, thus better supporting real-world applications ranging from drug discovery and fraud detection to recommender systems and scientific reasoning. However, challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness must be addressed to unlock its full potential. This survey provides a comprehensive introduction to graph learning, focusing on key dimensions including scalable, temporal, multimodal, generative, explainable, and responsible graph learning. We review state-of-the-art techniques for efficiently handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability to foster trust and transparency. We also explore ethical considerations, such as privacy and fairness, to ensure responsible deployment of graph learning models. Additionally, we identify and discuss emerging topics, highlighting recent integration of graph learning and other AI paradigms and offering insights into future directions. This survey serves as a valuable resource for researchers and practitioners seeking to navigate the rapidly evolving landscape of graph learning.
摘要：图形学习迅速发展成为机器学习和人工智能（AI）的关键子领域。它的发展始于早期的图理论方法，随着图形神经网络（GNN）的出现而获得了显着的动力。在过去的十年中，可扩展体系结构，动态图形建模，多模式学习，生成AI，可解释的AI（XAI）和负责人AI的进展扩大了图形学习对各种具有挑战性的环境的适用性。图形学习非常重要，因为它可以对传统机器学习努力捕获的复杂，非欧国人的关系进行建模，从而更好地支持从药物发现和欺诈检测到建议系统和科学推理等现实世界的应用。但是，必须解决诸如可伸缩，概括，异质性，可解释性和可信度之类的挑战，以解锁其全部潜力。这项调查提供了对图形学习的全面介绍，重点是关键维度，包括可扩展，时间，多模式，生成，可解释和负责任的图形学习。我们回顾了有效处理大型图形，捕获动态的时间依赖性，集成异质数据模式，生成新的图形样本以及增强解释性以增强信任和透明度的可解释性的最新技术。我们还探讨了诸如隐私和公平之类的道德考虑因素，以确保负责任地部署图形学习模型。此外，我们识别并讨论新兴的主题，重点介绍了图形学习和其他AI范式的最新集成，并为未来的方向提供了见解。这项调查是寻求迅速发展的图形学习景观的研究人员和从业人员的宝贵资源。

Title: MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos

Authors: Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05675
Pdf URL: https://arxiv.org/pdf/2507.05675
Copy Paste: [[2507.05675]] MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos(https://arxiv.org/abs/2507.05675)
Keywords: generation
Abstract: Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at this https URL.
摘要：视频生成的最新进展显示出开放域设置的取得了显着进展，但是医疗视频生成仍然很大程度上没有被散布。医疗视频对于诸如临床培训，教育和模拟等应用至关重要，不仅需要高视觉保真度，而且需要严格的医疗准确性。但是，当前模型在应用于医疗提示时通常会产生不现实或错误的内容，这主要是由于缺乏针对医疗领域量身定制的大型高质量数据集。为了解决这一差距，我们介绍了MedVideOcap-55K，这是第一个大型，多样和字幕的数据集，用于医疗视频生成。它包括超过55,000个跨越现实医疗场景的策划剪辑，为培训通才医学视频生成模型提供了坚实的基础。我们开发了MEDGEN，基于此数据集，该数据集在视觉质量和医疗准确性的多个基准测试中实现了开源模型和与商业系统相媲美的领先性能。我们希望我们的数据集和模型可以作为宝贵的资源，并有助于促进医学视频生成的进一步研究。我们的代码和数据可在此HTTPS URL上找到

Title: LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion

Authors: Yisu Zhang, Chenjie Cao, Chaohui Yu, Jianke Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05678
Pdf URL: https://arxiv.org/pdf/2507.05678
Copy Paste: [[2507.05678]] LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion(https://arxiv.org/abs/2507.05678)
Keywords: generation
Abstract: Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data. Project Page: this https URL
摘要：视频扩散模型（VDM）通过从大规模数据中学习来综合现实视频，证明了出色的功能。尽管Vanilla低级适应（LORA）可以学习特定的空间或时间运动，从而通过受限的数据来驱动VDM，但由于不稳定的融合和非线性可伸缩性，对摄像机轨迹和物体运动的精确控制仍然具有挑战性。为了解决这些问题，我们提出了Lion-Lora，这是一个新颖的框架，通过三个核心原则重新考虑Lora融合：线性可扩展性，正交性和规范一致性。首先，我们分析了浅VDM层中LORA特征的正交性，从而实现了低级可控性。其次，在复杂的摄像机运动组合过程中，跨层上实现了规范一致性，以稳定融合。第三，将可控的令牌集成到扩散变压器（DIT）中，以线性调整具有修改的自发机制的相机和对象的运动振幅，以确保对照的脱钩。此外，我们通过利用静态相机视频，统一空间和时间可控性来扩展狮子 - 洛拉至时间生成。实验表明，Lion-Lora在轨迹控制精度和运动强度调节方面的表现优于最先进的方法，从而通过最小的训练数据实现了出色的概括。项目页面：此HTTPS URL
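
The three principles named in the abstract (orthogonality, norm consistency, linear scalability) can be made concrete on toy vectors. This is only an illustrative sketch: the abstract does not give the actual fusion rule, so `cosine`, `fuse`, and the choice of the average norm as the common target are assumptions, and the flat lists stand in for LoRA feature updates.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _norm(u):
    return math.sqrt(_dot(u, u))


def cosine(u, v):
    """Cosine similarity; a value near 0 means the two updates are near-orthogonal."""
    return _dot(u, v) / (_norm(u) * _norm(v))


def fuse(delta_cam, delta_obj, s_cam, s_obj):
    """Fuse two updates after rescaling both to one shared norm.

    Assumption: the shared norm is the average of the two input norms.
    The scales s_cam and s_obj then act linearly on the fused result.
    """
    n_cam, n_obj = _norm(delta_cam), _norm(delta_obj)
    target = 0.5 * (n_cam + n_obj)
    cam = [x * target / n_cam for x in delta_cam]
    obj = [x * target / n_obj for x in delta_obj]
    return [s_cam * c + s_obj * o for c, o in zip(cam, obj)]
```

In this toy setting, doubling both scales doubles the fused update exactly, and orthogonal inputs do not interfere with each other's components.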

Title: DreamArt: Generating Interactable Articulated Objects from a Single Image

Authors: Ruijie Lu, Yu Liu, Jiaxiang Tang, Junfeng Ni, Yuxiang Wang, Diwen Wan, Gang Zeng, Yixin Chen, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05763
Pdf URL: https://arxiv.org/pdf/2507.05763
Copy Paste: [[2507.05763]] DreamArt: Generating Interactable Articulated Objects from a Single Image(https://arxiv.org/abs/2507.05763)
Keywords: generation
Abstract: Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline: first, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompts and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at this https URL.
摘要：在体现的AI和AR/VR中广泛应用，生成铰接式物体（例如笔记本电脑和微波炉）是一项至关重要但又具有挑战性的任务。当前的图像到3D方法主要集中于表面几何和纹理，忽略了零件分解和发音建模。同时，神经重建方法（例如NERF或Gaussian脱落）依赖于密集的多视图或交互数据，从而限制了它们的可扩展性。在本文中，我们介绍了DreamArt，这是一个新颖的框架，用于从单视图中产生高保真，可相互作用的表达资产。 DreamArt采用了三阶段的管道：首先，它通过图像到3D生成，蒙版填充的3D细分和零件Amodal完成的组合来重建部分细分和完整的3D对象。其次，我们微调了一个视频扩散模型，以捕获零件级的发音先验，利用可移动的零件掩模作为及时的膜掩码，而氨隔图像减轻了由遮挡引起的歧义。最后，DreamArt优化了以双重四个四面体为代表的发音运动，并进行全球质地的改进和重新粉刷，以确保各个部分的连贯，高质量的质地。实验结果表明，DreamArt有效地产生了高质量的铰接物体，具有准确的部分形状，高外观保真度和合理的发音，从而为发明资产的产生提供了可扩展的解决方案。我们的项目页面可在此HTTPS URL上找到。

Title: SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

Authors: Xin Hu, Ke Qin, Guiduo Duan, Ming Li, Yuan-Fang Li, Tao He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05798
Pdf URL: https://arxiv.org/pdf/2507.05798
Copy Paste: [[2507.05798]] SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning(https://arxiv.org/abs/2507.05798)
Keywords: generation
Abstract: Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model's inversion process in preserving the spatial structure of input images, we propose the SPADE (SPatial-Aware Denoising-nEtwork) framework, a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.
摘要：Panoptic场景图生成（PSG）将实例分割与关系理解相结合，以捕获复杂场景中的像素级结构关系。尽管最新利用预训练的视力语言模型（VLM）的方法在开放式摄影环境中显着提高了性能，但它们通常会忽略VLM在空间关系推理中的固有局限性，例如在区分对象相对位置方面的难度，这在次级关系预测中产生了结果。由deno的扩散模型的反演过程在保留输入图像的空间结构中的动机，我们提出了Spade（空间感知的DeNoising-Network）框架 - 一种新型的开放式播放式PSG的方法。 Spade由两个关键步骤组成：（1）倒置针对UNET适应的校准，以及（2）空间感知的上下文推理。在第一步中，我们将一般的预训练的教师扩散模型校准为PSG特异性的DeNoising网络，该网络具有通过基于LORA的轻质劳拉（Lora）基于LORA的微调策略而在反转过程中得出的跨注意地图。在第二步中，我们开发了一个空间感知的关系图变压器，该图形变压器同时捕获本地和远程上下文信息，从而促进了高质量关系查询的生成。在基准PSG和视觉基因组数据集上进行的广泛实验表明，在闭合场景和开放式方案中，Spade优于最先进的方法，尤其是用于空间关系预测。

Title: DREAM: Document Reconstruction via End-to-end Autoregressive Model

Authors: Xin Li, Mingming Gong, Yunfei Wu, Jianxin Dai, Antai Guo, Xinghua Jiang, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05805
Pdf URL: https://arxiv.org/pdf/2507.05805
Copy Paste: [[2507.05805]] DREAM: Document Reconstruction via End-to-end Autoregressive Model(https://arxiv.org/abs/2507.05805)
Keywords: generative
Abstract: Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. Many researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by the phenomenon of error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these limitations, in this paper we present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM converts the text image into a document reconstruction sequence in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.
摘要：文档重建构成了文档分析和认可的重要方面，该领域一直在学术界逐渐产生兴趣。这些研究人员中的许多人采用了一系列文档理解模型来对不同的子任务产生预测，然后通过启发式原则将其结果整合到整体文档重建格式中。然而，这些多阶段方法论被误差传播现象阻碍，导致了次优性能。此外，当代研究利用生成模型在端到端过程中提取纯文本，表和数学表达式的逻辑序列。但是，这种方法在保留与元素布局相关的信息方面不足，这对于文档重建至关重要。为了克服这些上述限制，我们在本文中提出了一种专门为文档重建设计的创新自动回归模型，该模型通过端到端自动回归模型（Dream）称为文档重建。 Dream将文本图像转换为一系列文档重建，以全面的端到端过程，封装了更广泛的文档元素信息。此外，我们还建立了文档重建任务的标准化定义，并引入了新颖的文档相似度度量（DSM）和DOCREC1K数据集，以评估任务的性能。经验结果证明了我们的方法在文档重建领域取得了无与伦比的绩效。此外，在各种子任务上的结果，包括文档布局分析，文本识别，表结构识别，公式识别和阅读顺序检测，表明我们的模型具有竞争力并且与各种任务兼容。

Title: Towards Solar Altitude Guided Scene Illumination

Authors: Samed Doğan, Maximilian Hoh, Nico Leuze, Nicolas R.-Peña, Alfred Schöttl
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05812
Pdf URL: https://arxiv.org/pdf/2507.05812
Copy Paste: [[2507.05812]] Towards Solar Altitude Guided Scene Illumination(https://arxiv.org/abs/2507.05812)
Keywords: generation
Abstract: The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-world data acquisition demands intensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and diverse scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present the solar altitude as a global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for extensive manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.
摘要：安全和强大的自主驾驶功能的发展在很大程度上取决于大规模的高质量传感器数据。但是，现实词的数据获取需要大量的人工劳动力，并且受到标签成本，驾驶员安全协议和各种情况覆盖的因素的强烈限制。因此，多条工作重点是合成相机传感器数据的条件生成。我们确定了有关白天变化的显着差距，这可能是由于可用标签的稀缺性引起的。因此，我们将太阳高度作为全球条件变量呈现。可以从纬度长度坐标和当地时间进行计算，从而消除了对大量手动标记的需求。我们的工作是通过量身定制的归一化方法来补充的，它针对日光对高度数字变化的敏感性。我们证明了它在扩散模型的背景下准确捕获照明特性和照明依赖性图像噪声的能力。
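
The claim that solar altitude is readily computable from coordinates and time can be illustrated with a common textbook approximation. The sketch below is not taken from the paper: `solar_altitude_deg` is a hypothetical helper, it takes a UTC timestamp rather than local time, it uses a simple cosine approximation of the solar declination, and it ignores the equation of time and atmospheric refraction, so small errors in the result are expected.

```python
import math
from datetime import datetime, timezone


def solar_altitude_deg(lat_deg, lon_deg, when_utc):
    """Approximate solar altitude in degrees above the horizon.

    lat_deg, lon_deg: latitude and longitude in degrees (east positive).
    when_utc: datetime expressed in UTC.
    """
    day_of_year = when_utc.timetuple().tm_yday
    # Approximate solar declination for this day of the year.
    declination = math.radians(-23.44) * math.cos(
        2.0 * math.pi * (day_of_year + 10) / 365.0
    )
    # Approximate local solar time from UTC and longitude (15 degrees per hour).
    hours_utc = when_utc.hour + when_utc.minute / 60.0 + when_utc.second / 3600.0
    solar_time = hours_utc + lon_deg / 15.0
    hour_angle = math.radians(15.0 * (solar_time - 12.0))
    lat = math.radians(lat_deg)
    sin_alt = (
        math.sin(lat) * math.sin(declination)
        + math.cos(lat) * math.cos(declination) * math.cos(hour_angle)
    )
    # Clamp against rounding error before taking the arcsine.
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_alt))))
```

For example, `solar_altitude_deg(0.0, 0.0, datetime(2025, 3, 21, 12, 0, tzinfo=timezone.utc))` gives a value close to 90 degrees, as expected at the equator at noon near the equinox, while the same call at midnight gives a strongly negative altitude.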

Title: 2D Instance Editing in 3D Space

Authors: Yuhuan Xie, Aoxuan Pan, Ming-Xian Lin, Wei Huang, Yi-Hua Huang, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05819
Pdf URL: https://arxiv.org/pdf/2507.05819
Copy Paste: [[2507.05819]] 2D Instance Editing in 3D Space(https://arxiv.org/abs/2507.05819)
Keywords: generative
Abstract: Generative models have achieved significant progress in advancing 2D image editing, demonstrating exceptional precision and realism. However, they often struggle with consistency and object identity preservation due to their inherent pixel-manipulation nature. To address this limitation, we introduce a novel "2D-3D-2D" framework. Our approach begins by lifting 2D objects into 3D representation, enabling edits within a physically plausible, rigidity-constrained 3D environment. The edited 3D objects are then reprojected and seamlessly inpainted back into the original 2D image. In contrast to existing 2D editing methods, such as DragGAN and DragDiffusion, our method directly manipulates objects in a 3D environment. Extensive experiments highlight that our framework surpasses previous methods in general performance, delivering highly consistent edits while robustly preserving object identity.
摘要：生成模型在推进2D图像编辑方面取得了重大进展，证明了出色的精度和现实主义。但是，由于其固有的像素操纵性质，他们经常在一致性和对象身份保存中挣扎。为了解决这一限制，我们介绍了一个小说的“ 2d-3d-2d”框架。我们的方法首先将2D对象提升为3D表示，从而在物理上合理的，刚性约束的3D环境中实现了编辑。然后，对编辑的3D对象进行重新投影并无缝地将其分配回原始的2D图像。与现有的2D编辑方法（例如Draggan和DragDiffusion）相反，我们的方法直接在3D环境中操纵对象。广泛的实验强调，我们的框架超过了一般性能的先前方法，提供了高度一致的编辑，同时可以坚固地保留对象身份。
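
The "2D-3D-2D" loop in the abstract (lift, edit rigidly in 3D, reproject) can be sketched for a single pixel with a pinhole camera model. Everything here is an assumption for illustration: the abstract does not specify the 3D representation or camera model, and `lift`, `rigid_transform`, `project`, and the intrinsics `fx, fy, cx, cy` are hypothetical names. Inpainting of the vacated region is not shown.

```python
import math


def lift(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with known depth to a 3D camera-space point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def rigid_transform(point, yaw_rad, translation):
    """Rotate a point about the camera's y axis, then translate it."""
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return (c * x + s * z + tx, y + ty, -s * x + c * z + tz)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)
```

Because the edit is a rigid motion applied in 3D, every point of the object moves consistently, which is the property the abstract contrasts with pixel-space manipulation.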

Title: USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

Authors: Yue Peng, Bing Xiong, Fuqiang Chen, De Eybo, RanRan Zhang, Wanming Hu, Jing Cai, Wenjian Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05843
Pdf URL: https://arxiv.org/pdf/2507.05843
Copy Paste: [[2507.05843]] USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining(https://arxiv.org/abs/2507.05843)
Keywords: generative
Abstract: Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC at the image level, and between real IHC and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.
摘要：免疫组织化学（IHC）虚拟染色是一项任务，可以从H \＆E图像中生成虚拟IHC图像，同时保持与相邻切片的病理语义一致性。该任务旨在通过生成模型实现形态结构和染色模式之间的跨域映射，从而为病理分析提供了有效且具有成本效益的解决方案。但是，在弱配对条件下，相邻切片之间的空间异质性提出了重大挑战。这可能导致一到一对映射不准确，并产生与相邻切片的病理语义不一致的结果。为了解决这个问题，我们为IHC虚拟染色提出了一种新型的不平衡自我信息传输，名为Usigan，它提取了全球形态学的语义，而不依赖于位置，该HTTP URL在关节边际分布中删除了弱配对的弱术语，我们有效地降低了分布的影响，从而使分布对差异的影响有效地一致性，从而使综合性的一致性有效地一致性，并具有综合性的一致性，并有效地提高了对综合性的一致性。此外，我们设计了不平衡的最佳运输一致性（UOT-CTM）机制和病理学自我对应（PC-SCM）机制，以构建H \＆e之间的相关矩阵并在图像级和真实IHC中生成的IHC之间的相关矩阵，并生成了IHC的IHC图像集跨越临床的范围。实验范围跨越了两个临床范围。例如IOD和Pearson-R相关性，表现出更好的临床相关性。
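
Pearson-R, one of the metrics named in the abstract, is the standard Pearson correlation coefficient. The sketch below implements the textbook formula only; how the authors apply it to stained images is not stated in the abstract, so `pearson_r` and its per-image scalar inputs are placeholders.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the common 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the generated and reference measurements rise and fall together; a value near -1 means they move in opposite directions.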

Title: Diffusion Dataset Condensation: Training Your Diffusion Model Faster with Less Data

Authors: Rui Huang, Shitong Shao, Zikai Zhou, Pukun Zhao, Hangyu Guo, Tian Ye, Lichen Bai, Shuo Yang, Zeke Xie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.05914
Pdf URL: https://arxiv.org/pdf/2507.05914
Copy Paste: [[2507.05914]] Diffusion Dataset Condensation: Training Your Diffusion Model Faster with Less Data(https://arxiv.org/abs/2507.05914)
Keywords: generative
Abstract: Diffusion models have achieved remarkable success in various generative tasks, but training them remains highly resource-intensive, often requiring millions of images and many days of GPU computation. From a data-centric perspective addressing this limitation, we study diffusion dataset condensation as a new and challenging problem setting. The goal is to construct a "synthetic" sub-dataset with significantly fewer samples than the original dataset, enabling high-quality diffusion model training with greatly reduced cost. To the best of our knowledge, we are the first to formally investigate dataset condensation for diffusion models, whereas prior work focused on training discriminative models. To tackle this new challenge, we propose a novel Diffusion Dataset Condensation (D2C) framework, which consists of two phases: Select and Attach. The Select phase identifies a compact and diverse subset using a diffusion difficulty score and interval sampling. The Attach phase enhances the selected subset by attaching rich semantic and visual representations to strengthen the conditional signals. Extensive experiments across various dataset sizes, model architectures, and resolutions show that our D2C framework enables significantly faster diffusion model training with dramatically less data, while preserving high visual quality. Notably, for the SiT-XL/2 architecture, D2C achieves a 100x training speed-up, reaching an FID score of 4.3 in just 40k steps using only 0.8% of the training data.
摘要：扩散模型在各种生成任务中取得了巨大的成功，但是培训它们仍然是高度资源密集型的，通常需要数百万的图像和许多天数的GPU计算。从以数据为中心的角度来解决这一限制的角度，我们将扩散数据集凝结作为一种新的挑战性问题设定。目的是构建一个比原始数据集少得多的“合成”子数据集，从而实现高质量的扩散模型训练，其成本大大降低。据我们所知，我们是第一个正式研究扩散模型数据集凝结的人，而先前的工作着重于训练判别模型。为了应对这一新挑战，我们提出了一种新颖的扩散数据集冷凝（D2C）框架，该框架由两个阶段组成：选择和附加。选择阶段使用扩散难度分数和间隔采样来识别紧凑而多样的子集。附加相通过连接丰富的语义和视觉表示以增强条件信号来增强所选子集。各种数据集大小，模型体系结构和决议的广泛实验表明，我们的D2C框架可以更快地使用数据较少的数据进行扩散模型训练，同时保持高视觉质量。值得注意的是，对于SIT-XL/2体系结构，D2C实现了100倍的训练速度，仅使用训练数据的0.8％，在仅40k步骤中达到4.3的FID得分。
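
The Select phase is described as combining a difficulty score with interval sampling. One plausible reading, sketched below, is to rank samples by score and keep evenly spaced ranks so that the subset spans the whole difficulty range. This is an assumption: the abstract defines neither the score nor the exact sampling rule, and `interval_select`, `scores`, and `budget` are hypothetical names.

```python
def interval_select(scores, budget):
    """Return indices of `budget` samples taken at evenly spaced difficulty ranks.

    scores: one difficulty score per sample (placeholder values here).
    budget: number of samples to keep, between 1 and len(scores).
    """
    if not 0 < budget <= len(scores):
        raise ValueError("budget must be between 1 and len(scores)")
    # Rank sample indices from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Take every stride-th rank so that easy and hard samples are both covered.
    stride = len(order) / budget
    return [order[int(k * stride)] for k in range(budget)]
```

For instance, `interval_select([0.9, 0.1, 0.5, 0.3, 0.7, 0.2], 3)` keeps the samples at ranks 0, 2, and 4 of the sorted order, that is indices 1, 3, and 4.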

Title: Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

Authors: Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05963
Pdf URL: https://arxiv.org/pdf/2507.05963
Copy Paste: [[2507.05963]] Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation(https://arxiv.org/abs/2507.05963)
Keywords: generation
Abstract: Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to the best of our knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: this https URL.
摘要：诸如Tora之类的运动引导视频产生的扩散变压器模型的最新进展已显示出很大的进步。在本文中，我们介绍了Tora2，这是Tora的增强版本，该版本引入了一些设计改进，以扩大其外观和运动定制功能。具体来说，我们引入了一个解耦的个性化提取器，该提取器为多个开放式实体生成全面的个性化嵌入，与以前的方法相比，更好地保留细粒度的视觉细节。在此基础上，我们设计了一个封闭式的自我发挥机制，以整合每个实体的轨迹，文本描述和视觉信息。这项创新大大减少了训练期间多模式调节的错位。此外，我们引入了对比损失，通过在运动和个性化嵌入之间的明确映射来共同优化轨迹动力学和实体一致性。据我们所知，Tora2是实现视频生成的外观和运动的同时多实体定制的第一种方法。实验结果表明，Tora2通过最先进的自定义方法在提供高级运动控制功能的同时，实现了竞争性能，这标志着多条件视频的生成中的重要进步。项目页面：此HTTPS URL。
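
The abstract mentions a contrastive loss that maps between motion and personalization embeddings but does not give its form. As a stand-in, the sketch below uses the widely known InfoNCE formulation, in which the i-th motion embedding and the i-th entity embedding form the positive pair. `info_nce`, `temperature`, and the toy embeddings are assumptions, not the paper's definition.

```python
import math


def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(motion, identity, temperature=0.1):
    """Mean InfoNCE loss; motion[i] and identity[i] are the positive pair."""
    m = [_unit(v) for v in motion]
    e = [_unit(v) for v in identity]
    loss = 0.0
    for i, mi in enumerate(m):
        # Cosine similarity of this motion embedding to every entity embedding.
        logits = [sum(a * b for a, b in zip(mi, ej)) / temperature for ej in e]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(m)
```

The loss is small when each motion embedding is closest to its own entity embedding and grows when the pairing is scrambled, which is the kind of behaviour a motion-to-entity consistency objective needs.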

Title: Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval

Authors: Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05970
Pdf URL: https://arxiv.org/pdf/2507.05970
Copy Paste: [[2507.05970]] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval(https://arxiv.org/abs/2507.05970)
Keywords: generation, generative
Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
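Summary: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods achieve promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this, we propose a scalable pipeline for automatic triplet generation, together with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). The pipeline uses a large language model (LLM) to generate diverse prompts that control a text-to-image generative model, producing image pairs that share identical elements within each pair; these pairs are then filtered and reorganized to form the CIRHS dataset. We also introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework that performs global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. Trained on the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time that CIR models can be trained on a fully synthetic dataset. Under supervised training, the method also outperforms all state-of-the-art supervised CIR approaches, validating the effectiveness of the proposed retrieval framework. The code and the CIRHS dataset will be released soon.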
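
Note: for readers new to CIR, the triplet format (reference image, modification text, target image) and the retrieval step mentioned in the abstract can be sketched roughly as below. This is a toy illustration only and is not the CoAlign method: the `Triplet` class, the additive `compose_query` fusion, the file names, and the hand-written 4-dimensional embeddings (car, bike, red, blue) are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triplet:
    """One CIR training example (hypothetical container, not from the paper)."""
    reference_image: str
    modification_text: str
    target_image: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compose_query(image_vec, text_vec):
    """Naive additive fusion of image and text embeddings (an assumption)."""
    return [i + t for i, t in zip(image_vec, text_vec)]


def retrieve(query_vec, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query_vec, gallery[name]))


# Toy embeddings with dimensions (car, bike, red, blue); values are made up.
gallery = {
    "red_car.jpg": [1.0, 0.0, 1.0, 0.0],
    "blue_car.jpg": [1.0, 0.0, 0.0, 1.0],
    "blue_bike.jpg": [0.0, 1.0, 0.0, 1.0],
}
make_it_blue = [0.0, 0.0, -1.0, 1.0]  # stand-in for a text encoder output

triplet = Triplet("red_car.jpg", "make it blue", "blue_car.jpg")
query = compose_query(gallery[triplet.reference_image], make_it_blue)
prediction = retrieve(query, gallery)
print(prediction == triplet.target_image)
```

In a real system the embeddings would come from trained image and text encoders, and the fusion would be learned rather than a plain sum.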

Title: MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding

Authors: Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.06071
Pdf URL: https://arxiv.org/pdf/2507.06071
Copy Paste: [[2507.06071]] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding(https://arxiv.org/abs/2507.06071)
Keywords: generation
Abstract: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs, including text descriptions and reference expression images, to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline.
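Summary: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches rely on static, predefined emotion labels, which limits their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles the content and emotion embedding spaces from motion sequences through a carefully designed cross-reconstruction process, allowing lip movements and facial expressions to be controlled independently. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text to predict frame-wise intensity variations and dynamically adjust static emotion features, producing realistic emotional expressions. To further enhance control and personalization, we incorporate multimodal inputs, including text descriptions and reference expression images, to guide the generation of user-specified facial expressions. With MetaHuman as the priority, the generated results can be conveniently integrated into industrial production pipelines.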

Title: ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

Authors: Chihan Huang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.06078
Pdf URL: https://arxiv.org/pdf/2507.06078
Copy Paste: [[2507.06078]] ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models(https://arxiv.org/abs/2507.06078)
Keywords: generation
Abstract: Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
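Summary: Despite the success of deep learning across many domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perception. Researchers have therefore turned toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality caused by instability and mode collapse. Diffusion models have also been used for UAE generation, but they still rely on iterative PGD perturbation injection and do not fully exploit their core denoising capability. In this paper, we introduce ScoreAdv, a novel diffusion-based approach for generating UAEs. The method uses an interpretable adversarial guidance mechanism to gradually shift the sampling distribution toward the adversarial distribution, and an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, the method can generate an unlimited number of natural adversarial examples and can attack retrieval models as well as classification models. We conduct extensive experiments on the ImageNet and CelebA datasets, validating ScoreAdv across ten target models in both black-box and white-box settings. The results show that ScoreAdv achieves state-of-the-art attack success rates and image quality. Furthermore, the dynamic balance between denoising and adversarial perturbation keeps ScoreAdv robust even under defensive measures.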
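
Note: the $\ell_{p}$-norm constraint and iterative PGD injection that the abstract contrasts against can be sketched for the $\ell_\infty$ case as below. This is the conventional baseline idea, not ScoreAdv itself. The function name `pgd_step`, the fixed toy gradient, and the values of `step_size` and `epsilon` are assumptions chosen for illustration; a real attack would recompute the gradient of the target model's loss at every iteration.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def pgd_step(x_adv, x_orig, grad, step_size, epsilon):
    """One projected-gradient step under an l_inf budget of epsilon.

    Each pixel moves by step_size in the direction of the gradient sign,
    is projected back into [x_orig - epsilon, x_orig + epsilon], and is
    finally clipped to the valid pixel range [0, 1].
    """
    result = []
    for xa, xo, g in zip(x_adv, x_orig, grad):
        moved = xa + step_size * sign(g)
        moved = min(max(moved, xo - epsilon), xo + epsilon)  # l_inf projection
        moved = min(max(moved, 0.0), 1.0)                    # valid pixel range
        result.append(moved)
    return result


# Toy "image" of three pixels and a fixed stand-in gradient (both made up).
x_orig = [0.2, 0.5, 0.9]
toy_grad = [1.0, -2.0, 0.5]
epsilon = 0.08

x_adv = list(x_orig)
for _ in range(3):
    x_adv = pgd_step(x_adv, x_orig, toy_grad, step_size=0.05, epsilon=epsilon)

# Largest per-pixel deviation from the original; it never exceeds epsilon.
max_deviation = max(abs(a - o) for a, o in zip(x_adv, x_orig))
print(max_deviation)
```

The projection is what bounds the perturbation numerically. Unrestricted adversarial examples drop this per-pixel budget, which is the setting the abstract targets.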

Title: Omni-Video: Democratizing Unified Video Understanding and Generation

Authors: Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.06119
Pdf URL: https://arxiv.org/pdf/2507.06119
Copy Paste: [[2507.06119]] Omni-Video: Democratizing Unified Video Understanding and Generation(https://arxiv.org/abs/2507.06119)
Keywords: generation
Abstract: Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of MLLMs and an adapter before the input of diffusion decoders, where the former produces visual tokens and the latter adapts these visual tokens to the conditional space of diffusion decoders; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.
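Summary: Notable breakthroughs in unified understanding and generation modeling have led to remarkable progress in image understanding, reasoning, production, and editing, yet current foundation models focus mainly on images, leaving a gap in unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. The key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that serve as the input of diffusion decoders, which then produce high-quality videos conditioned on those clues. To fully unlock the system's potential for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of MLLMs and an adapter before the input of diffusion decoders, where the vision head produces visual tokens and the adapter maps them into the conditional space of the diffusion decoders; and 2) an efficient multi-stage training scheme that quickly connects MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that the model shows satisfactory generalization across video generation, editing, and understanding tasks.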

Title: OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Authors: Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.06165
Pdf URL: https://arxiv.org/pdf/2507.06165
Copy Paste: [[2507.06165]] OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion(https://arxiv.org/abs/2507.06165)
Keywords: generation, generative
Abstract: The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.
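Summary: Creating 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, which limits their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, guided by flexible 2D part masks that allow intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. The approach supports user-defined part granularity and precise localization, and enables diverse downstream applications. Extensive experiments show that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.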