2025-05-28

Title: Joint-stochastic-approximation Random Fields with Application to Semi-supervised Learning

Authors: Yunfu Song, Zhijian Ou
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20330
Pdf URL: https://arxiv.org/pdf/2505.20330
Copy Paste: [[2505.20330]] Joint-stochastic-approximation Random Fields with Application to Semi-supervised Learning(https://arxiv.org/abs/2505.20330)
Keywords: generation, generative
Abstract: Our examination of deep generative models (DGMs) developed for semi-supervised learning (SSL), mainly GANs and VAEs, reveals two problems. First, mode missing and mode covering phenomena are observed in generation with GANs and VAEs. Second, there exists an awkward conflict between good classification and good generation in SSL by employing directed generative models. To address these problems, we formally present joint-stochastic-approximation random fields (JRFs) -- a new family of algorithms for building deep undirected generative models, with application to SSL. It is found through synthetic experiments that JRFs work well in balancing mode covering and mode missing, and match the empirical data distribution well. Empirically, JRFs achieve good classification results comparable to the state-of-the-art methods on widely adopted datasets -- MNIST, SVHN, and CIFAR-10 in SSL, and simultaneously perform good generation.

Title: Decision Flow Policy Optimization

Authors: Jifeng Hu, Sili Huang, Siyuan Guo, Zhaogeng Liu, Li Shen, Lichao Sun, Hechang Chen, Yi Chang, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20350
Pdf URL: https://arxiv.org/pdf/2505.20350
Copy Paste: [[2505.20350]] Decision Flow Policy Optimization(https://arxiv.org/abs/2505.20350)
Keywords: generation, generative
Abstract: In recent years, generative models have shown remarkable capabilities across diverse fields, including images, videos, language, and decision-making. By applying powerful generative models such as flow-based models to reinforcement learning, we can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces, surpassing the limitations of single-modal action distributions with traditional Gaussian-based policies. Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets, with policy optimization conducted separately through additional policies using value-based sample weighting or gradient-based updates. However, this separation prevents the simultaneous optimization of multi-modal distribution fitting and policy improvement, ultimately hindering the training of models and degrading the performance. To address this issue, we propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization. Specifically, our method formulates the action generation procedure of flow-based models as a flow decision-making process, where each action generation step corresponds to one flow decision. Consequently, our method seamlessly optimizes the flow policy while capturing multi-modal action distributions. We provide rigorous proofs of Decision Flow and validate the effectiveness through extensive experiments across dozens of offline RL environments. Compared with established offline RL baselines, the results demonstrate that our method achieves or matches the SOTA performance.
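Sketch: To illustrate the multi-step action generation described in the abstract, the example below integrates a velocity field from a noise sample to an action with Euler steps and treats each step as one decision. The names `generate_action` and `straight_line_velocity`, the scalar action, and the toy velocity field are assumptions made for illustration and are not the authors' implementation; the policy optimization itself is not shown.

```python
from typing import Callable, List


def generate_action(
    velocity_fn: Callable[[float, float, float], float],
    state: float,
    noise: float,
    num_steps: int,
) -> List[float]:
    """Integrate da/dt = v(state, a, t) from t=0 to t=1 with Euler steps.

    The whole trajectory is returned so that every intermediate action can
    be viewed as the outcome of one step-level decision.
    """
    dt = 1.0 / num_steps
    action = noise
    trajectory = [action]
    for step in range(num_steps):
        t = step * dt
        action = action + dt * velocity_fn(state, action, t)
        trajectory.append(action)
    return trajectory


def straight_line_velocity(state: float, action: float, t: float) -> float:
    """Hypothetical velocity field: straight path toward the target 2 * state."""
    target = 2.0 * state
    return (target - action) / (1.0 - t)


trajectory = generate_action(straight_line_velocity, state=0.5, noise=-1.0, num_steps=4)
final_action = trajectory[-1]
```

With a learned velocity network in place of the toy field, the same loop would produce one action per state, and the intermediate entries of `trajectory` are the points at which a step-level objective could be applied.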

Title: FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

Authors: Dong Liu, Jiayi Zhang, Yifan Li, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
Subjects: cs.LG, cs.AI, cs.CV, cs.MM, cs.PF
Abstract URL: https://arxiv.org/abs/2505.20353
Pdf URL: https://arxiv.org/pdf/2505.20353
Copy Paste: [[2505.20353]] FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation(https://arxiv.org/abs/2505.20353)
Keywords: generation, generative
Abstract: Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with best generation output quality compared to other cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at this https URL.
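Sketch: The timestep-level caching idea in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions: a fixed threshold on the relative change of the hidden state stands in for the paper's hypothesis-testing decision rule, a scalar `scale` and `shift` stand in for the learnable linear approximation, and the class name `CachedBlock` is hypothetical.

```python
from typing import Callable, List, Optional


def relative_change(current: List[float], previous: List[float]) -> float:
    """L2 norm of the difference, relative to the norm of the previous state."""
    diff = sum((c - p) ** 2 for c, p in zip(current, previous)) ** 0.5
    norm = sum(p ** 2 for p in previous) ** 0.5
    return diff / norm if norm > 0.0 else float("inf")


class CachedBlock:
    """Wraps an expensive block and skips it when the input barely changed."""

    def __init__(
        self,
        block_fn: Callable[[List[float]], List[float]],
        scale: float,
        shift: float,
        threshold: float,
    ) -> None:
        self.block_fn = block_fn
        self.scale = scale          # assumed stand-in for a learned linear map
        self.shift = shift
        self.threshold = threshold  # assumed stand-in for the statistical test
        self.prev_hidden: Optional[List[float]] = None
        self.full_calls = 0
        self.skipped_calls = 0

    def __call__(self, hidden: List[float]) -> List[float]:
        reuse = (
            self.prev_hidden is not None
            and relative_change(hidden, self.prev_hidden) < self.threshold
        )
        self.prev_hidden = list(hidden)
        if reuse:
            # Cheap path: linear approximation instead of the full block.
            self.skipped_calls += 1
            return [self.scale * h + self.shift for h in hidden]
        self.full_calls += 1
        return self.block_fn(hidden)


block = CachedBlock(lambda h: [2.0 * x for x in h], scale=2.0, shift=0.0, threshold=0.05)
outputs = [block(h) for h in ([1.0, 1.0], [1.0, 1.001], [2.0, 2.0])]
```

In this toy run the second input differs from the first by well under the threshold, so the full block is skipped once and evaluated twice.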

Title: GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Authors: Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20355
Pdf URL: https://arxiv.org/pdf/2505.20355
Copy Paste: [[2505.20355]] GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2505.20355)
Keywords: generation, generative
Abstract: Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at this https URL
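Sketch: The sub-block partitioning described in the abstract can be illustrated with a small pure-Python example that builds a weight update from a k x k grid of independent low-rank adapters. The function names, the random initialization, and the choice of a per-block rank of r / k (which makes the parameter count equal to that of a rank-r LoRA) are assumptions of this sketch, not details taken from the abstract.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def blockwise_lowrank_delta(
    out_dim: int, in_dim: int, k: int, block_rank: int, rng: random.Random
) -> Tuple[Matrix, int]:
    """Build a weight update from k * k sub-blocks, each with its own adapter."""
    assert out_dim % k == 0 and in_dim % k == 0
    block_out, block_in = out_dim // k, in_dim // k
    delta = [[0.0] * in_dim for _ in range(out_dim)]
    n_params = 0
    for i in range(k):
        for j in range(k):
            b = random_matrix(block_out, block_rank, rng)  # adapter "up" factor
            a = random_matrix(block_rank, block_in, rng)   # adapter "down" factor
            block = matmul(b, a)
            n_params += block_out * block_rank + block_rank * block_in
            for r in range(block_out):
                delta[i * block_out + r][j * block_in:(j + 1) * block_in] = block[r]
    return delta, n_params


def lora_param_count(out_dim: int, in_dim: int, rank: int) -> int:
    """Parameters of a single rank-`rank` adapter over the full matrix."""
    return rank * (out_dim + in_dim)


delta, n_block = blockwise_lowrank_delta(8, 8, k=2, block_rank=2, rng=random.Random(0))
n_lora = lora_param_count(8, 8, rank=4)
```

Under these assumptions, the four rank-2 blocks use the same 64 parameters as one rank-4 adapter on the 8 x 8 matrix, while each sub-block is updated by its own pair of factors.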

Title: What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Authors: Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20405
Pdf URL: https://arxiv.org/pdf/2505.20405
Copy Paste: [[2505.20405]] What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models(https://arxiv.org/abs/2505.20405)
Keywords: generative
Abstract: Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

Title: Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach

Authors: Tal Gonen, Itai Pemper, Ilan Naiman, Nimrod Berman, Omri Azencot
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20446
Pdf URL: https://arxiv.org/pdf/2505.20446
Copy Paste: [[2505.20446]] Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach(https://arxiv.org/abs/2505.20446)
Keywords: generation, generative
Abstract: Generative modeling of time series is a central challenge in time series analysis, particularly under data-scarce conditions. Despite recent advances in generative modeling, a comprehensive understanding of how state-of-the-art generative models perform under limited supervision remains lacking. In this work, we conduct the first large-scale study evaluating leading generative models in data-scarce settings, revealing a substantial performance gap between full-data and data-scarce regimes. To close this gap, we propose a unified diffusion-based generative framework that can synthesize high-fidelity time series across diverse domains using just a few examples. Our model is pre-trained on a large, heterogeneous collection of time series datasets, enabling it to learn generalizable temporal representations. It further incorporates architectural innovations such as dynamic convolutional layers for flexible channel adaptation and dataset token conditioning for domain-aware generation. Without requiring abundant supervision, our unified model achieves state-of-the-art performance in few-shot settings, outperforming domain-specific baselines across a wide range of subset sizes. Remarkably, it also surpasses all baselines even when tested on full-dataset benchmarks, highlighting the strength of pre-training and cross-domain generalization. We hope this work encourages the community to revisit few-shot generative modeling as a key problem in time series research and pursue unified solutions that scale efficiently across domains. Code is available at this https URL.

Title: DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data

Authors: Ruqi Wu, Xinjie Wang, Liu Liu, Chunle Guo, Jiaxiong Qiu, Chongyi Li, Lichao Huang, Zhizhong Su, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20460
Pdf URL: https://arxiv.org/pdf/2505.20460
Copy Paste: [[2505.20460]] DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data(https://arxiv.org/abs/2505.20460)
Keywords: generation
Abstract: We present DIPO, a novel framework for the controllable generation of articulated 3D objects from a pair of images: one depicting the object in a resting state and the other in an articulated state. Compared to the single-image approach, our dual-image input imposes only a modest overhead for data collection, but at the same time provides important motion information, which is a reliable guide for predicting kinematic relationships between parts. Specifically, we propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. In addition, we introduce a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity relationships. To further improve robustness and generalization on complex articulated objects, we develop a fully automated dataset expansion pipeline, named LEGO-Art, that enriches the diversity and complexity of the PartNet-Mobility dataset. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions. Extensive experiments demonstrate that DIPO significantly outperforms existing baselines in both the resting state and the articulated state, while the proposed PM-X dataset further enhances generalization to diverse and structurally complex articulated objects. Our code and dataset will be released to the community upon publication.

Title: WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

Authors: Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula
Subjects: cs.CV, cs.AI, cs.ET, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.20471
Pdf URL: https://arxiv.org/pdf/2505.20471
Copy Paste: [[2505.20471]] WeatherEdit: Controllable Weather Editing with 4D Gaussian Field(https://arxiv.org/abs/2505.20471)
Keywords: generation
Abstract: In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physical-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: this https URL

Title: ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image

Authors: Dongyu Luo, Kelin Yu, Amir-Hossein Shahidzadeh, Cornelia Fermüller, Yiannis Aloimonos
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.20498
Pdf URL: https://arxiv.org/pdf/2505.20498
Copy Paste: [[2505.20498]] ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image(https://arxiv.org/abs/2505.20498)
Keywords: generation
Abstract: Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic output and poor transferability to downstream tasks. To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With those physical priors as control input, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach. Project page: this https URL.

Title: MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance

Authors: Aniket Roy, Maitreya Suin, Ketul Shah, Rama Chellappa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20525
Pdf URL: https://arxiv.org/pdf/2505.20525
Copy Paste: [[2505.20525]] MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance(https://arxiv.org/abs/2505.20525)
Keywords: generation, generative
Abstract: Low-Rank Adaptation (LoRA) has gained prominence as a computationally efficient method for fine-tuning generative models, enabling distinct visual concept synthesis with minimal overhead. However, current methods struggle to effectively merge multiple LoRA adapters without training, particularly in complex compositions involving diverse visual elements. We introduce MultLFG, a novel framework for training-free multi-LoRA composition that utilizes frequency-domain guidance to achieve adaptive fusion of multiple LoRAs. Unlike existing methods that uniformly aggregate concept-specific LoRAs, MultLFG employs a timestep and frequency subband adaptive fusion strategy, selectively activating relevant LoRAs based on content relevance at specific timesteps and frequency bands. This frequency-sensitive guidance not only improves spatial coherence but also provides finer control over multi-LoRA composition, leading to more accurate and consistent results. Experimental evaluations on the ComposLoRA benchmark reveal that MultLFG substantially enhances compositional fidelity and image quality across various styles and concept sets, outperforming state-of-the-art baselines in multi-concept generation tasks. Code will be released.

Title: Causality and "In-the-Wild" Video-Based Person Re-ID: A Survey

Authors: Md Rashidunnabi, Kailash Hambarde, Hugo Proença
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20540
Pdf URL: https://arxiv.org/pdf/2505.20540
Copy Paste: [[2505.20540]] Causality and "In-the-Wild" Video-Based Person Re-ID: A Survey(https://arxiv.org/abs/2505.20540)
Keywords: generative
Abstract: Video-based person re-identification (Re-ID) remains brittle in real-world deployments despite impressive benchmark performance. Most existing models rely on superficial correlations such as clothing, background, or lighting that fail to generalize across domains, viewpoints, and temporal variations. This survey examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based Re-ID. We provide a structured and critical analysis of methods that leverage structural causal models, interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. The survey is organized around a novel taxonomy of causal Re-ID methods that spans generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess practical challenges of scalability, fairness, interpretability, and privacy that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This survey aims to establish a coherent foundation for causal video-based person Re-ID and to catalyze the next phase of research in this rapidly evolving domain.

Title: Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

Authors: Xingyu Chen, Shihao Ma, Runsheng Lin, Jiecong Lin, Bo Wang
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2505.20578
Pdf URL: https://arxiv.org/pdf/2505.20578
Copy Paste: [[2505.20578]] Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL(https://arxiv.org/abs/2505.20578)
Keywords: generative
Abstract: Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.
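Sketch: The trade-off between maximizing activity in the target cell type and constraining off-target activity can be illustrated with a Lagrangian relaxation, a standard device for constrained RL. The abstract does not state the exact formulation, so the reward shape, the thresholds, the dual-ascent update, and all names below are assumptions made for illustration.

```python
from typing import List


def constrained_reward(
    target_activity: float,
    off_target: List[float],
    thresholds: List[float],
    multipliers: List[float],
) -> float:
    """Target-cell activity minus multiplier-weighted constraint violations."""
    penalty = sum(
        lam * max(0.0, activity - limit)
        for lam, activity, limit in zip(multipliers, off_target, thresholds)
    )
    return target_activity - penalty


def update_multipliers(
    multipliers: List[float],
    off_target: List[float],
    thresholds: List[float],
    step_size: float,
) -> List[float]:
    """Dual ascent: raise a multiplier when its constraint is violated,
    lower it otherwise, and never let it go below zero."""
    return [
        max(0.0, lam + step_size * (activity - limit))
        for lam, activity, limit in zip(multipliers, off_target, thresholds)
    ]


# Hypothetical predicted activities for one generated sequence.
reward = constrained_reward(0.9, off_target=[0.6, 0.2], thresholds=[0.3, 0.3], multipliers=[1.0, 1.0])
new_multipliers = update_multipliers([1.0, 1.0], [0.6, 0.2], [0.3, 0.3], step_size=0.5)
```

In this toy case only the first off-target cell type exceeds its threshold, so only it is penalized, and its multiplier grows while the other shrinks.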

Title: ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

Authors: Yohai Mazuz, Janna Bruner, Lior Wolf
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20626
Pdf URL: https://arxiv.org/pdf/2505.20626
Copy Paste: [[2505.20626]] ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation(https://arxiv.org/abs/2505.20626)
Keywords: generation
Abstract: In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject's appearance across different prompts. However, since style and appearance are often entangled, the existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that achieves both style alignment and subject consistency. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) that are used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do so without shifting from the target style, we align the statistics of the Value matrices. As is demonstrated in a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.
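Sketch: The attention manipulation described in the abstract, with Queries and Keys taken from the anchored branch and Values taken from a parallel non-anchored branch, can be illustrated on plain nested lists. The helper names, the tiny example tensors, and the per-channel mean and standard deviation matching used for "aligning the statistics" (including its direction) are assumptions of this sketch, not the authors' code.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return out


def align_statistics(values: Matrix, reference: Matrix) -> Matrix:
    """Match each channel of `values` to the mean and std of `reference`."""
    def stats(column: List[float]):
        mean = sum(column) / len(column)
        std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
        return mean, std

    columns = []
    for col_v, col_r in zip(zip(*values), zip(*reference)):
        mean_v, std_v = stats(list(col_v))
        mean_r, std_r = stats(list(col_r))
        ratio = std_r / std_v if std_v > 0.0 else 0.0
        columns.append([(x - mean_v) * ratio + mean_r for x in col_v])
    return [list(row) for row in zip(*columns)]


q_anchor = [[1.0, 0.0], [0.0, 1.0]]   # from the subject-anchored branch
k_anchor = [[1.0, 0.0], [0.0, 1.0]]
v_style = [[0.0, 2.0], [4.0, 6.0]]    # from the non-anchored parallel branch
mixed = attention(q_anchor, k_anchor, v_style)
aligned = align_statistics(v_style, reference=[[1.0, 1.0], [3.0, 5.0]])
```

Because the attention weights come only from the anchored Queries and Keys, the layout of "who attends to whom" follows the subject, while the content that gets mixed comes from the style branch.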

Title: Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

Authors: Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20629
Pdf URL: https://arxiv.org/pdf/2505.20629
Copy Paste: [[2505.20629]] Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training(https://arxiv.org/abs/2505.20629)
Keywords: generation
Abstract: Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through a detailed ablation study and analysis.
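Sketch: The random patch swapping with per-frame conditioning strength mentioned in the abstract might look roughly like the following. Patches are represented as string labels for readability; the function name, the interpretation of "strength" as the fraction of swapped patches, and the per-frame strengths are assumptions of this sketch.

```python
import random
from typing import List


def random_patch_swap(
    frame_patches: List[str],
    image_patches: List[str],
    strength: float,
    rng: random.Random,
) -> List[str]:
    """Replace a random `strength` fraction of a frame's patches with the
    patches at the same positions in the condition image."""
    assert len(frame_patches) == len(image_patches)
    assert 0.0 <= strength <= 1.0
    n_patches = len(frame_patches)
    n_swap = round(strength * n_patches)
    swap_indices = set(rng.sample(range(n_patches), n_swap))
    return [
        image_patches[i] if i in swap_indices else frame_patches[i]
        for i in range(n_patches)
    ]


rng = random.Random(0)
image = [f"i{p}" for p in range(8)]          # noisy latent patches of the condition image
strengths = [1.0, 0.5, 0.25]                 # assumed: weaker conditioning for later frames
frames = [[f"f{t}_{p}" for p in range(8)] for t in range(len(strengths))]
swapped = [random_patch_swap(f, image, s, rng) for f, s in zip(frames, strengths)]
swapped_counts = [sum(patch.startswith("i") for patch in frame) for frame in swapped]
```

With eight patches per frame and these strengths, 8, 4, and 2 patches are taken from the condition image in the three frames.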

Title: Open-Det: An Efficient Learning Framework for Open-Ended Detection

Authors: Guiping Cao, Tao Wang, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20639
Pdf URL: https://arxiv.org/pdf/2505.20639
Copy Paste: [[2505.20639]] Open-Det: An Efficient Learning Framework for Open-Ended Detection(https://arxiv.org/abs/2505.20639)
Keywords: generation
Abstract: Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, combined with the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: this https URL.
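Sketch: The percentages quoted in the abstract follow directly from the raw figures given in parentheses. The short check below recomputes them; the helper name `percent` is introduced here only for illustration.

```python
def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


data_share = percent(0.077, 5.077)   # training samples, in millions
epoch_share = percent(31, 149)       # training epochs
print(f"{data_share:.1f}% of the data, {epoch_share:.1f}% of the epochs")
# prints "1.5% of the data, 20.8% of the epochs"
```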
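
The resource ratios quoted in the Open-Det abstract can be checked with a few lines of Python. This is only a sanity check on the stated figures; the dictionary keys and the helper name are illustrative, not taken from the paper.

```python
# Figures quoted in the Open-Det abstract (Open-Det vs. GenerateU).
open_det = {"train_data_millions": 0.077, "epochs": 31}
generate_u = {"train_data_millions": 5.077, "epochs": 149}

def percent_of(part: float, whole: float) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

data_pct = percent_of(open_det["train_data_millions"], generate_u["train_data_millions"])
epoch_pct = percent_of(open_det["epochs"], generate_u["epochs"])
print(data_pct, epoch_pct)  # 1.5 20.8
```

Both values agree with the percentages stated in the abstract.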

Title: Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design

Authors: HsiaoYuan Hsu, Yuxin Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20649
Pdf URL: https://arxiv.org/pdf/2505.20649
Copy Paste: [[2505.20649]] Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design(https://arxiv.org/abs/2505.20649)
Keywords: generation
Abstract: In AI-empowered poster design, content-aware layout generation is crucial for the on-image arrangement of visual-textual elements, e.g., logo, text, and underlay. To perceive the background images, existing work demanded a high parameter count that far exceeds the size of available training data, which has impeded the model's real-time performance and generalization ability. To address these challenges, we proposed a patch-level data summarization and augmentation approach, vividly named Scan-and-Print. Specifically, the scan procedure selects only the patches suitable for placing element vertices to perform fine-grained perception efficiently. Then, the print procedure mixes up the patches and vertices across two image-layout pairs to synthesize over 100% new samples in each epoch while preserving their plausibility. Besides, to facilitate the vertex-level operations, a vertex-based layout representation is introduced. Extensive experimental results on widely used benchmarks demonstrated that Scan-and-Print can generate visually appealing layouts with state-of-the-art quality while dramatically reducing computational bottleneck by 95.2%.

Title: Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

Authors: Lujian Yao, Siming Zheng, Xinbin Yuan, Zhuoxuan Cai, Pu Wu, Jinwei Chen, Bo Li, Peng-Tao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20655
Pdf URL: https://arxiv.org/pdf/2505.20655
Copy Paste: [[2505.20655]] Photography Perspective Composition: Towards Aesthetic Perspective Recommendation(https://arxiv.org/abs/2505.20655)
Keywords: generation, quality assessment
Abstract: Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing the PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets through expert photographs. (2) A video generation approach that demonstrates the transformation process from suboptimal to optimal perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.

Title: Accelerating RL for LLM Reasoning with Optimal Advantage Regression

Authors: Kianté Brantley, Mingyu Chen, Zhaolin Gao, Jason D. Lee, Wen Sun, Wenhao Zhan, Xuezhou Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20686
Pdf URL: https://arxiv.org/pdf/2505.20686
Copy Paste: [[2505.20686]] Accelerating RL for LLM Reasoning with Optimal Advantage Regression(https://arxiv.org/abs/2505.20686)
Keywords: generation
Abstract: Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose $A^*$-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function $V^*$, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a single generation per prompt. Theoretically, we establish performance guarantees and prove that the KL-regularized RL objective can be optimized without requiring complex exploration strategies. Empirically, $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks, while reducing training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL. Implementation of $A^*$-PO can be found at this https URL.
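
The abstract names the two stages but gives no formulas. The sketch below assumes the standard closed form for KL-regularized RL, in which $V^*(x) = \beta \log E_{y \sim \pi_{ref}}[\exp(r(x,y)/\beta)]$ and the optimal policy satisfies $\beta \log(\pi^*(y|x)/\pi_{ref}(y|x)) = r(x,y) - V^*(x)$. The function names, the value of beta, and the toy numbers are hypothetical and are not taken from the authors' implementation.

```python
import math

def estimate_v_star(rewards, beta):
    """Stage 1 (assumed form): V*(x) ~= beta * log(mean(exp(r / beta)))
    over rewards of responses sampled offline from the reference policy."""
    m = max(r / beta for r in rewards)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(r / beta - m) for r in rewards) / len(rewards)
    return beta * (m + math.log(mean_exp))

def advantage_regression_loss(logp_policy, logp_ref, reward, v_star, beta):
    """Stage 2 (assumed form): squared error between the scaled log-ratio
    and the estimated optimal advantage r - V*, for one sampled response."""
    target = reward - v_star
    return (beta * (logp_policy - logp_ref) - target) ** 2

# Toy data: binary rewards of four offline samples for one prompt.
offline_rewards = [1.0, 0.0, 1.0, 0.0]
v = estimate_v_star(offline_rewards, beta=0.5)
loss = advantage_regression_loss(-1.2, -1.5, reward=1.0, v_star=v, beta=0.5)
print(round(v, 4), round(loss, 4))
```

Under this assumed form, the loss is zero exactly when the policy's log-ratio to the reference equals the optimal advantage divided by beta, which is why a single generation per prompt is enough to form a regression target.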

Title: Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series

Authors: Zachary C. Brown, David Carlson
Subjects: cs.LG, cs.AI, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20697
Pdf URL: https://arxiv.org/pdf/2505.20697
Copy Paste: [[2505.20697]] Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series(https://arxiv.org/abs/2505.20697)
Keywords: generation, generative
Abstract: The field of hypothesis generation promises to reduce costs in neuroscience by narrowing the range of interventional studies needed to study various phenomena. Existing machine learning methods can generate scientific hypotheses from complex datasets, but many approaches assume causal relationships are static over time, limiting their applicability to systems with dynamic, state-dependent behavior, such as the brain. While some techniques attempt dynamic causal discovery through factor models, they often restrict relationships to linear patterns or impose other simplifying assumptions. We propose a novel method that models dynamic graphs as a conditionally weighted superposition of static graphs, where each static graph can capture nonlinear relationships. This approach enables the detection of complex, time-varying interactions between variables beyond linear limitations. Our method improves F1-scores of predicted dynamic causal patterns by roughly 22-28% on average over baselines in some of our experiments, with some improvements reaching well over 60%. A case study on real brain data demonstrates our method's ability to uncover relationships linked to specific behavioral states, offering valuable insights into neural dynamics.
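
The phrase "conditionally weighted superposition of static graphs" can be read as $A_t = \sum_k w_k(t) A_k$. The sketch below illustrates that reading with plain adjacency matrices; the graphs, weights, and function name are made up, and this linear toy does not model the nonlinear relationships that each static graph can capture in the paper.

```python
def superpose(static_graphs, weights):
    """Return sum_k weights[k] * static_graphs[k] as a nested list."""
    n = len(static_graphs[0])
    combined = [[0.0] * n for _ in range(n)]
    for w, graph in zip(weights, static_graphs):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * graph[i][j]
    return combined

# Two hypothetical static graphs over three variables (row -> column edges).
graph_a = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]  # x0 -> x1
graph_b = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]  # x1 -> x2

# Hypothetical state-dependent weights at two time steps.
weights_over_time = [[1.0, 0.0], [0.25, 0.75]]
dynamic_graphs = [superpose([graph_a, graph_b], w) for w in weights_over_time]
print(dynamic_graphs[1])  # [[0.0, 0.25, 0.0], [0.0, 0.0, 0.75], [0.0, 0.0, 0.0]]
```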

Title: Hierarchical Instruction-aware Embodied Visual Tracking

Authors: Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20710
Pdf URL: https://arxiv.org/pdf/2505.20710
Copy Paste: [[2505.20710]] Hierarchical Instruction-aware Embodied Visual Tracking(https://arxiv.org/abs/2505.20710)
Keywords: generation
Abstract: User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive Goal-Aligned Policy, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at this https URL.

Title: LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation

Authors: Pascal Zwick, Nils Friederich, Maximilian Beichter, Lennart Hilbert, Ralf Mikut, Oliver Bringmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20723
Pdf URL: https://arxiv.org/pdf/2505.20723
Copy Paste: [[2505.20723]] LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation(https://arxiv.org/abs/2505.20723)
Keywords: generation, generative
Abstract: Enhancing the efficiency of high-quality image generation using Diffusion Models (DMs) is a significant challenge due to the iterative nature of the process. Flow Matching (FM) is emerging as a powerful generative modeling paradigm based on a simulation-free training objective instead of a score-based one used in DMs. Typical FM approaches rely on a Gaussian distribution prior, which induces curved, conditional probability paths between the prior and target data distribution. These curved paths pose a challenge for the Ordinary Differential Equation (ODE) solver, requiring a large number of inference calls to the flow prediction network. To address this issue, we present Learned Distribution-guided Flow Matching (LeDiFlow), a novel scalable method for training FM-based image generation models using a better-suited prior distribution learned via a regression-based auxiliary model. By initializing the ODE solver with a prior closer to the target data distribution, LeDiFlow enables the learning of more computationally tractable probability paths. These paths directly translate to fewer solver steps needed for high-quality image generation at inference time. Our method utilizes a State-Of-The-Art (SOTA) transformer architecture combined with latent space sampling and can be trained on a consumer workstation. We empirically demonstrate that LeDiFlow remarkably outperforms the respective FM baselines. For instance, when operating directly on pixels, our model accelerates inference by up to 3.75x compared to the corresponding pixel-space baseline. Simultaneously, our latent FM model enhances image quality on average by 1.32x in CLIP Maximum Mean Discrepancy (CMMD) metric against its respective baseline.
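
The link between path shape and solver cost can be seen in a scalar toy problem. The sketch below is not LeDiFlow: it uses hand-picked velocity fields instead of a learned network, with a constant velocity standing in for a straight path and a time-varying velocity standing in for a curved one. Both fields transport 0 to 3 over the unit time interval.

```python
def euler_solve(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)  # one velocity evaluation per step
    return x

def straight(x, t):
    return 3.0      # constant velocity: exact after a single Euler step

def curved(x, t):
    return 6.0 * t  # time-varying velocity: exact solution at t=1 is also 3

for steps in (1, 4, 64):
    print(steps, euler_solve(straight, 0.0, steps), euler_solve(curved, 0.0, steps))
```

In this toy, the constant field reaches the target with one evaluation, while the time-varying field needs many more steps to get close, which mirrors the motivation for a prior that yields more tractable paths.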

Title: Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

Authors: Yifei Wang, Weimin Bai, Colin Zhang, Debing Zhang, Weijian Luo, He Sun
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20755
Pdf URL: https://arxiv.org/pdf/2505.20755
Copy Paste: [[2505.20755]] Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction(https://arxiv.org/abs/2505.20755)
Keywords: generation
Abstract: In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, and $f$-distill, inside a theory-driven framework which we name Uni-Instruct. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performance. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of 1.46 for unconditional generation and 1.38 for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of 1.02, which outperforms its 79-step teacher diffusion by a significant margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct to broader tasks such as text-to-3D generation, where it gives decent results that slightly outperform previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the theoretical and empirical contributions of Uni-Instruct could help future studies on one-step diffusion distillation and knowledge transfer for diffusion models.

Title: ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

Authors: Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20764
Pdf URL: https://arxiv.org/pdf/2505.20764
Copy Paste: [[2505.20764]] ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval(https://arxiv.org/abs/2505.20764)
Keywords: generation
Abstract: Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at this https URL.

Title: MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning

Authors: Hongjia Liu, Rongzhen Zhao, Haohan Chen, Joni Pajarinen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20772
Pdf URL: https://arxiv.org/pdf/2505.20772
Copy Paste: [[2505.20772]] MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning(https://arxiv.org/abs/2505.20772)
Keywords: generation
Abstract: Learning object-level, structured representations is widely regarded as a key to better generalization in vision and underpins the design of next-generation Pre-trained Vision Models (PVMs). Mainstream Object-Centric Learning (OCL) methods adopt Slot Attention or its variants to iteratively aggregate objects' super-pixels into a fixed set of query feature vectors, termed slots. However, their reliance on a static slot count leads to an object being represented as multiple parts when the number of objects varies. We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. MetaSlot (i) maintains a codebook that holds prototypes of objects in a dataset by vector-quantizing the resulting slot representations; (ii) removes duplicate slots from the traditionally aggregated slots by quantizing them with the codebook; and (iii) injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize the aggregation. MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures. Across multiple public datasets and tasks--including object discovery and recognition--models equipped with MetaSlot achieve significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.
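
Step (ii) of the MetaSlot description, removing duplicate slots by quantizing them with the codebook, can be sketched as follows. The abstract does not state the exact rule, so this assumes nearest-prototype assignment by squared Euclidean distance and keeps the first slot per prototype; the codebook, slot values, and function names are hypothetical.

```python
def nearest_prototype(slot, codebook):
    """Index of the codebook vector closest to slot (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(slot, codebook[k]))

def deduplicate_slots(slots, codebook):
    """Keep the first slot assigned to each prototype; drop later duplicates."""
    seen, kept = set(), []
    for slot in slots:
        k = nearest_prototype(slot, codebook)
        if k not in seen:
            seen.add(k)
            kept.append(slot)
    return kept

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # hypothetical object prototypes
slots = [[0.1, -0.1], [0.9, 1.2], [0.0, 0.2], [4.8, 5.1]]  # slots 0 and 2 share a prototype
print(len(deduplicate_slots(slots, codebook)))  # 3
```

The number of surviving slots then follows the number of distinct prototypes matched in the image rather than a fixed slot count.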

Title: Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models

Authors: Yang Zheng, Wen Li, Zhaoqiang Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20789
Pdf URL: https://arxiv.org/pdf/2505.20789
Copy Paste: [[2505.20789]] Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models(https://arxiv.org/abs/2505.20789)
Keywords: generative
Abstract: Inverse problems (IPs) involve reconstructing signals from noisy observations. Traditional approaches often rely on handcrafted priors, which can fail to capture the complexity of real-world data. The advent of pre-trained generative models has introduced new paradigms, offering improved reconstructions by learning rich priors from data. Among these, diffusion models (DMs) have emerged as a powerful framework, achieving remarkable reconstruction performance across numerous IPs. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug (Wang et al., 2024), we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approach under appropriate conditions and validate its superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.

Title: Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Authors: Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20793
Pdf URL: https://arxiv.org/pdf/2505.20793
Copy Paste: [[2505.20793]] Rendering-Aware Reinforcement Learning for Vector Graphics Generation(https://arxiv.org/abs/2505.20793)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
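
The render-and-compare step can be illustrated with a toy reward. The abstract does not specify the reward function, so the sketch below assumes a simple one, 1 minus the mean squared pixel error, and skips rasterization entirely by using small hand-written grids as stand-ins for rendered SVG roll-outs. All names and values are hypothetical.

```python
def fidelity_reward(rendered, target):
    """Hypothetical reward: 1 - mean squared pixel error, for images in [0, 1]."""
    diffs = [(r - t) ** 2
             for row_r, row_t in zip(rendered, target)
             for r, t in zip(row_r, row_t)]
    return 1.0 - sum(diffs) / len(diffs)

target = [[0.0, 1.0], [1.0, 0.0]]  # the input image
rollouts = {  # stand-ins for rasterized SVG roll-outs
    "exact": [[0.0, 1.0], [1.0, 0.0]],
    "blurry": [[0.5, 0.5], [0.5, 0.5]],
}
rewards = {name: fidelity_reward(img, target) for name, img in rollouts.items()}
print(rewards)  # {'exact': 1.0, 'blurry': 0.75}
```

Because the reward is computed on rendered pixels rather than on SVG tokens, it needs no differentiable renderer, which is what makes it usable as an RL signal.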

Title: Not All Thats Rare Is Lost: Causal Paths to Rare Concept Synthesis

Authors: Bo-Kai Ruan, Zi-Xiang Ni, Bo-Lun Huang, Teng-Fang Hsiao, Hong-Han Shuai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20808
Pdf URL: https://arxiv.org/pdf/2505.20808
Copy Paste: [[2505.20808]] Not All Thats Rare Is Lost: Causal Paths to Rare Concept Synthesis(https://arxiv.org/abs/2505.20808)
Keywords: generation, generative
Abstract: Diffusion models have shown strong capabilities in high-fidelity image generation but often falter when synthesizing rare concepts, i.e., prompts that are infrequently observed in the training distribution. In this paper, we introduce RAP, a principled framework that treats rare concept generation as navigating a latent causal path: a progressive, model-aligned trajectory through the generative space from frequent concepts to rare targets. Rather than relying on heuristic prompt alternation, we theoretically justify that rare prompt guidance can be approximated by semantically related frequent prompts. We then formulate prompt switching as a dynamic process based on score similarity, enabling adaptive stage transitions. Furthermore, we reinterpret prompt alternation as a second-order denoising mechanism, promoting smooth semantic progression and coherent visual synthesis. Through this causal lens, we align input scheduling with the model's internal generative dynamics. Experiments across diverse diffusion backbones demonstrate that RAP consistently enhances rare concept generation, outperforming strong baselines in both automated evaluations and human studies.

Title: Frame-Level Captions for Long Video Generation with Complex Multi Scenes

Authors: Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20827
Pdf URL: https://arxiv.org/pdf/2505.20827
Copy Paste: [[2505.20827]] Frame-Level Captions for Long Video Generation with Complex Multi Scenes(https://arxiv.org/abs/2505.20827)
Keywords: generation
Abstract: Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because their step-by-step process naturally leads to a serious error accumulation (drift). Also, many existing ways to make long videos focus on single, continuous scenes, making them less useful for stories with many events and changes. This paper introduces a new approach to solve these problems. First, we propose a novel way to annotate datasets at the frame-level, providing detailed text guidance needed for making complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. A key feature is that each part (frame) within these windows can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly. We tested our approach on difficult VBench 2.0 benchmarks ("Complex Plots" and "Complex Landscapes") based on the WanX2.1-T2V-1.3B model. The results show our method is better at following instructions in complex, changing scenes and creates high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community. Project page: this https URL.

Title: Exploring Timeline Control for Facial Motion Generation

Authors: Yifeng Ma, Jinwei Qi, Chaonan Ji, Peng Zhang, Bang Zhang, Zhidong Deng, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20861
Pdf URL: https://arxiv.org/pdf/2505.20861
Copy Paste: [[2505.20861]] Exploring Timeline Control for Facial Motion Generation(https://arxiv.org/abs/2505.20861)
Keywords: generation
Abstract: This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, we first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.
摘要：本文引入了面部运动生成的新控制信号：时间轴控制。与音频和文本信号相比，时间表提供了更细粒度的控制，例如以精确的计时生成特定的面部运动。用户可以指定以时间间隔安排的面部动作的多轨时间表，从而可以精确控制每个动作的时间。为了建模时间线控制能力，我们首先注释在框架级粒度下自然面部运动序列中面部动作的时间间隔。 Toeplitz基于协方差的聚类促进了这一过程，以最大程度地减少人工劳动力。根据注释，我们提出了一个基于扩散的生成模型，能够生成自然且与输入时间表准确保持一致的面部运动。我们的方法通过使用chatgpt将文本转换为时间表来支持文本引导的运动生成。实验结果表明，我们的方法可以以令人满意的精度注释面部动作间隔，并产生与时间表准确排列的自然面部运动。
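As a rough illustration of what a multi-track timeline at frame-level granularity could look like, the sketch below rasterizes per-track time intervals into per-frame 0/1 masks. The timeline format, the track names, and the function `timeline_to_frame_mask` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical timeline format: track name -> list of (start_sec, end_sec) intervals.
Timeline = Dict[str, List[Tuple[float, float]]]


def timeline_to_frame_mask(
    timeline: Timeline, num_frames: int, fps: float
) -> Dict[str, List[int]]:
    """Rasterize each track into a per-frame 0/1 mask (1 = action active)."""
    masks: Dict[str, List[int]] = {}
    for track, intervals in timeline.items():
        mask = [0] * num_frames
        for start, end in intervals:
            first = max(0, int(round(start * fps)))
            last = min(num_frames, int(round(end * fps)))  # end frame is exclusive
            for frame in range(first, last):
                mask[frame] = 1
        masks[track] = mask
    return masks


# Illustrative two-track timeline for a 1-second clip at 10 fps.
timeline = {"blink": [(0.0, 0.2), (0.6, 0.8)], "smile": [(0.4, 1.0)]}
masks = timeline_to_frame_mask(timeline, num_frames=10, fps=10.0)
print(masks["blink"])
print(masks["smile"])
```

A generation model conditioned on timelines would presumably consume something like these masks alongside the noisy motion sequence; how the paper encodes them is not stated in the abstract.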

Title: Generalizable Heuristic Generation Through Large Language Models with Meta-Optimization

Authors: Yiding Shi, Jianan Zhou, Wen Song, Jieyi Bi, Yaoxin Wu, Jie Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20881
Pdf URL: https://arxiv.org/pdf/2505.20881
Copy Paste: [[2505.20881]] Generalizable Heuristic Generation Through Large Language Models with Meta-Optimization(https://arxiv.org/abs/2505.20881)
Keywords: generation
Abstract: Heuristic design with large language models (LLMs) has emerged as a promising approach for tackling combinatorial optimization problems (COPs). However, existing approaches often rely on manually predefined evolutionary computation (EC) optimizers and single-task training schemes, which may constrain the exploration of diverse heuristic algorithms and hinder the generalization of the resulting heuristics. To address these issues, we propose Meta-Optimization of Heuristics (MoH), a novel framework that operates at the optimizer level, discovering effective optimizers through the principle of meta-learning. Specifically, MoH leverages LLMs to iteratively refine a meta-optimizer that autonomously constructs diverse optimizers through (self-)invocation, thereby eliminating the reliance on a predefined EC optimizer. These constructed optimizers subsequently evolve heuristics for downstream tasks, enabling broader heuristic exploration. Moreover, MoH employs a multi-task training scheme to promote its generalization capability. Experiments on classic COPs demonstrate that MoH constructs an effective and interpretable meta-optimizer, achieving state-of-the-art performance across various downstream tasks, particularly in cross-size settings.
摘要：具有大型语言模型（LLM）的启发式设计已成为解决组合优化问题（COPS）的有前途的方法。但是，现有的方法通常依赖于手动预定义的进化计算（EC）优化器和单任务训练方案，这可能会限制对不同启发式算法的探索，并阻碍了由此产生的启发式方法的概括。为了解决这些问题，我们提出了对启发式方法的元优化（MOH），这是一个在优化器级别运行的新型框架，通过元学习原理发现有效的优化器。具体而言，MOH利用LLMS迭代地完善了一种自主通过（自我）调用来构建多种优化器的元淘汰仪，从而消除了对预定义的EC优化器的依赖。这些构建的优化器随后进化了下游任务的启发式方法，从而实现了更广泛的启发式探索。此外，MOH采用多任务培训计划来促进其概括能力。经典警察的实验表明，MOH构建了一种有效且可解释的元观察器，在各种下游任务，尤其是在跨尺寸设置中实现了最先进的性能。

Title: Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects

Authors: Wei Li, Hebei Li, Yansong Peng, Siying Wu, Yueyi Zhang, Xiaoyan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20909
Pdf URL: https://arxiv.org/pdf/2505.20909
Copy Paste: [[2505.20909]] Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects(https://arxiv.org/abs/2505.20909)
Keywords: generation, generative
Abstract: Diffusion models have significantly advanced text-to-image generation, laying the foundation for the development of personalized generative frameworks. However, existing methods lack precise layout controllability and overlook the potential of dynamic features of reference subjects in improving fidelity. In this work, we propose the Layout-Controllable Personalized Diffusion (LCP-Diffusion) model, a novel framework that integrates subject identity preservation with flexible layout guidance in a tuning-free approach. Our model employs a Dynamic-Static Complementary Visual Refining module to comprehensively capture the intricate details of reference subjects, and introduces a Dual Layout Control mechanism to enforce robust spatial control across both training and inference stages. Extensive experiments validate that LCP-Diffusion excels in both identity preservation and layout controllability. To the best of our knowledge, this is a pioneering work enabling users to "create anything anywhere".
摘要：扩散模型具有显着高级的文本到图像生成，为开发个性化生成框架奠定了基础。但是，现有方法缺乏精确的布局可控性，并且忽略了参考主体动态特征在改善保真度方面的潜力。在这项工作中，我们提出了可控制的个性化的个性化扩散（LCP扩散）模型，该模型是一种新颖的框架，将主题身份保存与灵活的布局指导集成在无调的方法中。我们的模型采用动态静态互补的视觉精炼模块来全面捕获参考主体的复杂细节，并引入了双重布局控制机制，以在训练和推理阶段强制实施强大的空间控制。广泛的实验验证了LCP扩散在身份保存和布局可控性方面均出色。据我们所知，这是一项开创性的工作，使用户可以“创建任何地方”。

Title: Geometry-Editable and Appearance-Preserving Object Compositon

Authors: Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20914
Pdf URL: https://arxiv.org/pdf/2505.20914
Copy Paste: [[2505.20914]] Geometry-Editable and Appearance-Preserving Object Compositon(https://arxiv.org/abs/2505.20914)
Keywords: generation
Abstract: General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.
摘要：一般对象组成（GOC）的目的是将目标对象无缝整合到具有所需几何特性的背景场景中，同时保留其细粒度的外观细节。最近的方法推导了语义嵌入，并将它们集成到高级扩散模型中，以实现几何形状的生成。但是，这些高度紧凑的嵌入仅编码高级语义提示，不可避免地丢弃细粒度的外观细节。我们引入了一个不列出的几何形状和外观保护扩散（DGAD）模型，该模型首先利用语义嵌入以隐式捕获所需的几何变换，然后采用交叉注意的检索机制来使精细的外观与几何形状编辑的表述保持一致，并保留两种精确的构图，并保持精确的构图。具体而言，DGAD建立在剪辑/恐龙衍生的和参考网络上，以提取语义嵌入和外观保留表示表示，然后将其无缝地集成到编码和解码管道中，以分散的方式。我们首先将语义嵌入到预训练的扩散模型中，这些扩散模型具有强大的空间推理能力，以隐式捕获对象几何形状，从而促进灵活的对象操纵并确保有效的编辑性。然后，我们设计了一种密集的跨注意机制，该机制利用隐式学习的对象几何形状及其相应的区域来检索和空间使外观特征取回，从而确保了忠实的外观一致性。公共基准的广泛实验证明了拟议的DGAD框架的有效性。

Title: ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

Authors: Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20935
Pdf URL: https://arxiv.org/pdf/2505.20935
Copy Paste: [[2505.20935]] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation(https://arxiv.org/abs/2505.20935)
Keywords: generation
Abstract: Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.
摘要：文本到图像扩散模型在生成单个现实场景方面表现出色，但要与多稳定场景挣扎，经常合并或省略对象。与以前的无培训方法不同，这些方法仅依赖于语义级别的指导而没有解决实例个性化，我们的无训练方法，语义注意力控制（ISAC），明确解决了通过实例第一建模方法来解决不完整的实例形成和语义纠缠。这使ISAC能够有效利用层次结构化的及时机制，解散多个对象实例，并单独将它们与相应的语义标签对齐。 ISAC在不采用任何外部模型的情况下实现了高达52％的平均多级准确性和83％的平均多效精度，可以有效地形成分离的实例。该代码将在出版时提供。

Title: OrienText: Surface Oriented Textual Image Generation

Authors: Shubham Singh Paliwal, Arushi Jain, Monika Sharma, Vikram Jamwal, Lovekesh Vig
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20958
Pdf URL: https://arxiv.org/pdf/2505.20958
Copy Paste: [[2505.20958]] OrienText: Surface Oriented Textual Image Generation(https://arxiv.org/abs/2505.20958)
Keywords: generation
Abstract: Textual content in images is crucial in e-commerce sectors, particularly in marketing campaigns, product imaging, advertising, and the entertainment industry. Current text-to-image (T2I) generation diffusion models, though proficient at producing high-quality images, often struggle to incorporate text accurately onto complex surfaces with varied perspectives, such as angled views of architectural elements like buildings, banners, or walls. In this paper, we introduce the Surface Oriented Textual Image Generation (OrienText) method, which leverages region-specific surface normals as conditional input to a T2I generation diffusion model. Our approach ensures accurate rendering and correct orientation of the text within the image context. We demonstrate the effectiveness of the OrienText method on a self-curated dataset of images and compare it against the existing textual image generation methods.
摘要：图像中的文本内容对于电子商务领域至关重要，尤其是在营销活动，产品成像，广告和娱乐行业中。当前的文本对图像（T2I）的生成扩散模型虽然精通生产高质量的图像，但通常很难将文本准确地纳入具有多种视角的复杂表面，例如建筑，横幅或墙壁等建筑元素的角度观点。在本文中，我们介绍了面向表面的文本图像生成（Orientext）方法，该方法将其作为T2I生成扩散模型的条件输入来利用区域特异性的表面正态。我们的方法可确保在图像上下文中准确渲染和正确定向文本。我们在图像的自我策划数据集中演示了东方方法的有效性，并将其与现有的文本图像生成方法进行比较。

Title: DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization

Authors: Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20975
Pdf URL: https://arxiv.org/pdf/2505.20975
Copy Paste: [[2505.20975]] DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization(https://arxiv.org/abs/2505.20975)
Keywords: generation
Abstract: Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at this https URL.
摘要：个性化的扩散模型通过将用户定义的概念注入到各种环境中，在文本对图像（T2I）生成方面取得了显着成功。但是，平衡概念忠诚与上下文一致性仍然是一个具有挑战性的开放问题。在这项工作中，我们提出了一种基于RL的方法，该方法利用T2I模型的各种输出来解决此问题。我们的方法通过使用外部质量指标生成合成的配对数据集来消除对人类通知得分的需求。这些更好的疑问对是专门构建的，以提高概念保真度和及时的遵守。此外，我们的方法支持图像保真度和文本对齐之间的权衡调整。通过多步训练，我们的方法在收敛速度和产出质量方面的表现优于天真的基线。我们进行了广泛的定性和定量分析，证明了我们方法在各种体系结构和微调技术中的有效性。可以在此HTTPS URL上找到源代码。
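One way to picture the construction of better-worse pairs from external quality metrics is sketched below. The selection rule used here (one sample must beat the other on both metrics) is an assumption made for illustration, as are the `Sample` fields and the function `build_preference_pairs`; the abstract does not specify the actual criterion.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    image_id: str
    concept_score: float  # hypothetical external concept-fidelity metric
    prompt_score: float   # hypothetical external prompt-adherence metric


def build_preference_pairs(
    samples: List[Sample], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Return (better_id, worse_id) pairs where one sample wins on both metrics."""
    pairs: List[Tuple[str, str]] = []
    for a, b in combinations(samples, 2):
        a_wins = (a.concept_score > b.concept_score + margin
                  and a.prompt_score > b.prompt_score + margin)
        b_wins = (b.concept_score > a.concept_score + margin
                  and b.prompt_score > a.prompt_score + margin)
        if a_wins:
            pairs.append((a.image_id, b.image_id))
        elif b_wins:
            pairs.append((b.image_id, a.image_id))
        # Samples that trade off one metric against the other yield no pair.
    return pairs


# Made-up scores for three generations of the same prompt.
samples = [
    Sample("img0", 0.80, 0.70),
    Sample("img1", 0.60, 0.50),
    Sample("img2", 0.90, 0.40),
]
pairs = build_preference_pairs(samples)
print(pairs)
```

Requiring a win on both metrics keeps ambiguous trade-off cases out of the training set; a weighted combination of the two scores would be another plausible choice.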

Title: BIPNN: Learning to Solve Binary Integer Programming via Hypergraph Neural Networks

Authors: Sen Bai, Chunqi Yang, Xin Bai, Xin Zhang, Zhengang Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20997
Pdf URL: https://arxiv.org/pdf/2505.20997
Copy Paste: [[2505.20997]] BIPNN: Learning to Solve Binary Integer Programming via Hypergraph Neural Networks(https://arxiv.org/abs/2505.20997)
Keywords: generation
Abstract: Binary (0-1) integer programming (BIP) is pivotal in scientific domains requiring discrete decision-making. With the advance of AI computing, recent works explore neural network-based solvers for integer linear programming (ILP) problems. Yet, they lack scalability for tackling nonlinear challenges. To handle nonlinearities, state-of-the-art Branch-and-Cut solvers employ linear relaxations, leading to exponential growth in auxiliary variables and severe computation limitations. To overcome these limitations, we propose BIPNN (Binary Integer Programming Neural Network), an unsupervised learning framework to solve nonlinear BIP problems via hypergraph neural networks (HyperGNN). Specifically, BIPNN reformulates BIPs -- constrained, discrete, and nonlinear (sin, log, exp) optimization problems -- into unconstrained, differentiable, and polynomial loss functions. The reformulation stems from the observation of a precise one-to-one mapping between polynomial BIP objectives and hypergraph structures, enabling the unsupervised training of HyperGNN to optimize BIP problems in an end-to-end manner. On this basis, we propose a GPU-accelerated and continuous-annealing-enhanced training pipeline for BIPNN. The pipeline enables BIPNN to optimize large-scale nonlinear terms in BIPs fully in parallel via straightforward gradient descent, thus significantly reducing the training cost while ensuring the generation of discrete, high-quality solutions. Extensive experiments on synthetic and real-world datasets highlight the superiority of our approach.
摘要：二进制（0-1）整数编程（BIP）在需要离散决策的科学领域中是关键的。随着AI计算的进步，最近的工作探索了基于神经网络的求解器，用于整数线性编程（ILP）问题。但是，他们缺乏应对非线性挑战的可扩展性。为了处理非线性，最先进的分支和切割的求解器采用线性松弛，从而导致辅助变量的指数增长和严重的计算限制。为了克服这些局限性，我们提出了BIPNN（二进制整数编程神经网络），这是一个无监督的学习框架，可通过HyperGraph Neural Networks（HyperGNN）解决非线性BIP问题。具体而言，BIPNN重新定义了BIPS约束，离散和非线性（sin，log，exp）优化问题 - 无约束，可区分和多项式损失函数。重新制作源于观察多项式BIP目标与超图结构之间的精确一对一映射，从而使无监督的HyperGNN训练以端到端的方式优化BIP问题。在此基础上，我们建议使用BIPNN的GPU加速且连续解放增强的培训管道。该管道使BIPNN能够通过直接梯度下降完全并联BIP中的大规模非线性项，从而大大降低了训练成本，同时确保生成离散的高质量解决方案。关于合成和现实世界数据集的广泛实验突出了我们方法的优越性。
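The correspondence between polynomial terms and hyperedges, and the idea of optimizing a relaxed polynomial loss by gradient descent, can be sketched in a few lines. This is a toy illustration only: it uses plain projected gradient descent on the relaxed variables rather than a HyperGNN, and it assumes each term is multilinear (no repeated index within a term, which loses nothing for binary variables since x^2 = x). The names `poly_value`, `poly_grad` and `minimize_relaxed` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A term c * x_i * x_j * ... is stored as (c, (i, j, ...)).
# Each index tuple can be read as one hyperedge over the variables it touches.
Term = Tuple[float, Tuple[int, ...]]


def poly_value(terms: Sequence[Term], x: Sequence[float]) -> float:
    """Evaluate the polynomial objective at x."""
    total = 0.0
    for coeff, idx in terms:
        prod = coeff
        for i in idx:
            prod *= x[i]
        total += prod
    return total


def poly_grad(terms: Sequence[Term], x: Sequence[float]) -> List[float]:
    """Gradient of the polynomial; assumes indices within a term are distinct."""
    grad = [0.0] * len(x)
    for coeff, idx in terms:
        for k in idx:
            prod = coeff
            for i in idx:
                if i != k:
                    prod *= x[i]
            grad[k] += prod
    return grad


def minimize_relaxed(terms: Sequence[Term], n: int,
                     steps: int = 200, lr: float = 0.1) -> List[int]:
    """Projected gradient descent on [0, 1]^n, then rounding to {0, 1}."""
    x = [0.5] * n
    for _ in range(steps):
        g = poly_grad(terms, x)
        x = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
    return [1 if xi >= 0.5 else 0 for xi in x]


# f(x) = -x0 - x1 + 3*x0*x1*x2 + 0.5*x2, with hyperedges {0}, {1}, {0,1,2}, {2}.
terms: List[Term] = [(-1.0, (0,)), (-1.0, (1,)), (3.0, (0, 1, 2)), (0.5, (2,))]
solution = minimize_relaxed(terms, n=3)
print(solution, poly_value(terms, solution))
```

On larger instances a simple relaxation like this can stall at fractional points, which is presumably where the annealing and the learned HyperGNN parameterization described in the abstract come in.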

Title: Facial Attribute Based Text Guided Face Anonymization

Authors: Mustafa İzzet Muştu, Hazım Kemal Ekenel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21002
Pdf URL: https://arxiv.org/pdf/2505.21002
Copy Paste: [[2505.21002]] Facial Attribute Based Text Guided Face Anonymization(https://arxiv.org/abs/2505.21002)
Keywords: generation, generative
Abstract: The increasing prevalence of computer vision applications necessitates handling vast amounts of visual data, often containing personal information. While this technology offers significant benefits, it should not compromise privacy. Data privacy regulations emphasize the need for individual consent for processing personal data, hindering researchers' ability to collect high-quality datasets containing the faces of the individuals. This paper presents a deep learning-based face anonymization pipeline to overcome this challenge. Unlike most of the existing methods, our method leverages recent advancements in diffusion-based inpainting models, eliminating the need for training Generative Adversarial Networks. The pipeline employs a three-stage approach: face detection with RetinaNet, feature extraction with VGG-Face, and realistic face generation using the state-of-the-art BrushNet diffusion model. BrushNet utilizes the entire image, face masks, and text prompts specifying desired facial attributes like age, ethnicity, gender, and expression. This enables the generation of natural-looking images with unrecognizable individuals, facilitating the creation of privacy-compliant datasets for computer vision research.
摘要：计算机视觉应用程序的越来越多的流行率需要处理大量的视觉数据，通常包含个人信息。尽管该技术提供了重大好处，但不应损害隐私。数据隐私法规强调需要个人同意处理个人数据，从而阻碍了研究人员收集包含个人面孔的高质量数据集的能力。本文提出了一个基于深度学习的面对匿名管道，以克服这一挑战。与大多数现有方法不同，我们的方法利用了基于扩散的镶嵌模型的最新进展，从而消除了训练生成对抗网络的需求。该管道采用了三阶段的方法：带视视网膜的面部检测，使用最新的Brushnet扩散模型提取vgg-Face的特征提取，以及逼真的面部生成。 Brushnet利用整个图像，面罩和文本提示，指定所需的面部属性，例如年龄，种族，性别和表达。这使其能够与无法识别的个体产生自然的图像，从而促进了符合隐私的数据集的创建计算机视觉研究。

Title: RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy

Authors: Aiyue Chen, Bin Dong, Jingru Li, Jing Lin, Yiwu Yao, Gongyi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21036
Pdf URL: https://arxiv.org/pdf/2505.21036
Copy Paste: [[2505.21036]] RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy(https://arxiv.org/abs/2505.21036)
Keywords: generation
Abstract: Video generation using diffusion models is highly computationally intensive, with 3D attention in Diffusion Transformer (DiT) models accounting for over 80% of the total computational resources. In this work, we introduce RainFusion, a novel training-free sparse attention method that exploits inherent sparsity nature in visual data to accelerate attention computation while preserving video quality. Specifically, we identify three unique sparse patterns in video generation attention calculations -- Spatial Pattern, Temporal Pattern and Textural Pattern. The sparse pattern for each attention head is determined online with negligible overhead (~0.2%) with our proposed ARM (Adaptive Recognition Module) during inference. Our proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models without additional training or calibration. We evaluate our method on leading open-sourced models including HunyuanVideo, OpenSoraPlan-1.2 and CogVideoX-5B, demonstrating its broad applicability and effectiveness. Experimental results show that RainFusion achieves over 2× speedup in attention computation while maintaining video quality, with only a minimal impact on VBench scores (-0.2%).
摘要：使用扩散模型的视频生成是高度计算密集的，在扩散变压器（DIT）模型中，占总计算资源的80 \％以上的3D注意力。在这项工作中，我们引入了{\ bf RainFusion}，这是一种新型的无训练稀疏注意方法，利用视觉数据中固有的稀疏性质以加速注意力计算，同时保持视频质量。具体而言，我们在视频生成注意计算中确定了三个独特的稀疏模式 - 空间模式，时间模式和纹理模式。每个注意力头的稀疏模式都可以在网上用可忽略不计的开销（\ textasciitilde \，0.2 \％）在线确定，并在推理过程中使用我们提出的{\ bf arm}（自适应识别模块）在线确定。我们提出的{\ bf RainFusion}是一种插件方法，可以无缝地集成到最新的3D开发视频生成模型中，而无需其他培训或校准。我们评估了我们的方法对领先的开源模型，包括Hunyuanvideo，OpenSoraplan-1.2和Cogvideox-5B，证明了其广泛的适用性和有效性。实验结果表明，在保持视频质量的同时，雨林促进{\ bf 2 \（\ times \）}加速速度，对VBENCH评分（-0.2 \％）的影响只有最小的影响。

Title: Advancing high-fidelity 3D and Texture Generation with 2.5D latents

Authors: Xin Yang, Jiantao Lin, Yingjie Xu, Haodong Li, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21050
Pdf URL: https://arxiv.org/pdf/2505.21050
Copy Paste: [[2505.21050]] Advancing high-fidelity 3D and Texture Generation with 2.5D latents(https://arxiv.org/abs/2505.21050)
Keywords: generation, generative
Abstract: Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, frequently leading to unsatisfactory coherence between geometry and texture. To address these challenges, we propose a novel framework for joint generation of 3D geometry and texture. Specifically, we focus on generating versatile 2.5D representations that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, utilizing both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels in generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.
摘要：尽管3D生成模型中有大规模的3D数据集以及进步，但3D几何和纹理数据的复杂性和不均匀质量仍然阻碍了3D生成技术的性能。在大多数现有方法中，使用不同的模型和非统一表示形式在单独的阶段生成3D几何和纹理，经常导致几何和纹理之间的连贯性不令人满意。为了应对这些挑战，我们为联合生成3D几何和质地提出了一个新颖的框架。具体而言，我们集中于生成一种多功能2.5D表示，可以在2D和3D之间无缝转换。我们的方法首先将多视RGB，正常和坐标图像集成到统一表示形式中，称为2.5D潜在。接下来，我们使用文本和图像条件调整了高保真2.5D生成的预先训练的2D基础模型。最后，我们引入了一个轻巧的2.5D到3D炼油厂框架框架，该框架有效地从2.5D图像中生成了详细的3D表示。广泛的实验表明，我们的模型不仅在产生具有连贯的结构和颜色的高质量的3D对象方面表现出色，而且从文本和图像输入中产生了颜色，而且在几何条件条件生成中的现有方法显着优于现有方法。

Title: Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Authors: Davide Lobba, Fulvio Sanguigni, Bin Ren, Marcella Cornia, Rita Cucchiara, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21062
Pdf URL: https://arxiv.org/pdf/2505.21062
Copy Paste: [[2505.21062]] Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals(https://arxiv.org/abs/2505.21062)
Keywords: generation
Abstract: While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format -- typically a flat, lay-down-style representation of the garment -- making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.
摘要：虽然虚拟试验（VTON）系统旨在将衣服呈现到目标人形象上，但本文解决了虚拟试验的新任务（VTOFF），该任务解决了逆问题：从衣服的个人照片中生成了服装的标准化产品图像。与必须解决多样化的姿势和样式变化的Vton不同，VTOFF受益于一致且定义明确的输出格式（通常是服装的平坦，外向型式代表），这使其成为数据生成和数据集增强的有前途的工具。但是，现有的VTOFF方法面临两个主要局限性：（i）将服装特征从遮挡和复杂的姿势中解脱出来的困难，通常会导致视觉伪像，以及（ii）限制对单类服装的适用性（例如，仅限上面的衣服），从而限制了普遍性。为了应对这些挑战，我们提出了文本增强的多类虚拟试验（Temu-vtoff），这是一种新颖的体系结构，具有基于双DIT的主链，具有改进的多模式注意机制，可用于健壮的服装特征提取。我们的架构旨在从图像，文本和口罩等多种模式中接收服装信息，以在多类别设置中工作。最后，我们提出了一个附加的对齐模块，以进一步完善生成的视觉细节。 Viton-HD和着装码数据集的实验表明，Temu-Vtoff在VTOFF任务上设定了新的最新技术，从而大大提高了目标服装的视觉质量和忠诚度。

Title: Minute-Long Videos with Dual Parallelisms

Authors: Zeqing Wang, Bowen Zheng, Xingyi Yang, Yuecong Xu, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21070
Pdf URL: https://arxiv.org/pdf/2505.21070
Copy Paste: [[2505.21070]] Minute-Long Videos with Dual Parallelisms(https://arxiv.org/abs/2505.21070)
Keywords: generation
Abstract: Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
摘要：基于扩散的变压器（DIT）的视频扩散模型会大规模生成高质量的视频，但会产生长时间视频的延迟和记忆成本。为了解决这个问题，我们提出了一种新颖的分布推理策略，称为双向。核心想法是，我们没有在单个GPU上生成整个视频，而是在跨GPU的时间框架和模型层并行。但是，该部门的天真实现面临关键限制：由于扩散模型需要跨帧的同步噪声水平，因此该实现会导致原始平行性的序列化。我们利用一个范围的denoising计划来处理这一点。也就是说，我们通过管道处理一系列框架块，并逐渐降低噪声水平。每个GPU在将先前的结果传递给下一个GPU的同时处理特定的块和层子集，从而实现异步计算和通信。为了进一步优化性能，我们结合了两个关键增强功能。首先，在每个GPU上实现了一个功能缓存，以存储和重复使用以上块作为上下文，从而最大程度地减少GPU通信和冗余计算。其次，我们采用协调的噪声初始化策略，通过在没有额外资源成本的情况下共享GPU的初始噪声模式，从而确保全球一致的时间动态。这些共同使快速，无伪影和无限长的视频产生。应用于最新的扩散变压器视频生成器，我们的方法有效地生产了1,025帧的视频，最高6.54美元$ \ times $降低延迟和1.48 $ \ times $降低记忆成本，8 $ \ times $ rtx 4090 gpus。
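The block-wise denoising idea, where frame blocks move through a pipeline while sitting at progressively lower noise levels, can be pictured with a small scheduling simulation. This is a sketch under simplifying assumptions: it tracks only which block is at which noise level on each tick and does not model GPUs, layer partitioning, the feature cache, or any actual denoising. The function `blockwise_schedule` and its arguments are hypothetical.

```python
from collections import deque
from typing import List, Tuple


def blockwise_schedule(
    num_blocks: int, num_levels: int
) -> Tuple[List[List[Tuple[int, int]]], List[int]]:
    """Simulate a sliding pipeline of frame blocks.

    On every tick one new fully noisy block enters (while any remain) and every
    block in flight is denoised by one level. Returns the per-tick list of
    (block_id, noise_level_before_the_step) and the order in which blocks finish.
    """
    in_flight: deque = deque()  # entries are [block_id, level], oldest first
    next_block = 0
    finished: List[int] = []
    ticks: List[List[Tuple[int, int]]] = []
    while len(finished) < num_blocks:
        if next_block < num_blocks:
            in_flight.append([next_block, num_levels])
            next_block += 1
        ticks.append([(block, level) for block, level in in_flight])
        for entry in in_flight:
            entry[1] -= 1  # one denoising step for every block in flight
        while in_flight and in_flight[0][1] == 0:
            finished.append(in_flight.popleft()[0])
    return ticks, finished


ticks, finished = blockwise_schedule(num_blocks=4, num_levels=3)
for t, state in enumerate(ticks):
    print(t, state)
```

In steady state the blocks in flight sit at different noise levels, with older blocks closer to clean, so no single step requires all frames to share one synchronized noise level.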

Title: DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response

Authors: Junjue Wang, Weihao Xuan, Heli Qi, Zhihao Liu, Kunyi Liu, Yuhan Wu, Hongruixuan Chen, Jian Song, Junshi Xia, Zhuo Zheng, Naoto Yokoya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21089
Pdf URL: https://arxiv.org/pdf/2505.21089
Copy Paste: [[2505.21089]] DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response(https://arxiv.org/abs/2505.21089)
Keywords: generation
Abstract: Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate a remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: 1) Multi-hazard: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. 2) Multi-sensor: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. 3) Multi-task: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLMs' reasoning ability by progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements across all tasks, with robust cross-sensor and cross-disaster generalization capabilities.
摘要：大型视觉模型（VLM）在地球视觉中取得了巨大的成就。但是，具有不同灾难类型，地理区域和卫星传感器的复杂灾难场景对VLM应用提出了新的挑战。为了填补这一空白，我们为全球规模的灾难评估和响应策划了一个遥感视觉语言数据集（Disasterm3）。 DileStM3包括26,988个双阶卫星图像和123K的指令对，具有三个特征：1）多危险：DivestM3涉及36个具有重大影响的历史灾难事件，这些事件被分类为10种常见的自然和人为灾难。 2）多传感器：灾难中的极端天气通常会阻碍光学传感器成像，因此有必要将合成孔径雷达（SAR）成像结合起来，以使其成为污水后场景。 3）多任务：基于现实世界的情况，Dielderm3包括9个与灾难有关的视觉感知和推理任务，从而利用VLM的推理能力的全部潜力，从灾难性的身体识别到结构性损害评估和对象关系推理的发展，在远面灾难的一代中造成了限制。我们在基准上广泛评估了14个通用和遥感VLM，揭示了最先进的模型与灾难任务斗争，这在很大程度上是由于缺乏灾难特异性的语料库，交叉传感器间隙和损害对象计数不敏感性。专注于这些问题，我们使用数据集对四个VLM进行了微调，并具有强大的跨传感器和交叉式概括能力，并在所有任务中实现稳定的改进。

Title: Instance Data Condensation for Image Super-Resolution

Authors: Tianhao Peng, Ho Man Kwan, Yuxuan Jiang, Ge Gao, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21099
Pdf URL: https://arxiv.org/pdf/2505.21099
Copy Paste: [[2505.21099]] Instance Data Condensation for Image Super-Resolution(https://arxiv.org/abs/2505.21099)
Keywords: super-resolution
Abstract: Deep learning based image Super-Resolution (ISR) relies on large training datasets to optimize model generalization; this requires substantial computational and storage resources during training. While dataset condensation has shown potential in improving data efficiency and privacy for high-level computer vision tasks, it has not yet been fully exploited for ISR. In this paper, we propose a novel Instance Data Condensation (IDC) framework specifically for ISR, which achieves instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching. This aims to optimize feature distributions at both global and local levels and obtain high-quality synthesized training content with fine detail. This framework has been utilized to condense the most commonly used training dataset for ISR, DIV2K, with a 10% condensation rate. The resulting synthetic dataset offers comparable or (in certain cases) even better performance compared to the original full dataset and excellent training stability when used to train various popular ISR models. To the best of our knowledge, this is the first time that a condensed/synthetic dataset (with a 10% data volume) has demonstrated such performance. The source code and the synthetic dataset have been made available at this https URL.
摘要：基于深度学习的图像超分辨率（ISR）依靠大型培训数据集来优化模型概括。这需要培训期间大量的计算和存储资源。尽管数据集凝结显示出在提高数据效率和高级计算机视觉任务的隐私方面的潜力，但尚未完全利用ISR。在本文中，我们提出了专门针对ISR的新型实例数据凝结（IDC）框架，该框架通过随机的局部傅立叶特征提取和多级特征分布匹配来实现实例级数据冷凝。这旨在优化全球和本地级别的特征分布，并以细节的细节获得高质量的合成培训内容。该框架已被用来凝结为ISR，DIV2K的最常用的训练数据集，其冷凝率为10％。与原始的完整数据集相比，所得的合成数据集提供了可比性的或（在某些情况下）更好的性能，并且在用于培训各种流行的ISR模型时具有出色的训练稳定性。据我们所知，这是第一次凝结/合成数据集（数据量为10％）证明了这种性能。源代码和合成数据集已在此HTTPS URL上提供。

Title: Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance

Authors: Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2505.21101
Pdf URL: https://arxiv.org/pdf/2505.21101
Copy Paste: [[2505.21101]] Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance(https://arxiv.org/abs/2505.21101)
Keywords: generation
Abstract: Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at this https URL
摘要：无分类器引导（CFG）是一种通过线性结合条件和无条件deoisiser的输出来改善条件扩散模型的广泛使用技术。尽管CFG提高了视觉质量并提高了提示的一致性，但它通常会降低样本多样性，从而导致质量和多样性之间的挑战。为了解决这个问题，我们做出了两个关键贡献。首先，CFG通常不对应于定义明确的DeNoising扩散模型（DDM）。特别是，与共同的直觉相反，CFG不会从与限制CFG分数相关的目标分布中产生样品，因为噪声水平接近零，其中数据分布被条件分布的功率$ w \ gt 1 $倾斜。我们确定缺失的组件：rényi的分歧术语，该术语充当排斥力，需要纠正CFG并与适当的DDM保持一致。我们的分析表明，此校正项在低噪声限制中消失。其次，在这种见解的动机上，我们提出了一种类似吉布斯的抽样程序，以从所需的倾斜分布中绘制样品。该方法从没有CFG的条件扩散模型的初始样本开始，并迭代地完善了它，可以保留多样性，同时逐步提高样本质量。我们在图像和文本生成任务上评估了我们的方法，证明了所有被考虑的指标对CFG的实质性改进。该代码可在此HTTPS URL上找到
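The linear combination that the abstract refers to can be sketched as follows. This is a minimal illustration of standard CFG only, not the Gibbs-like sampler the paper proposes; the function name `cfg_combine` and the toy vectors are assumptions made for the example, and the paper's exact parametrization of the guidance scale may differ.

```python
def cfg_combine(cond, uncond, w):
    """Standard classifier-free guidance on two denoiser outputs.

    Starts from the unconditional output and moves toward (and, for
    w > 1, past) the conditional output. `w` is the guidance scale.
    """
    if len(cond) != len(uncond):
        raise ValueError("denoiser outputs must have the same length")
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


# Toy denoiser outputs (hypothetical values, for illustration only).
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
guided = cfg_combine(cond, uncond, w=2.0)
```

With `w = 1` the result is the conditional output, and with `w = 0` it is the unconditional one; values above 1 extrapolate beyond the conditional output, which is the regime the abstract associates with reduced diversity.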

Title: Differentiable Solver Search for Fast Diffusion Sampling

Authors: Shuai Wang, Zexian Li, Qipeng Zhang, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21114
Pdf URL: https://arxiv.org/pdf/2505.21114
Copy Paste: [[2505.21114]] Differentiable Solver Search for Fast Diffusion Sampling(https://arxiv.org/abs/2505.21114)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprising time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet256 with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
摘要：扩散模型表现出了显着的产生质量，但以众多功能评估为代价。最近，已经开发了基于高级ODE的求解器，以减轻有限的采样步骤下反向扩散解决的大量计算需求。但是，这些求解器受亚当斯式多步方法的启发，仅依赖于与T相关的Lagrange插值。我们表明，与T相关的Lagrange插值对于扩散模型是次优的，并且揭示了由时间步骤和求解器系数组成的紧凑搜索空间。在我们的分析的基础上，我们提出了一种新型的可区分求解器搜索算法，以识别更最佳的求解器。配备了搜索的求解器，例如SIT-XL/2和FlowDCN-XL/2，在Imagenet256上分别达到2.40和2.35的FID得分，只有10个步骤。同时，DDPM型号DIT-XL/2仅以10个步骤达到2.33的FID分数。值得注意的是，我们搜索的求解器的表现要优于传统求解器。此外，我们的搜索求解器展示了各种模型体系结构，分辨率和模型尺寸的通用性。
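For readers unfamiliar with the "t-related Lagrange interpolation" that the abstract criticizes, the sketch below shows how multistep-style solvers obtain weights that depend only on the time steps. It illustrates the baseline idea, not the paper's search algorithm; the function names are assumptions made for the example.

```python
def lagrange_weights(nodes, t):
    """Lagrange basis polynomials l_j(t) for the given time nodes.

    The weights depend only on the time steps, not on the model
    outputs observed at those steps.
    """
    weights = []
    for j, tj in enumerate(nodes):
        w = 1.0
        for m, tm in enumerate(nodes):
            if m != j:
                w *= (t - tm) / (tj - tm)
        weights.append(w)
    return weights


def extrapolate(nodes, values, t):
    """Estimate a quantity at time t from its values at earlier nodes."""
    return sum(w * v for w, v in zip(lagrange_weights(nodes, t), values))


# Three past steps of f(t) = t**2, extrapolated to t = 3.
estimate = extrapolate([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 3.0)
```

Because the weights are fixed once the time steps are chosen, they cannot adapt to the model being sampled. The abstract's proposal is to treat both the time steps and the coefficients as searchable parameters instead.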

Title: Learning Single Index Models with Diffusion Priors

Authors: Anqi Tang, Youming Chen, Shuchen Xue, Zhaoqiang Liu
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21135
Pdf URL: https://arxiv.org/pdf/2505.21135
Copy Paste: [[2505.21135]] Learning Single Index Models with Diffusion Priors(https://arxiv.org/abs/2505.21135)
Keywords: generative
Abstract: Diffusion models (DMs) have demonstrated remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have discontinuous and unknown link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis on the effectiveness of the proposed methods has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.
摘要：扩散模型（DMS）通过有效地对复杂的数据分布进行建模，表现出非常出色的能力，可以产生多样化和高质量的图像。它们还被探讨为信号恢复的强大生成先验，从而大大提高了重建信号的质量。但是，对信号回收模型的现有研究要么着重于特定的重建问题，要么无法处理具有不连续或未知链路功能的非线性测量模型。在这项工作中，我们专注于使用DMS来从半参数单索引模型中获得准确的恢复，该模型涵盖了可能具有{\ em em doctionuul}和{\ em Unknown}链接函数的各种流行的非线性模型。我们提出了一种有效的重建方法，该方法仅需要一轮无条件采样和（部分）DMS反转。在适当的条件下已经建立了有关拟议方法有效性的理论分析。我们在图像数据集上对不同的非线性测量模型进行数值实验。我们观察到，与竞争方法相比，我们的方法可以产生更准确的重建，同时使用明显更少的神经功能评估。
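A single index model observes a signal only through a scalar link function applied to linear projections. The sketch below generates such measurements with a discontinuous link (the sign function, i.e. 1-bit measurements). The Gaussian sensing vectors, the function names, and the toy signal are assumptions made for illustration; the recovery method described in the abstract is not shown.

```python
import random


def measure(x, num_measurements, link, seed=0):
    """Single index measurements y_i = link(<a_i, x>).

    Sensing vectors a_i are drawn i.i.d. standard Gaussian here, which
    is an assumption of this sketch rather than a detail from the source.
    """
    rng = random.Random(seed)
    ys = []
    for _ in range(num_measurements):
        a = [rng.gauss(0.0, 1.0) for _ in x]
        ys.append(link(sum(ai * xi for ai, xi in zip(a, x))))
    return ys


def sign(z):
    """A discontinuous link function: 1-bit quantization."""
    return 1.0 if z >= 0 else -1.0


# Toy signal (hypothetical values).
y = measure([0.5, -1.0, 2.0], num_measurements=8, link=sign)
```

Swapping `sign` for another callable changes the measurement model without touching the rest of the code, which mirrors the semi-parametric setting where the link may be unknown to the reconstruction method.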

Title: SageAttention2++: A More Efficient Implementation of SageAttention2

Authors: Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen
Subjects: cs.LG, cs.AI, cs.AR, cs.CV
Abstract URL: https://arxiv.org/abs/2505.21136
Pdf URL: https://arxiv.org/pdf/2505.21136
Copy Paste: [[2505.21136]] SageAttention2++: A More Efficient Implementation of SageAttention2(https://arxiv.org/abs/2505.21136)
Keywords: generation
Abstract: The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at this https URL.
摘要：注意效率至关重要，因为它的时间复杂性随序列长度四倍地增长。 SageAttention2通过利用量化来加速矩阵乘法（MATMUL）来解决此问题。为了进一步加速sageattention2，我们建议利用FP16中积累的FP8矩阵的更快指导。该指令比SageAttention 2中使用的FP8矩阵快2倍。我们的实验表明，SageAttention2 ++在闪光灯方面达到了3.9倍的速度，同时保持与SageAttention的注意力准确性2。这意味着SageAttention2 ++有效地加速了各种模型，包括语言，图像和视频生成的模型，并具有可忽略的端到端指标损失。该代码将在此HTTPS URL上可用。

Title: A Predicting Phishing Websites Using Support Vector Machine and MultiClass Classification Based on Association Rule Techniques

Authors: Nancy C. Woods, Virtue Ene Agada, Adebola K. Ojo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21141
Pdf URL: https://arxiv.org/pdf/2505.21141
Copy Paste: [[2505.21141]] A Predicting Phishing Websites Using Support Vector Machine and MultiClass Classification Based on Association Rule Techniques(https://arxiv.org/abs/2505.21141)
Keywords: generation
Abstract: Phishing is a semantic attack which targets the user rather than the computer. It is a new Internet crime in comparison with other forms such as viruses and hacking. Considering the damage phishing websites have caused to various economies by collapsing organizations, stealing information, and diverting funds, various researchers have embarked on different ways of detecting phishing websites, but there has been no agreement about the best algorithm to be used for prediction. This study is interested in integrating the strengths of two algorithms, Support Vector Machines (SVM) and Multi-Class Classification Rules based on Association Rules (MCAR), to establish a stronger and better means of predicting phishing websites. A total of 11,056 websites from both PhishTank and the Yahoo directory were used to verify the effectiveness of this approach. Feature extraction and rule generation were done by the MCAR technique; classification and prediction were done by the SVM technique. The result showed that the technique achieved 98.30% classification accuracy with a computation time of 2205.33s and a minimum error rate. It showed a total of 98% Area under the Curve (AUC), which showed the proportion of accuracy in classifying phishing websites. The model showed 82.84% variance in the prediction of phishing websites based on the coefficient of determination. The use of the two techniques together in detecting phishing websites produced a more accurate result, as it combined the strengths of both techniques. This research work capitalized on this advantage by building a hybrid of the two techniques to help produce a more accurate result.
摘要：网络钓鱼是针对用户而不是计算机的语义攻击。与其他形式（例如病毒和黑客）相比，这是一种新的互联网犯罪。考虑到网络网站通过崩溃，窃取信息和金融转移而造成的损害网站对各种经济体造成了，各种研究人员启动了检测网络钓鱼网站的不同方式，但对预测的最佳算法没有达成共识。这项研究有兴趣基于协会规则（MCAR）的两种算法，支持向量机（SVM）和多类分类规则的优势，以建立一种预测网络钓鱼网站的强大，更好的方法。来自Phishtank和Yahoo目录总共使用了11,056个网站来验证这种方法的有效性。特征提取和规则生成是由MCAR技术完成的；分类和预测是通过SVM技术完成的。结果表明，该技术以2205.33的计算时间（最低误差率）达到了98.30％的分类精度。它显示了曲线（AUC）下98％的面积，该面积显示了对网络钓鱼网站进行分类的准确性比例。该模型根据确定系数显示了网站网站预测的82.84％差异。在检测网络钓鱼网站中使用两种技术会产生更准确的结果，因为它分别结合了两种技术的强度。这项研究通过建立两种技术的混合体来帮助产生更准确的结果，从而集中了这一优势。

Title: FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention

Authors: Sergey Karpukhin, Vadim Titov, Andrey Kuznetsov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21144
Pdf URL: https://arxiv.org/pdf/2505.21144
Copy Paste: [[2505.21144]] FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention(https://arxiv.org/abs/2505.21144)
Keywords: generation
Abstract: In recent years, a plethora of identity-preserving adapters for personalized generation with diffusion models have been released. Their main disadvantage is that they are predominantly trained jointly with base diffusion models, which suffer from slow multi-step inference. This work aims to tackle the challenge of training-free adaptation of pretrained ID-adapters to diffusion models accelerated via distillation. Through careful re-design of classifier-free guidance for few-step stylistic generation and attention manipulation mechanisms in decoupled blocks to improve identity similarity and fidelity, we propose the universal FastFace framework. Additionally, we develop a disentangled public evaluation protocol for ID-preserving adapters.
摘要：在最新几年中，已经发布了具有扩散模型的个性化一代的多种身份保护适配器。他们的主要缺点是，它们是与基础扩散模型共同训练的，这些模型遭受了缓慢的多步推断。这项工作旨在应对通过蒸馏通过蒸馏加速的扩散模型的无训练适应的挑战 - 通过仔细重新设计无分类器的指导，以在几步之中的样式产生和注意力操纵机制中，以脱钩的块，以提高身份相似性，我们提高了普遍的快速范围。此外，我们为ID保护适配器开发了一个分解的公共评估协议。

Title: STEB: In Search of the Best Evaluation Approach for Synthetic Time Series

Authors: Michael Stenger, Robert Leppich, André Bauer, Samuel Kounev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21160
Pdf URL: https://arxiv.org/pdf/2505.21160
Copy Paste: [[2505.21160]] STEB: In Search of the Best Evaluation Approach for Synthetic Time Series(https://arxiv.org/abs/2505.21160)
Keywords: generative
Abstract: The growing need for synthetic time series, due to data augmentation or privacy regulations, has led to numerous generative models, frameworks, and evaluation measures alike. Objectively comparing these measures on a large scale remains an open challenge. We propose the Synthetic Time series Evaluation Benchmark (STEB) -- the first benchmark framework that enables comprehensive and interpretable automated comparisons of synthetic time series evaluation measures. Using 10 diverse datasets, randomness injection, and 13 configurable data transformations, STEB computes indicators for measure reliability and score consistency. It tracks running time, test errors, and features sequential and parallel modes of operation. In our experiments, we determine a ranking of 41 measures from literature and confirm that the choice of upstream time series embedding heavily impacts the final score.
摘要：由于数据扩大或隐私法规，对合成时间序列的需求日益增长，导致了许多生成模型，框架和评估措施。客观地比较大规模的这些措施仍然是一个开放的挑战。我们提出了合成时间序列评估基准（Steb） - 第一个基准框架，可以实现合成时间序列评估指标的全面且可解释的自动比较。使用10个不同的数据集，随机注入和13个可配置的数据转换，Steb计算指标，以衡量可靠性和得分一致性。它跟踪运行时间，测试错误以及具有顺序和并行操作模式。在我们的实验中，我们确定了文献中41种措施的排名，并确认上游时间序列的选择嵌入了很大的最终分数。

Title: Topological Deep Learning for Speech Data

Authors: Zhiwang Yu
Subjects: cs.LG, cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21173
Pdf URL: https://arxiv.org/pdf/2505.21173
Copy Paste: [[2505.21173]] Topological Deep Learning for Speech Data(https://arxiv.org/abs/2505.21173)
Keywords: generation
Abstract: Topological data analysis (TDA) offers novel mathematical tools for deep learning. Inspired by Carlsson et al., this study designs topology-aware convolutional kernels that significantly improve speech recognition networks. Theoretically, by investigating orthogonal group actions on kernels, we establish a fiber-bundle decomposition of matrix spaces, enabling new filter generation methods. Practically, our proposed Orthogonal Feature (OF) layer achieves superior performance in phoneme recognition, particularly in low-noise scenarios, while demonstrating cross-domain adaptability. This work reveals TDA's potential in neural network optimization, opening new avenues for mathematics-deep learning interdisciplinary studies.
摘要：拓扑数据分析（TDA）提供了用于深度学习的新型数学工具。受Carlsson等人的启发，这项研究设计了拓扑感知的卷积内核，可显着改善语音识别网络。从理论上讲，通过研究内核上的正交组作用，我们建立了矩阵空间的纤维束分解，从而实现了新的滤波器生成方法。实际上，我们提出的正交特征（OF）层在音素识别中取得了卓越的性能，尤其是在低噪声场景中，同时展示了跨域的适应性。这项工作揭示了TDA在神经网络优化方面的潜力，为数学学习跨学科研究开辟了新的途径。

Title: Latent label distribution grid representation for modeling uncertainty

Authors: ShuNing Sun, YinSong Xiong, Yu Zhang, Zhuoran Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21180
Pdf URL: https://arxiv.org/pdf/2505.21180
Copy Paste: [[2505.21180]] Latent label distribution grid representation for modeling uncertainty(https://arxiv.org/abs/2505.21180)
Keywords: generation
Abstract: Although Label Distribution Learning (LDL) has promising representation capabilities for characterizing the polysemy of an instance, the complexity and high cost of label distribution annotation lead to inexactness in the construction of the label space. The existence of a large number of inexact labels generates a label space with uncertainty, which misleads the LDL algorithm into making incorrect decisions. To alleviate this problem, we model the uncertainty of label distributions by constructing a Latent Label Distribution Grid (LLDG) to form a low-noise representation space. Specifically, we first construct a label correlation matrix based on the differences between labels, and then expand each value of the matrix into a vector that obeys a Gaussian distribution, thus building an LLDG to model the uncertainty of the label space. Finally, the LLDG is reconstructed by the LLDG-Mixer to generate an accurate label distribution. Note that we enforce a customized low-rank scheme on this grid, which assumes that the label relations may be noisy and that noise reduction must be performed with the help of a Tucker reconstruction technique. Furthermore, we attempt to evaluate the effectiveness of the LLDG by considering its generation as an upstream task to achieve the classification of the objects. Extensive experimental results show that our approach performs competitively on several benchmarks.
摘要：尽管\ textbf {l} abel \ textbf {d} iStribution \ textbf {l} restning（ldl）具有有希望的表示能力，可以表征实例的多义，复杂性和高成本的标签分发注释导致标记空间的构造中不成熟的。大量不精确标签的存在会产生一个具有不确定性的标签空间，这误导了LDL算法以产生错误的决策。为了减轻这个问题，我们通过构造\ textbf {l} atent \ textbf {l} abel \ textbf {d} istribution \ textbf {g} rid（lldg）来形成一个低含量的表示空间，来模拟标签分布的不确定性。具体而言，我们首先根据标签之间的差异构建标签相关矩阵，然后将矩阵的每个值扩展到遵守高斯分布的向量中，从而构建LLDG以建模标签空间的不确定性。最后，LLDG由LLDG-Mixer重建以生成准确的标签分布。请注意，我们在此电网上强制执行定制的低级方案，该方案假设标签关系可能很嘈杂，并且需要借助塔克重建技术执行降噪。此外，我们试图通过将其一代作为实现对象分类的上游任务来评估LLDG的有效性。广泛的实验结果表明，我们的方法在几个基准测试中竞争性能。

Title: PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing

Authors: Yu Yan, Sheng Sun, Zhifei Zheng, Ziji Hao, Teli Liu, Min Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.21184
Pdf URL: https://arxiv.org/pdf/2505.21184
Copy Paste: [[2505.21184]] PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing(https://arxiv.org/abs/2505.21184)
Keywords: generation
Abstract: To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the base templates in a counterfactual manner. Subsequently, we decompose each base template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.
摘要：为了构建负责任的AI应用程序，有害信息数据被广泛用于对抗测试和保障措施的开发。现有研究主要利用大型语言模型（LLM）综合数据以大规模获取高质量的任务数据集，从而避免了昂贵的人类注释。但是，受LLMS的安全一致性机制的限制，有害数据的综合仍然面临着发电性可靠性和内容多样性的挑战。在这项研究中，我们提出了一个新型有害信息综合框架PoisonSwarm，该框架采用模型众包策略来生成各种有害数据，同时保持高成功率。具体而言，我们以反事实方式生成大量的良性数据作为基于的模板。随后，我们将每个基于的模板分解为多个语义单元，并通过动态模型切换执行单位造血和最终改进，从而确保合成的成功。实验结果表明，PoisonSwarm在综合具有高可扩展性和多样性的不同类别的有害数据中实现最先进的性能。

Title: 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics-Based Appearance-Medium Decouplin

Authors: Jieyu Yuan, Yujun Li, Yuanlin Zhang, Chunle Guo, Xiongxin Tang, Ruixing Wang, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21238
Pdf URL: https://arxiv.org/pdf/2505.21238
Copy Paste: [[2505.21238]] 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics-Based Appearance-Medium Decouplin(https://arxiv.org/abs/2505.21238)
Keywords: restoration
Abstract: Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in water bodies introduce inhomogeneous medium attenuation interference that disrupts the conventional volume rendering assumption of a uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with inhomogeneous underwater environments where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a distance-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at this https URL
摘要：水下现场重建的新型视图综合提出了由于复杂的光线作用相互作用而引起的独特挑战。水体中的光学散射和吸收带来了不均匀的培养基衰减干扰，从而破坏了均匀传播培养基的常规体积渲染假设。尽管3D高斯碎片（3DGS）具有实时的渲染功能，但它与水下不均匀环境中挣扎，在这些环境中，散射媒体会引入人工制品和不一致的外观。在这项研究中，我们提出了一个基于物理的框架，该框架通过量身定制的高斯建模将物体外观从水中效应中脱离。我们的方法引入了外观嵌入，这些嵌入方式是反向散射和衰减的明确中等表示，增强了场景的一致性。此外，我们提出了一种距离引导的优化策略，该策略利用伪深度地图作为监督，并具有深度正则化和规模罚款条款，以提高几何忠诚度。通过通过水下成像模型整合提出的外观和中型建模组件，我们的方法既可以实现高质量的新型视图合成又可以实现物理准确的场景恢复。实验证明了我们对现有方法的质量和恢复精度的显着改善。该项目页面可在\ href {this https url} {此https url

Title: Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework

Authors: Mustafa Hajij, Lennart Bastian, Sarah Osentoski, Hardik Kabaria, John L. Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G. Kocheemoolayil, Nastaran Shahmansouri, Adrian Lew, Theodore Papamarkou, Tolga Birdal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21251
Pdf URL: https://arxiv.org/pdf/2505.21251
Copy Paste: [[2505.21251]] Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework(https://arxiv.org/abs/2505.21251)
Keywords: generation
Abstract: We introduce copresheaf topological neural networks (CTNNs), a powerful and unifying framework that encapsulates a wide spectrum of deep learning architectures, designed to operate on structured data: including images, point clouds, graphs, meshes, and topological manifolds. While deep learning has profoundly impacted domains ranging from digital assistants to autonomous systems, the principled design of neural architectures tailored to specific tasks and data types remains one of the field's most persistent open challenges. CTNNs address this gap by grounding model design in the language of copresheaves, a concept from algebraic topology that generalizes and subsumes most practical deep learning models in use today. This abstract yet constructive formulation yields a rich design space from which theoretically sound and practically effective solutions can be derived to tackle core challenges in representation learning: long-range dependencies, oversmoothing, heterophily, and non-Euclidean domains. Our empirical results on structured data benchmarks demonstrate that CTNNs consistently outperform conventional baselines, particularly in tasks requiring hierarchical or localized sensitivity. These results underscore CTNNs as a principled, multi-scale foundation for the next generation of deep learning architectures.
摘要：我们介绍了一个强大而统一的框架，它介绍了Copresheaf拓扑神经网络（CTNNS），它封装了广泛的深度学习体系结构，旨在在结构化数据上运行：包括图像，点云，图形，网格，网格和拓扑歧管。尽管深度学习对从数字助理到自主系统的深刻影响，但针对特定任务和数据类型量身定制的神经体系结构的原则设计仍然是该领域最持久的开放挑战之一。 CTNNS通过以Copresheves语言为基础的模型设计来解决这一差距，这一概念是代数拓扑的概念，它概括了当今使用的最实用的深度学习模型。这种抽象而建设性的配方产生了一个丰富的设计空间，从理论上讲是合理的和实际上有效的解决方案，以应对代表性学习中的核心挑战：远程依赖性，超平面，异性恋和非欧巴文领域。我们对结构化数据基准的经验结果表明，CTNN始终超过常规基准，尤其是在需要层次或局部灵敏度的任务中。这些结果强调了CTNNS作为下一代深度学习体系结构的原则性多尺度基础。

Title: Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation

Authors: Changguanng Wu, Jiangxin Dong, Chengjian Li, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21258
Pdf URL: https://arxiv.org/pdf/2505.21258
Copy Paste: [[2505.21258]] Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation(https://arxiv.org/abs/2505.21258)
Keywords: restoration
Abstract: We present Plenodium (plenoptic medium), an effective and efficient 3D representation framework capable of jointly modeling both objects and participating media. In contrast to existing medium representations that rely solely on view-dependent modeling, our novel plenoptic medium representation incorporates both directional and positional information through spherical harmonics encoding, enabling highly accurate underwater scene reconstruction. To address the initialization challenge in degraded underwater environments, we propose the pseudo-depth Gaussian complementation to augment COLMAP-derived point clouds with robust depth priors. In addition, a depth ranking regularized loss is developed to optimize the geometry of the scene and improve the ordinal consistency of the depth maps. Extensive experiments on real-world underwater datasets demonstrate that our method achieves significant improvements in 3D reconstruction. Furthermore, we construct a simulated dataset with ground truth and a controllable scattering medium to demonstrate the restoration capability of our method in underwater scenarios. Our code and dataset are available at this https URL.
摘要：我们提出了全球（全体介质），这是一个有效而有效的3D表示框架，能够共同对物体和参与培养基进行共同建模。与仅依赖于视图依赖性建模的现有介质表示相反，我们的新颖载体介质表示通过球形谐波编码结合了方向和位置信息，从而实现了高度准确的水下场景重建。为了应对退水的水下环境中的初始化挑战，我们提出了伪深度的高斯互补，以增强Colmap衍生的点云具有强大的深度先验。此外，开发了一个深度排名的正规损失，以优化场景的几何形状并提高深度图的顺序一致性。对实际水下数据集的广泛实验表明，我们的方法在3D重建方面取得了重大改进。此外，我们进行了一个带有地面真理和可控散射介质的模拟数据集，以证明我们在水下场景中方法的恢复能力。我们的代码和数据集可在此HTTPS URL上找到。

Title: DiMoSR: Feature Modulation via Multi-Branch Dilated Convolutions for Efficient Image Super-Resolution

Authors: M. Akin Yilmaz, Ahmet Bilican, A. Murat Tekalp
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.21262
Pdf URL: https://arxiv.org/pdf/2505.21262
Copy Paste: [[2505.21262]] DiMoSR: Feature Modulation via Multi-Branch Dilated Convolutions for Efficient Image Super-Resolution(https://arxiv.org/abs/2505.21262)
Keywords: super-resolution
Abstract: Balancing reconstruction quality versus model efficiency remains a critical challenge in lightweight single image super-resolution (SISR). Despite the prevalence of attention mechanisms in recent state-of-the-art SISR approaches that primarily emphasize or suppress feature maps, alternative architectural paradigms warrant further exploration. This paper introduces DiMoSR (Dilated Modulation Super-Resolution), a novel architecture that enhances feature representation through modulation to complement attention in lightweight SISR networks. The proposed approach leverages multi-branch dilated convolutions to capture rich contextual information over a wider receptive field while maintaining computational efficiency. Experimental results demonstrate that DiMoSR outperforms state-of-the-art lightweight methods across diverse benchmark datasets, achieving superior PSNR and SSIM metrics with comparable or reduced computational complexity. Through comprehensive ablation studies, this work not only validates the effectiveness of DiMoSR but also provides critical insights into the interplay between attention mechanisms and feature modulation to guide future research in efficient network design. The code and model weights to reproduce our results are available at: this https URL
摘要：平衡重建质量与模型效率仍然是轻质单图像超分辨率（SISR）的关键挑战。尽管最近最先进的SISR方法的注意机制流行率主要强调或抑制特征地图，但替代建筑范式仍需要进一步探索。本文介绍了DiMOSR（扩张的调制超分辨率），这是一种新型架构，通过调制来增强特征表示，以补充轻量级SISR网络中的注意力。所提出的方法利用多分支扩张的卷积，在保持计算效率的同时，在更广泛的接受场上捕获丰富的上下文信息。实验结果表明，DIMOSR的表现优于各种基准数据集的最先进的轻量级方法，从而实现了具有可比或降低计算复杂性的优质PSNR和SSIM指标。通过全面的消融研究，这项工作不仅验证了DIMOSR的有效性，而且还为注意力机制和功能调制之间的相互作用提供了重要的见解，以指导有效的网络设计中的未来研究。复制我们的结果的代码和模型权重可用：此HTTPS URL
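The claim that dilated convolutions cover a wider receptive field at the same cost can be checked with a few lines of arithmetic. The sketch below is a 1-D illustration with hypothetical dilation rates; it is not the DiMoSR architecture, whose branch configuration is not given in the abstract.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Valid (no padding) 1-D dilated correlation in pure Python.

    The number of multiply-adds per output is len(kernel) regardless
    of the dilation rate; only the spacing between taps changes.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


# A 3-tap kernel at dilation rates 1, 2, 3 (rates chosen for illustration).
sizes = [effective_kernel_size(3, d) for d in (1, 2, 3)]
out = dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2)
```

Running several such branches in parallel with different rates, as the abstract's "multi-branch" wording suggests, gathers context at several scales while each branch keeps the parameter count of a small kernel.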

Title: UGCE: User-Guided Incremental Counterfactual Exploration

Authors: Christos Fragkathoulas, Evaggelia Pitoura
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21330
Pdf URL: https://arxiv.org/pdf/2505.21330
Copy Paste: [[2505.21330]] UGCE: User-Guided Incremental Counterfactual Exploration(https://arxiv.org/abs/2505.21330)
Keywords: generation
Abstract: Counterfactual explanations (CFEs) are a popular approach for interpreting machine learning predictions by identifying minimal feature changes that alter model outputs. However, in real-world settings, users often refine feasibility constraints over time, requiring counterfactual generation to adapt dynamically. Existing methods fail to support such iterative updates, instead recomputing explanations from scratch with each change, an inefficient and rigid approach. We propose User-Guided Incremental Counterfactual Exploration (UGCE), a genetic algorithm-based framework that incrementally updates counterfactuals in response to evolving user constraints. Experimental results across five benchmark datasets demonstrate that UGCE significantly improves computational efficiency while maintaining high-quality solutions compared to a static, non-incremental approach. Our evaluation further shows that UGCE supports stable performance under varying constraint sequences, benefits from an efficient warm-start strategy, and reveals how different constraint types may affect search behavior.
摘要：反事实解释（CFE）是通过识别改变模型输出的最小特征变化来解释机器学习预测的流行方法。但是，在实际设置中，用户通常会随着时间的推移而提高可行性约束，这需要反事实生成动态适应。现有方法无法支持此类迭代更新，而是通过每次更改（一种效率低下且僵化的方法）从头开始重新计算说明。我们提出了用户引导的增量反事实探索（UGCE），这是一种基于遗传算法的框架，该框架会逐步更新反事实，以响应不断发展的用户约束。五个基准数据集的实验结果表明，与静态的，非额外的方法相比，UGCE在维持高质量的解决方案的同时显着提高了计算效率。我们的评估进一步表明，UGCE在不同的约束序列下支持稳定的性能，从有效的温暖启动策略中受益，并揭示了不同的约束类型如何影响搜索行为。

Title: OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Authors: Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21347
Pdf URL: https://arxiv.org/pdf/2505.21347
Copy Paste: [[2505.21347]] OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models(https://arxiv.org/abs/2505.21347)
Keywords: generation
Abstract: Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior -- rejecting even benign prompts -- a phenomenon known as over-refusal that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.
摘要：文本对图像（T2I）模型在从文本输入中生成视觉内容方面取得了显着的成功。尽管已经提出了多种安全一致性策略来防止有害产出，但它们通常会导致过度谨慎的行为 - 甚至拒绝良性提示 - 一种被称为$ \ textit {过度补充} $的现象，可降低T2i模型的实际实用性。尽管在实践中观察到了过度的，但没有大规模的基准系统地评估T2I模型的这种现象。在本文中，我们提出一个自动工作流程来构建合成评估数据，从而导致公开（$ \ textbf {ove} $ r- $ \ $ \ textbf {r} $ efusal评估$ \ textbf {t} $ ext toemigage模型），是第一个大型cale benchmmark，用于评估超级模型的大型模型。公开包括4,600个看似有害的，但在9个与安全有关的类别中的提示以及1,785个真正有害的提示（明显的无安全），以评估安全性权衡权衡。使用公开，我们评估了几种领先的T2I模型，并发现过度充实是各个类别的普遍问题（图1），强调需要进一步研究以增强T2i模型的安全对准，而不会损害他们的这一HTTP URL，而不是预先尝试减少过度反复的尝试，我们探索了及时探索迅速重新处理；但是，我们发现这通常会损害原始提示的含义的忠诚。最后，我们通过生成自定义的评估数据适应用户定义的策略来证明我们一代框架在适应各种安全要求方面的灵活性。

Title: Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility

Authors: Yidi Li, Jun Xiao, Zhengda Lu, Yiqun Wang, Haiyong Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21377
Pdf URL: https://arxiv.org/pdf/2505.21377
Copy Paste: [[2505.21377]] Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility(https://arxiv.org/abs/2505.21377)
Keywords: generation
Abstract: This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gaps between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, guiding vector graphics toward coarse shapes at the initial stages and finer details at later stages. We also improve the view-dependent occlusions by devising a visibility-awareness rendering module. Extensive results on 3D sketches and 3D iconographies demonstrate the superiority of the method in terms of different abstraction levels of detail, cross-view consistency, and occlusion-aware stroke culling.

Title: A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective

Authors: Gen Li, Changxiao Cai
Subjects: cs.LG, cs.IT, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21400
Pdf URL: https://arxiv.org/pdf/2505.21400
Copy Paste: [[2505.21400]] A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective(https://arxiv.org/abs/2505.21400)
Keywords: generation, generative
Abstract: Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models enable parallel token sampling, leading to faster generation and eliminating left-to-right generation constraints. Despite their empirical success, the theoretical understanding of diffusion model approaches remains underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations $T$ and scales linearly with the mutual information between tokens in the target text sequence. In particular, we establish matching upper and lower bounds, up to some constant factor, to demonstrate the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.
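
The rate stated in the abstract above can be written schematically as follows. This is only a sketch of the claimed scaling: the symbols $p_{\mathrm{data}}$, $p_T$, $I$, and $C$ are notation assumed here for illustration, and the abstract does not give the exact definition of the mutual-information term or the constants.

```latex
% Schematic form of the stated guarantee (assumed notation, not the paper's):
%   p_data : distribution of the target text sequence
%   p_T    : distribution of the sample produced after T iterations
%   I      : a mutual-information measure of dependence between tokens
%   C      : an absolute constant
\[
  \mathrm{KL}\left(p_{\mathrm{data}} \,\|\, p_T\right) \;\le\; C \cdot \frac{I}{T}
\]
```

Read this way, doubling the number of iterations $T$ halves the bound, while sequences with stronger dependence between tokens (larger $I$) need proportionally more iterations for the same guarantee. The abstract also states a matching lower bound up to a constant factor.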

Title: Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling

Authors: Xiangxin Zhou, Mingyu Li, Yi Xiao, Jiahan Li, Dongyu Xue, Zaixiang Zheng, Jianzhu Ma, Quanquan Gu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.21452
Pdf URL: https://arxiv.org/pdf/2505.21452
Copy Paste: [[2505.21452]] Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling(https://arxiv.org/abs/2505.21452)
Keywords: generation, generative
Abstract: Cyclic peptides offer inherent advantages in pharmaceuticals. For example, cyclic peptides are more resistant to enzymatic hydrolysis compared to linear peptides and usually exhibit excellent stability and affinity. Although deep generative models have achieved great success in linear peptide design, several challenges prevent the development of computational methods for designing diverse types of cyclic peptides. These challenges include the scarcity of 3D structural data on target proteins and associated cyclic peptide ligands, the geometric constraints that cyclization imposes, and the involvement of non-canonical amino acids in cyclization. To address the above challenges, we introduce CpSDE, which consists of two key components: AtomSDE, a generative structure prediction model based on harmonic SDE, and ResRouter, a residue type predictor. Utilizing a routed sampling algorithm that alternates between these two models to iteratively update sequences and structures, CpSDE facilitates the generation of cyclic peptides. By employing explicit all-atom and bond modeling, CpSDE overcomes existing data limitations and is proficient in designing a wide variety of cyclic peptides. Our experimental results demonstrate that the cyclic peptides designed by our method exhibit reliable stability and affinity.

Title: Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.21472
Pdf URL: https://arxiv.org/pdf/2505.21472
Copy Paste: [[2505.21472]] Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration(https://arxiv.org/abs/2505.21472)
Keywords: generation
Abstract: Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current inference-time interventions, while training-free, struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding based on the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on the CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generation, effectively reducing hallucination.

Title: DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

Authors: Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K. Du, Shu Cheng, Zehuan Yuan, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21473
Pdf URL: https://arxiv.org/pdf/2505.21473
Copy Paste: [[2505.21473]] DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction(https://arxiv.org/abs/2505.21473)
Keywords: generation
Abstract: This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches, i.e., VAR/VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, due to the significantly reduced token count and the parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.

Title: Policy Optimized Text-to-Image Pipeline Design

Authors: Uri Gadot, Rinon Gal, Yftah Ziser, Gal Chechik, Shie Mannor
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21478
Pdf URL: https://arxiv.org/pdf/2505.21478
Copy Paste: [[2505.21478]] Policy Optimized Text-to-Image Pipeline Design(https://arxiv.org/abs/2505.21478)
Keywords: generation
Abstract: Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
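
The extrapolation step mentioned in the abstract above can be illustrated with a minimal sketch in the style of classifier-free guidance. It assumes, purely for illustration, that the initial and GRPO-tuned models each produce a logit vector over the same workflow vocabulary; the function name `extrapolate_logits` and the parameter `guidance_scale` are hypothetical and do not come from the paper.

```python
def extrapolate_logits(initial_logits, tuned_logits, guidance_scale):
    """Move from the initial model's logits toward (and past) the tuned model's.

    guidance_scale = 0 returns the initial logits, 1 returns the tuned logits,
    and values above 1 extrapolate beyond the tuned model along the same direction.
    Assumption: both models score the same vocabulary, position by position.
    """
    if len(initial_logits) != len(tuned_logits):
        raise ValueError("both logit vectors must have the same length")
    return [
        base + guidance_scale * (tuned - base)
        for base, tuned in zip(initial_logits, tuned_logits)
    ]


if __name__ == "__main__":
    initial = [0.0, 1.0]
    tuned = [1.0, 3.0]
    # A scale of 2 steps twice as far as the tuned model did from the initial one.
    print(extrapolate_logits(initial, tuned, 2.0))
```

Whether the paper applies this at the logit level or elsewhere is not stated in the abstract, so the sketch should be read only as the general shape of such an extrapolation.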

Title: MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

Authors: Kerui Ren, Jiayang Bai, Linning Xu, Lihan Jiang, Jiangmiao Pang, Mulin Yu, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21483
Pdf URL: https://arxiv.org/pdf/2505.21483
Copy Paste: [[2505.21483]] MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation(https://arxiv.org/abs/2505.21483)
Keywords: generation
Abstract: Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden the applicability of object compositing, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework's robustness and wide generalization.

Title: Be Decisive: Noise-Induced Layouts for Multi-Subject Generation

Authors: Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21488
Pdf URL: https://arxiv.org/pdf/2505.21488
Copy Paste: [[2505.21488]] Be Decisive: Noise-Induced Layouts for Multi-Subject Generation(https://arxiv.org/abs/2505.21488)
Keywords: generation
Abstract: Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.

Title: Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Authors: Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21491
Pdf URL: https://arxiv.org/pdf/2505.21491
Copy Paste: [[2505.21491]] Frame In-N-Out: Unbounded Controllable Image-to-Video Generation(https://arxiv.org/abs/2505.21491)
Keywords: generation
Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control the objects in the image to naturally leave the scene, or provide brand-new identity references to enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.

Title: Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Authors: Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21494
Pdf URL: https://arxiv.org/pdf/2505.21494
Copy Paste: [[2505.21494]] Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment(https://arxiv.org/abs/2505.21494)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features, such as CLIP's [CLS] token, between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose a targeted transferable adversarial attack method based on feature optimal alignment, called FOA-Attack, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we leverage clustering techniques to extract compact local patterns and alleviate redundant local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy to adaptively balance the influence of multiple models during adversarial example generation, thereby further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, which outperforms state-of-the-art methods, especially in transferring to closed-source MLLMs. The code has been released.
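
The global-level term described in the abstract above, a loss based on cosine similarity between adversarial and target features, can be sketched as follows. The exact form used by FOA-Attack is not given in the abstract; the `1 - cosine` form, the function names, and the use of plain Python lists as feature vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def global_feature_loss(adv_feature, target_feature):
    """Assumed loss form: 0 when the features point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(adv_feature, target_feature)


if __name__ == "__main__":
    # Hypothetical global features of an adversarial and a target sample.
    adv = [1.0, 0.0]
    target = [0.0, 1.0]
    print(global_feature_loss(adv, target))
```

Minimizing such a loss with respect to the adversarial input pulls its global feature toward the direction of the target feature, independent of feature magnitude.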

Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
Subjects: cs.CV, cs.AI, cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2505.21497
Pdf URL: https://arxiv.org/pdf/2505.21497
Copy Paste: [[2505.21497]] Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers(https://arxiv.org/abs/2505.21497)
Keywords: generation
Abstract: Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: (a) the Parser distills the paper into a structured asset library; (b) the Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and (c) the Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are publicly available.