2025-03-11

Title: What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces

Authors: Jordi Armengol-Estapé, Quentin Carbonneaux, Tianjun Zhang, Aram H. Markosyan, Volker Seeker, Chris Cummins, Melanie Kambadur, Michael F.P. O'Boyle, Sida Wang, Gabriel Synnaeve, Hugh James Leather
Subjects: cs.LG, cs.AI, cs.PL
Abstract URL: https://arxiv.org/abs/2503.05703
Pdf URL: https://arxiv.org/pdf/2503.05703
Copy Paste: [[2503.05703]] What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces(https://arxiv.org/abs/2503.05703)
Keywords: generation
Abstract: Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line and instruction-level) and strategies on the task of output prediction, obtaining around 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.
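Summary: Code generation and understanding are key capabilities of large language models (LLMs), so most LLMs are pretrained and fine-tuned on code data. These datasets, however, usually treat code as static strings and rarely exploit dynamic information about its execution. Building on earlier work on trace modeling, we study Execution Tuning (E.T.), a training procedure that explicitly models real-world program execution traces without requiring manual test annotations. We train and evaluate models at different trace granularities (line level and instruction level) and with different strategies on the output prediction task, reaching about 80% accuracy on CruxEval and MBPP, and we show the advantage of dynamic scratchpads (self-contained intermediate computations that the model updates rather than accumulates as a history of past computations) on long executions of up to 14k steps. Finally, we discuss practical applications of E.T.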
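
To make the notion of a line-level execution trace concrete, here is a minimal Python sketch that records the local variables seen before each executed line of a function, using the standard `sys.settrace` hook. It is an illustration only: the helper `trace_lines`, the example function `running_sum`, and the trace format are hypothetical and are not taken from the paper. Keeping the whole `history` list loosely corresponds to accumulating a trace, while keeping only the most recent state loosely corresponds to the dynamic scratchpad the abstract describes.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record (relative line number, locals) before each line."""
    history = []

    def tracer(frame, event, arg):
        # Only follow the frame of the function under study.
        if frame.f_code is not func.__code__:
            return None
        if event == "line":
            rel_line = frame.f_lineno - func.__code__.co_firstlineno
            history.append((rel_line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous hook
    return result, history


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, history = trace_lines(running_sum, 4)
scratchpad = history[-1]  # latest state only, instead of the full history
print("output:", result)
print("steps recorded:", len(history))
print("scratchpad:", scratchpad)
```

The full history grows with the number of executed steps, whereas the scratchpad stays the size of one state, which is one way to see why the distinction matters for long executions.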

Title: Evaluation of Missing Data Imputation for Time Series Without Ground Truth

Authors: Rania Farjallah, Bassant Selim, Brigitte Jaumard, Samr Ali, Georges Kaddoum
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.05775
Pdf URL: https://arxiv.org/pdf/2503.05775
Copy Paste: [[2503.05775]] Evaluation of Missing Data Imputation for Time Series Without Ground Truth(https://arxiv.org/abs/2503.05775)
Keywords: generation
Abstract: The challenge of handling missing data in time series is critical for maintaining the accuracy and reliability of machine learning (ML) models in applications like fifth generation mobile communication (5G) network management. Traditional methods for validating imputation rely on ground truth data, which is inherently unavailable. This paper addresses this limitation by introducing two statistical metrics, the Wasserstein distance (WD) and the Jensen-Shannon divergence (JSD), to evaluate imputation quality without requiring ground truth. These metrics assess the alignment between the distributions of imputed and original data, providing a robust method for evaluating imputation performance based on internal structure and data consistency. We apply and test these metrics across several imputation techniques. Results demonstrate that WD and JSD are effective metrics for assessing the quality of missing data imputation, particularly in scenarios where ground truth data is unavailable.
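Summary: Handling missing data in time series is critical for maintaining the accuracy and reliability of machine learning (ML) models in applications such as fifth-generation mobile communication (5G) network management. Traditional methods for validating imputation depend on ground-truth data, which is inherently unavailable. This paper addresses that limitation by introducing two statistical metrics, the Wasserstein distance (WD) and the Jensen-Shannon divergence (JSD), to evaluate imputation quality without ground truth. The metrics assess how well the distributions of imputed and original data align, giving a robust way to evaluate imputation performance based on internal structure and data consistency. We apply the metrics to several imputation techniques. The results show that WD and JSD are effective metrics for assessing the quality of missing-data imputation, especially when ground-truth data is unavailable.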
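
As a rough illustration of comparing the distribution of imputed values with that of the observed values, the following Python sketch computes an empirical 1-D Wasserstein distance and a histogram-based Jensen-Shannon divergence using only the standard library. The function names, the bin count, and the toy data are assumptions made for this example; the abstract does not say how the paper computes either metric.

```python
import math
from bisect import bisect_right


def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance: integral of |F_a(x) - F_b(x)| dx."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def histogram(values, lo, hi, bins):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant data
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(a, b, bins=10):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) over shared bins."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    p, q = histogram(a, lo, hi, bins), histogram(b, lo, hi, bins)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy data (hypothetical): observed values and two candidate imputations.
observed = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
mean_imputed = [2.9, 2.9, 2.9]
spread_imputed = [1.5, 3.0, 4.5]
for name, imputed in [("mean", mean_imputed), ("spread", spread_imputed)]:
    print(name, wasserstein_1d(observed, imputed), js_divergence(observed, imputed))
```

Lower values mean the imputed values are distributed more like the observed ones; neither score says anything about whether individual imputed points are correct.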

Title: Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA

Authors: Nils Graef, Andrew Wasielewski
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05840
Pdf URL: https://arxiv.org/pdf/2503.05840
Copy Paste: [[2503.05840]] Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA(https://arxiv.org/abs/2503.05840)
Keywords: generation
Abstract: Slim attention shrinks the context memory size by 2x for transformer models with MHA (multi-head attention), which can speed up inference by up to 2x for large context windows. Slim attention is an exact, mathematically identical implementation of the standard attention mechanism and therefore does not compromise model accuracy. In other words, slim attention losslessly compresses the context memory by a factor of 2. For encoder-decoder transformers, the context memory size can be reduced even further: For the Whisper models for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x for batch size 64 for example. And for rare cases where the MHA projection dimension is larger than the embedding dimension, the memory can be reduced by a factor of 32 for the T5-11B model for example. See this https URL for code and more transformer tricks, and this https URL for a video about this paper.
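Summary: Slim attention shrinks the context memory by 2x for transformer models with multi-head attention (MHA), which can speed up inference by up to 2x for large context windows. Slim attention is an exact, mathematically identical implementation of the standard attention mechanism, so it does not compromise model accuracy. In other words, slim attention losslessly compresses the context memory by a factor of 2. For encoder-decoder transformers, the context memory can be reduced further: for the Whisper models, for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x at batch size 64. In the rare cases where the MHA projection dimension is larger than the embedding dimension, for example the T5-11B model, the memory can be reduced by a factor of 32. See this https URL for code and more transformer tricks, and this https URL for a video about the paper.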
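
The abstract does not spell out the mechanism, so the sketch below is an assumption-based illustration of why caching only K can be lossless: if K = X W_K and V = X W_V with a square, invertible W_K, then V = K (W_K^{-1} W_V), so V can be recomputed from the cached K. The tiny 2x2 matrices and the helper names `matmul` and `inverse_2x2` are hypothetical and chosen only to keep the example dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("W_K must be invertible for this identity to hold")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical inputs: three tokens with embedding dimension 2.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W_K = [[2.0, 1.0], [1.0, 1.0]]
W_V = [[0.0, 1.0], [-1.0, 2.0]]

K = matmul(X, W_K)                      # what the K-cache would store
V_standard = matmul(X, W_V)             # what a V-cache would store
W_KV = matmul(inverse_2x2(W_K), W_V)    # precomputed once, offline
V_from_K = matmul(K, W_KV)              # V recovered from the K-cache alone

print(V_standard)
print(V_from_K)
```

Storing K and recomputing V trades memory for extra matrix multiplications, which is consistent with the abstract's claim of halving context memory without changing the attention output.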

Title: Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records

Authors: Ekaterina Redekop, Zichen Wang, Rushikesh Kulkarni, Mara Pleasure, Aaron Chin, Hamid Reza Hassanzadeh, Brian L. Hill, Melika Emami, William Speier, Corey W. Arnold
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05893
Pdf URL: https://arxiv.org/pdf/2503.05893
Copy Paste: [[2503.05893]] Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records(https://arxiv.org/abs/2503.05893)
Keywords: generative
Abstract: Longitudinal data in electronic health records (EHRs) represent an individual's clinical history through a sequence of codified concepts, including diagnoses, procedures, medications, and laboratory tests. Foundational models, such as generative pre-trained transformers (GPT), can leverage this data to predict future events. While fine-tuning of these models enhances task-specific performance, it is costly, complex, and unsustainable for every target. We show that a foundation model trained on EHRs can perform predictive tasks in a zero-shot manner, eliminating the need for fine-tuning. This study presents the first comprehensive analysis of zero-shot forecasting with GPT-based foundational models in EHRs, introducing a novel pipeline that formulates medical concept prediction as a generative modeling task. Unlike supervised approaches requiring extensive labeled data, our method enables the model to forecast the next medical event purely from pretraining knowledge. We evaluate performance across multiple time horizons and clinical categories, demonstrating the model's ability to capture latent temporal dependencies and complex patient trajectories without task supervision. Model performance for predicting the next medical concept was evaluated using precision and recall metrics, achieving an average top-1 precision of 0.614 and recall of 0.524. For 12 major diagnostic conditions, the model demonstrated strong zero-shot performance, achieving high true positive rates while maintaining low false positives. We demonstrate the power of a foundational EHR GPT model in capturing diverse phenotypes and enabling robust, zero-shot forecasting of clinical outcomes. This capability enhances the versatility of predictive healthcare models and reduces the need for task-specific training, enabling more scalable applications in clinical settings.
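Summary: Longitudinal data in electronic health records (EHRs) represent an individual's clinical history as a sequence of codified concepts, including diagnoses, procedures, medications, and laboratory tests. Foundation models such as generative pre-trained transformers (GPT) can use these data to predict future events. Although fine-tuning such models improves task-specific performance, it is costly, complex, and unsustainable for every target. We show that a foundation model trained on EHRs can perform predictive tasks in a zero-shot manner, removing the need for fine-tuning. This study presents the first comprehensive analysis of zero-shot forecasting with GPT-based foundation models on EHRs and introduces a novel pipeline that formulates medical concept prediction as a generative modeling task. Unlike supervised approaches that require extensive labeled data, our method lets the model forecast the next medical event purely from pretraining knowledge. We evaluate performance across multiple time horizons and clinical categories, showing that the model captures latent temporal dependencies and complex patient trajectories without task supervision. Performance in predicting the next medical concept was evaluated with precision and recall, reaching an average top-1 precision of 0.614 and recall of 0.524. For 12 major diagnostic conditions, the model showed strong zero-shot performance, achieving high true positive rates while keeping false positives low. We demonstrate the power of a foundational EHR GPT model in capturing diverse phenotypes and enabling robust zero-shot forecasting of clinical outcomes. This capability makes predictive healthcare models more versatile and reduces the need for task-specific training, enabling more scalable applications in clinical settings.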

Title: A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond

Authors: Mihaela Cătălina Stoian, Eleonora Giunchiglia, Thomas Lukasiewicz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05954
Pdf URL: https://arxiv.org/pdf/2503.05954
Copy Paste: [[2503.05954]] A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond(https://arxiv.org/abs/2503.05954)
Keywords: generation, generative
Abstract: Generative modelling has become the standard approach for synthesising tabular data. However, different use cases demand synthetic data to comply with different requirements to be useful in practice. In this survey, we review deep generative modelling approaches for tabular data from the perspective of four types of requirements: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We group the approaches along two levels of granularity: (i) based on the primary type of requirements they address and (ii) according to the underlying model they utilise. Additionally, we summarise the appropriate evaluation methods for each requirement and the specific characteristics of each model type. Finally, we discuss future directions for the field, along with opportunities to improve the current evaluation methods. Overall, this survey can be seen as a user guide to tabular data generation: helping readers navigate available models and evaluation methods to find those best suited to their needs.
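Summary: Generative modelling has become the standard approach for synthesising tabular data. Different use cases, however, require synthetic data to meet different requirements before it is useful in practice. In this survey, we review deep generative modelling approaches for tabular data from the perspective of four types of requirements: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution relative to the real data distribution, and privacy-preserving capabilities. We group the approaches at two levels of granularity: (i) by the primary type of requirement they address and (ii) by the underlying model they use. We also summarise the appropriate evaluation methods for each requirement and the specific characteristics of each model type. Finally, we discuss future directions for the field and opportunities to improve current evaluation methods. Overall, this survey can be read as a user guide to tabular data generation that helps readers navigate the available models and evaluation methods to find those best suited to their needs.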

Title: Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

Authors: Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Subjects: cs.LG, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2503.05965
Pdf URL: https://arxiv.org/pdf/2503.05965
Copy Paste: [[2503.05965]] Validating LLM-as-a-Judge Systems in the Absence of Gold Labels(https://arxiv.org/abs/2503.05965)
Keywords: generative
Abstract: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.
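Summary: The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus and then aggregate them into a single per-item gold label. High agreement between these gold labels and the judge system's ratings is then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or human raters may disagree on principled grounds. In such settings, gold labels may not exist for many items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis that connects different measures of judge system performance under different rating elicitation and aggregation schemes. We also show empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by the alternative approaches we describe. Based on our findings, we give concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.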
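
The conventional validation procedure described in the abstract (collect several human ratings per item, aggregate them into one gold label, then measure agreement with the judge) can be sketched as follows. Majority vote is assumed here as the aggregation rule, and the function names and toy ratings are hypothetical; the abstract does not name a specific aggregation scheme.

```python
from collections import Counter


def majority_label(ratings):
    """Aggregate one item's human ratings into a single gold label by majority vote."""
    # Ties resolve to the label encountered first; a real study needs an explicit tie rule.
    return Counter(ratings).most_common(1)[0][0]


def agreement_rate(human_ratings, judge_ratings):
    """Fraction of items where the judge's rating equals the aggregated gold label."""
    gold = [majority_label(ratings) for ratings in human_ratings]
    matches = sum(g == j for g, j in zip(gold, judge_ratings))
    return matches / len(gold)


# Toy data (hypothetical): three human ratings per item, one judge rating per item.
human_ratings = [
    ["good", "good", "bad"],
    ["bad", "bad", "bad"],
    ["good", "bad", "bad"],
    ["good", "good", "good"],
]
judge_ratings = ["good", "bad", "good", "good"]

rate = agreement_rate(human_ratings, judge_ratings)
print(f"agreement with majority-vote gold labels: {rate:.2f}")
```

In this toy data the first and third items have split human ratings, yet majority vote collapses each to a single label; that loss of information about disagreement is the situation the abstract argues calls for a different validation framework.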

Title: Generative Multi-Agent Q-Learning for Policy Optimization: Decentralized Wireless Networks

Authors: Talha Bozkus, Urbashi Mitra
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.05970
Pdf URL: https://arxiv.org/pdf/2503.05970
Copy Paste: [[2503.05970]] Generative Multi-Agent Q-Learning for Policy Optimization: Decentralized Wireless Networks(https://arxiv.org/abs/2503.05970)
Keywords: generative
Abstract: Q-learning is a widely used reinforcement learning (RL) algorithm for optimizing wireless networks, but faces challenges with large state spaces. The recently proposed multi-environment mixed Q-learning (MEMQ) algorithm addresses these challenges by employing multiple Q-learning algorithms across multiple synthetically generated, distinct but structurally related environments, so-called digital cousins. In this paper, we propose a novel multi-agent MEMQ (M-MEMQ) for cooperative decentralized wireless networks with multiple networked transmitters (TXs) and base stations (BSs). TXs do not have access to global information (joint state and actions). The new concept of coordinated and uncoordinated states is introduced. In uncoordinated states, TXs act independently to minimize their individual costs and update local Q-functions. In coordinated states, TXs use a Bayesian approach to estimate the joint state and update the joint Q-functions. The cost of information-sharing scales linearly with the number of TXs and is independent of the joint state-action space size. Several theoretical guarantees, including deterministic and probabilistic convergence, bounds on estimation error variance, and the probability of misdetecting the joint states, are given. Numerical simulations show that M-MEMQ outperforms several decentralized and centralized training with decentralized execution (CTDE) multi-agent RL algorithms by achieving 55% lower average policy error (APE), 35% faster convergence, 50% reduced runtime complexity, and 45% less sample complexity. Furthermore, M-MEMQ achieves comparable APE with significantly lower complexity than centralized methods. Simulations validate the theoretical analyses.
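Summary: Q-learning is a widely used reinforcement learning (RL) algorithm for optimizing wireless networks, but it struggles with large state spaces. The recently proposed multi-environment mixed Q-learning (MEMQ) algorithm addresses this by running multiple Q-learning algorithms across multiple synthetically generated, distinct but structurally related environments, so-called digital cousins. In this paper, we propose a novel multi-agent MEMQ (M-MEMQ) for cooperative decentralized wireless networks with multiple networked transmitters (TXs) and base stations (BSs). TXs do not have access to global information (joint state and actions). We introduce the new concept of coordinated and uncoordinated states. In uncoordinated states, TXs act independently to minimize their individual costs and update local Q-functions. In coordinated states, TXs use a Bayesian approach to estimate the joint state and update the joint Q-functions. The cost of information sharing scales linearly with the number of TXs and is independent of the size of the joint state-action space. We give several theoretical guarantees, including deterministic and probabilistic convergence, bounds on the estimation error variance, and the probability of misdetecting the joint states. Numerical simulations show that M-MEMQ outperforms several decentralized and centralized training with decentralized execution (CTDE) multi-agent RL algorithms, achieving 55% lower average policy error (APE), 35% faster convergence, 50% lower runtime complexity, and 45% lower sample complexity. M-MEMQ also achieves APE comparable to centralized methods at significantly lower complexity. Simulations validate the theoretical analyses.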
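
For readers unfamiliar with the building block that MEMQ and M-MEMQ extend, the sketch below shows plain single-agent tabular Q-learning in the cost-minimizing form used in the abstract (agents minimize costs rather than maximize rewards). It is not M-MEMQ: there are no multiple environments, no coordination, and no Bayesian state estimation. The two-state environment `toy_step` and all hyperparameters are hypothetical.

```python
import random


def q_learning(step, n_states, n_actions, episodes=200, horizon=20,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning that minimizes expected discounted cost."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = rng.randrange(n_states)
        for _ in range(horizon):
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)        # explore
            else:
                action = min(range(n_actions), key=lambda a: q[state][a])  # exploit
            next_state, cost = step(state, action)
            # Bellman target with min instead of max because we minimize cost.
            target = cost + gamma * min(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def toy_step(state, action):
    """Hypothetical environment: action 0 stays, action 1 flips; landing in state 1 costs 1."""
    next_state = state if action == 0 else 1 - state
    cost = 1.0 if next_state == 1 else 0.0
    return next_state, cost


q_table = q_learning(toy_step, n_states=2, n_actions=2)
policy = [min(range(2), key=lambda a: q_table[s][a]) for s in range(2)]
print("greedy policy per state:", policy)
```

The Q-table has one entry per state-action pair, which is why the abstract singles out large state spaces as the difficulty that motivates MEMQ-style methods.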

Title: MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05978
Pdf URL: https://arxiv.org/pdf/2503.05978
Copy Paste: [[2503.05978]] MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice(https://arxiv.org/abs/2503.05978)
Keywords: generation
Abstract: We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme, integrating audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions to balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is enhanced via our innovative unified step and CFG distillation techniques, achieving a 20x inference speed boost over the base model: generating a 10-second 540x540p video in 10 seconds or 720x720p in 30 seconds on 8 H100 GPUs, without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at this https URL, with examples at this https URL.
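Summary: We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes the limitations of traditional portrait animation and delivers high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, using input masks to designate the speaker precisely in multi-character scenes. Our approach tackles the key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding-window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme that integrates audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions that balance global textual control and local audio guidance, supporting speaker-specific animation. Efficiency is improved through our unified step and CFG distillation techniques, which give a 20x inference speedup over the base model: a 10-second 540x540p video is generated in 10 seconds, or a 720x720p video in 30 seconds, on 8 H100 GPUs without quality loss. Evaluations on our new benchmark show MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at this https URL, with examples at this https URL.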

Title: Learning-Order Autoregressive Models with Application to Molecular Graph Generation

Authors: Zhe Wang, Jiaxin Shi, Nicolas Heess, Arthur Gretton, Michalis K. Titsias
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.05979
Pdf URL: https://arxiv.org/pdf/2503.05979
Copy Paste: [[2503.05979]] Learning-Order Autoregressive Models with Application to Molecular Graph Generation(https://arxiv.org/abs/2503.05979)
Keywords: generation
Abstract: Autoregressive models (ARMs) have become the workhorse for sequence generation tasks, since many problems can be modeled as next-token prediction. While there appears to be a natural ordering for text (i.e., left-to-right), for many data types, such as graphs, the canonical ordering is less obvious. To address this problem, we introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data. This model incorporates a trainable probability distribution, referred to as an order-policy, that dynamically decides the autoregressive order in a state-dependent manner. To train the model, we introduce a variational lower bound on the exact log-likelihood, which we optimize with stochastic gradient estimation. We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation. On the challenging domain of molecular graph generation, we achieve state-of-the-art results on the QM9 and ZINC250k benchmarks, evaluated using the Fréchet ChemNet Distance (FCD).
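Summary: Autoregressive models (ARMs) have become the workhorse for sequence generation tasks, since many problems can be modeled as next-token prediction. Text appears to have a natural ordering (left to right), but for many data types, such as graphs, the canonical ordering is less obvious. To address this, we introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering inferred sequentially from the data. The model incorporates a trainable probability distribution, called an order-policy, that dynamically decides the autoregressive order in a state-dependent manner. To train the model, we introduce a variational lower bound on the exact log-likelihood, which we optimize with stochastic gradient estimation. We show experimentally that our method learns meaningful autoregressive orderings in image and graph generation. On the challenging domain of molecular graph generation, we achieve state-of-the-art results on the QM9 and ZINC250k benchmarks, evaluated using the Fréchet ChemNet Distance (FCD).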

Title: Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models

Authors: Md Azim Khan, Aryya Gangopadhyay, Jianwu Wang, Robert F. Erbacher
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06003
Pdf URL: https://arxiv.org/pdf/2503.06003
Copy Paste: [[2503.06003]] Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models(https://arxiv.org/abs/2503.06003)
Keywords: generation
Abstract: Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to perform efficiently in real environments. This research presents a novel vision language model (VLM) framework that leverages frequency domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT) based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
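Summary: Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. These models often face computational challenges, however, especially when they must run efficiently in real environments. This research presents a novel VLM framework that uses frequency-domain transformations and low-rank adaptation (LoRA) to improve feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT) based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low-visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results show that our model achieves evaluation metrics comparable to state-of-the-art VLMs such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further shows that our model gives more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).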

Title: Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

Authors: Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.06014
Pdf URL: https://arxiv.org/pdf/2503.06014
Copy Paste: [[2503.06014]] Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity(https://arxiv.org/abs/2503.06014)
Keywords: generation
Abstract: Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at this https URL.
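Summary: Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture the full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark that exposes depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without retraining the model. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, enabling more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at this https URL.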
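
To illustrate what a "Laplacian-transformed RGB input" might look like, the sketch below applies a standard 3x3 discrete Laplacian to each channel of a small image in pure Python. The specific kernel, the edge-replication border handling, and the function names are assumptions for this example; the abstract does not state which Laplacian variant LVP uses or how the transformed image is fed to the depth model.

```python
# Common 4-neighbour discrete Laplacian kernel (an assumed choice).
LAPLACIAN_KERNEL = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def laplacian_channel(channel):
    """Apply the 3x3 Laplacian to one 2-D channel, replicating edge pixels at the border."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index to the image
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index to the image
                    acc += LAPLACIAN_KERNEL[dy + 1][dx + 1] * channel[yy][xx]
            out[y][x] = acc
    return out


def laplacian_rgb(image):
    """image is a list of three channels (R, G, B), each a 2-D list of floats."""
    return [laplacian_channel(channel) for channel in image]


# Toy image (hypothetical): a single bright pixel in the red channel, flat G and B.
red = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
flat = [[0.5] * 3 for _ in range(3)]
transformed = laplacian_rgb([red, flat, flat])
for row in transformed[0]:
    print(row)
```

Flat regions map to zero while intensity changes are emphasized, so the transformed input highlights edges and fine structure rather than absolute color.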

Title: DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Authors: Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06053
Pdf URL: https://arxiv.org/pdf/2503.06053
Copy Paste: [[2503.06053]] DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation(https://arxiv.org/abs/2503.06053)
Keywords: generation
Abstract: Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or their basic combination, such as appending a description of a camera movement after a prompt without constraining the outcomes of this movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, and the long-term impact of prior content on subsequent generation. Our research encompasses dataset construction through to the development of the model. Initially, we constructed a DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with an average caption of 206 words, detailing various camera movements and plot developments. Following this, we developed and trained the DropletVideo model, which excels in preserving spatio-temporal coherence during video generation. The DropletVideo dataset and model are accessible at this https URL.
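Summary: Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while keeping objects and scenes visually consistent across varying viewpoints. Prior research, especially in open-source projects, focuses mainly on either temporal or spatial consistency, or on a basic combination of the two, such as appending a description of a camera movement to a prompt without constraining the outcome of that movement. Camera movement, however, may introduce new objects into the scene or remove existing ones, thereby overlaying and affecting the preceding narrative. Especially in videos with many camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, which considers the synergy between plot progression and camera techniques and the long-term impact of prior content on subsequent generation. Our research spans dataset construction through model development. We first constructed the DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words that details the camera movements and plot developments. We then developed and trained the DropletVideo model, which excels at preserving spatio-temporal coherence during video generation. The DropletVideo dataset and model are accessible at this https URL.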

Title: Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records

Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06096
Pdf URL: https://arxiv.org/pdf/2503.06096
Copy Paste: [[2503.06096]] Attention-Based Synthetic Data Generation for Calibration-Enhanced Survival Analysis: A Case Study for Chronic Kidney Disease Using Electronic Health Records(https://arxiv.org/abs/2503.06096)
Keywords: generation
Abstract: Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, hindering advancements in research and clinical applications. Synthetic data presents a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration essential for robust survival analysis. Here, we introduce Masked Clinical Modelling (MCM), an attention-based framework capable of generating high-fidelity synthetic datasets that preserve critical clinical insights, such as hazard ratios, while enhancing survival model calibration. Unlike traditional statistical methods like SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease electronic health records dataset, MCM reduced the general calibration loss over the entire dataset by 15% and the mean calibration loss across 10 clinically stratified subgroups by 9%, outperforming 15 alternative methods. By bridging data accessibility with translational utility, MCM advances the precision of healthcare models, promoting more efficient use of scarce healthcare resources.
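Summary: Access to real-world healthcare data is limited by stringent privacy regulations and data imbalances, which hinders progress in research and clinical applications. Synthetic data is a promising solution, yet existing methods often fail to ensure the realism, utility, and calibration that robust survival analysis requires. Here we introduce Masked Clinical Modelling (MCM), an attention-based framework that generates high-fidelity synthetic datasets which preserve critical clinical insights, such as hazard ratios, while improving survival model calibration. Unlike traditional statistical methods such as SMOTE and machine learning models such as VAEs, MCM supports both standalone dataset synthesis for reproducibility and conditional simulation for targeted augmentation, addressing diverse research needs. Validated on a chronic kidney disease electronic health records dataset, MCM reduced the general calibration loss over the entire dataset by 15% and the mean calibration loss across 10 clinically stratified subgroups by 9%, outperforming 15 alternative methods. By bridging data accessibility and translational utility, MCM improves the precision of healthcare models and promotes more efficient use of scarce healthcare resources.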

Title: Unlocking Pretrained LLMs for Motion-Related Multimodal Generation: A Fine-Tuning Approach to Unify Diffusion and Next-Token Prediction

Authors: Shinichi Tanaka, Zhao Wang, Yoichi Kato, Jun Ohya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06119
Pdf URL: https://arxiv.org/pdf/2503.06119
Copy Paste: [[2503.06119]] Unlocking Pretrained LLMs for Motion-Related Multimodal Generation: A Fine-Tuning Approach to Unify Diffusion and Next-Token Prediction(https://arxiv.org/abs/2503.06119)
Keywords: generation
Abstract: In this paper, we propose a unified framework that leverages a single pretrained LLM for Motion-related Multimodal Generation, referred to as MoMug. MoMug integrates diffusion-based continuous motion generation with the model's inherent autoregressive discrete text prediction capabilities by fine-tuning a pretrained LLM. This enables seamless switching between continuous motion output and discrete text token prediction within a single model architecture, effectively combining the strengths of both diffusion- and LLM-based approaches. Experimental results show that, compared to the most recent LLM-based baseline, MoMug improves FID by 38% and mean accuracy across seven metrics by 16.61% on the text-to-motion task. Additionally, it improves mean accuracy across eight metrics by 8.44% on the text-to-motion task. To the best of our knowledge, this is the first approach to integrate diffusion- and LLM-based generation within a single model for motion-related multimodal tasks while maintaining low training costs. This establishes a foundation for future advancements in motion-related generation, paving the way for high-quality yet cost-efficient motion synthesis.
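Summary: In this paper, we propose a unified framework that uses a single pretrained LLM for Motion-related Multimodal Generation, referred to as MoMug. MoMug integrates diffusion-based continuous motion generation with the model's inherent autoregressive discrete text prediction capabilities by fine-tuning a pretrained LLM. This enables seamless switching between continuous motion output and discrete text token prediction within a single model architecture, effectively combining the strengths of diffusion-based and LLM-based approaches. Experimental results show that, compared with the most recent LLM-based baseline, MoMug improves FID by 38% and mean accuracy across seven metrics by 16.61% on the text-to-motion task. It also improves mean accuracy across eight metrics by 8.44% on the text-to-motion task. To the best of our knowledge, this is the first approach to integrate diffusion-based and LLM-based generation within a single model for motion-related multimodal tasks while keeping training costs low. This lays a foundation for future advances in motion-related generation and paves the way for high-quality yet cost-efficient motion synthesis.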

Title: Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Flexible and Effective Paradigm

Authors: Jiebin Yan, Kangcheng Wu, Junjie Chen, Ziwen Tan, Yuming Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06129
Pdf URL: https://arxiv.org/pdf/2503.06129
Copy Paste: [[2503.06129]] Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Flexible and Effective Paradigm(https://arxiv.org/abs/2503.06129)
Keywords: generation, quality assessment
Abstract: Most existing blind omnidirectional image quality assessment (BOIQA) models rely on viewport generation by modeling user viewing behavior or transforming omnidirectional images (OIs) into varying formats; however, these methods are either computationally expensive or less scalable. To solve these issues, in this paper, we present a flexible and effective paradigm, which is viewport-unaware and can be easily adapted to 2D plane image quality assessment (2D-IQA). Specifically, the proposed BOIQA model includes an adaptive prior-equator sampling module for extracting a patch sequence from the equirectangular projection (ERP) image in a resolution-agnostic manner, a progressive deformation-unaware feature fusion module which is able to capture patch-wise quality degradation in a deformation-immune way, and a local-to-global quality aggregation module to adaptively map local perception to global quality. Extensive experiments across four OIQA databases (including uniformly distorted OIs and non-uniformly distorted OIs) demonstrate that the proposed model achieves competitive performance with low complexity against other state-of-the-art models, and we also verify its adaptive capacity to 2D-IQA.

Title: USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Authors: Xiangxiang Chu, Renda Li, Yong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06132
Pdf URL: https://arxiv.org/pdf/2503.06132
Copy Paste: [[2503.06132]] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding(https://arxiv.org/abs/2503.06132)
Keywords: generation
Abstract: Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at this https URL.

Title: X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Authors: Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06134
Pdf URL: https://arxiv.org/pdf/2503.06134
Copy Paste: [[2503.06134]] X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation(https://arxiv.org/abs/2503.06134)
Keywords: generation
Abstract: Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, there is currently no straightforward and efficient framework for transferring the multimodal comprehension abilities of MLLMs to T2I models so that they can understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely a 100K English corpus and 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a performance degradation of less than 1% while gaining various multimodal understanding abilities, including multilingual to image, image to image, image-text to image, video to image, audio to image, and creative fusion to enhance imagery. Furthermore, it is applicable to LoRA training in the context of image-text to image generation, filling a void in the industry in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, multifunctionality, and transferability of our X2I. The open-source code and checkpoints for X2I can be found at the following link: this https URL.

Title: GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Authors: Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06136
Pdf URL: https://arxiv.org/pdf/2503.06136
Copy Paste: [[2503.06136]] GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation(https://arxiv.org/abs/2503.06136)
Keywords: generation
Abstract: Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

Title: Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model

Authors: Mingxing Li, Rui Wang, Lei Sun, Yancheng Bai, Xiangxiang Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06141
Pdf URL: https://arxiv.org/pdf/2503.06141
Copy Paste: [[2503.06141]] Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model(https://arxiv.org/abs/2503.06141)
Keywords: quality assessment
Abstract: The rapid expansion of mobile internet has resulted in a substantial increase in user-generated content (UGC) images, making the thorough assessment of UGC images both urgent and essential. Recently, multimodal large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA). Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) A single score is inadequate to capture hierarchical human perception. 2) How to use MLLMs to output numerical scores, such as mean opinion scores (MOS), remains an open question. To address these challenges, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA), including 14,715 UGC images, each of which is annotated with 10 fine-grained attributes. These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity) and high level (e.g., composition). In addition, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next-token paradigm can achieve SOTA performance. Furthermore, with the help of chain of thought (CoT) combined with the learnt fine-grained attributes, the proposed method can outperform SOTA methods on five public datasets for IQA and IAA with superior interpretability and show strong zero-shot generalization for video quality assessment (VQA). The code and dataset will be released.

Title: VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models

Authors: Xinan He, Yue Zhou, Bing Fan, Bin Li, Guopu Zhu, Feng Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06142
Pdf URL: https://arxiv.org/pdf/2503.06142
Copy Paste: [[2503.06142]] VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models(https://arxiv.org/abs/2503.06142)
Keywords: generation
Abstract: Faces synthesized by diffusion models (DMs) with high-quality and controllable attributes pose a significant challenge for Deepfake detection. Most state-of-the-art detectors only yield a binary decision and cannot localize forgeries, attribute forgery methods, or analyze the cause of forgeries. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics, and propose a fine-grained analysis triad framework called VLForgery that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis to specific generators. To achieve the above goals, we introduce VLF (Visual Language Forensics), a novel and diverse synthesis face dataset designed to facilitate rich interactions between Visual and Language modalities in MLLMs. Additionally, we propose an extrinsic knowledge-guided description method, termed EkCot, which leverages knowledge from the image generation pipeline to enable MLLMs to quickly capture image content. Furthermore, we introduce a low-level vision comparison pipeline designed to identify differential features between real and fake images that MLLMs can inherently understand. These features are then incorporated into EkCot, enhancing its ability to analyze forgeries in a structured manner, following the sequence of detection, localization, and attribution. Extensive experiments demonstrate that VLForgery outperforms other state-of-the-art forensic approaches in detection accuracy, with additional potential for falsified region localization and attribution analysis.

Title: BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis

Authors: Zixi Kang, Xinghan Wang, Yadong Mu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06151
Pdf URL: https://arxiv.org/pdf/2503.06151
Copy Paste: [[2503.06151]] BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis(https://arxiv.org/abs/2503.06151)
Keywords: generation
Abstract: Human motion generation holds significant promise in fields such as animation, film production, and robotics. However, existing methods often fail to produce physically plausible movements that adhere to biomechanical principles. While recent autoregressive and diffusion models have improved visual quality, they frequently overlook essential biodynamic features, such as muscle activation patterns and joint coordination, leading to motions that either violate physical laws or lack controllability. This paper introduces BioMoDiffuse, a novel biomechanics-aware diffusion framework that addresses these limitations. It features three key innovations: (1) a lightweight biodynamic network that integrates muscle electromyography (EMG) signals and kinematic features with acceleration constraints, (2) a physics-guided diffusion process that incorporates real-time biomechanical verification via modified Euler-Lagrange equations, and (3) a decoupled control mechanism that allows independent regulation of motion speed and semantic context. We also propose a set of comprehensive evaluation protocols that combine traditional metrics (FID, R-precision, etc.) with new biomechanical criteria (smoothness, foot sliding, floating, etc.). Our approach bridges the gap between data-driven motion synthesis and biomechanical authenticity, establishing new benchmarks for physically accurate motion generation.

Title: ROCM: RLHF on consistency models

Authors: Shivanshu Shekhar, Tong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06171
Pdf URL: https://arxiv.org/pdf/2503.06171
Copy Paste: [[2503.06171]] ROCM: RLHF on consistency models(https://arxiv.org/abs/2503.06171)
Keywords: generation, generative
Abstract: Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further exacerbated when incorporating Reinforcement Learning from Human Feedback (RLHF) due to sparse rewards and long time horizons. Consistency models address these issues by enabling single-step or efficient multi-step generation, significantly reducing computational costs. In this work, we propose a direct reward optimization framework for applying RLHF to consistency models, incorporating distributional regularization to enhance training stability and prevent reward hacking. We investigate various f-divergences as regularization strategies, striking a balance between reward maximization and model consistency. Unlike policy gradient methods, our approach leverages first-order gradients, making it more efficient and less sensitive to hyperparameter tuning. Empirical results show that our method achieves competitive or superior performance compared to policy-gradient-based RLHF methods, across various automatic metrics and human evaluation. Additionally, our analysis demonstrates the impact of different regularization techniques in improving model generalization and preventing overfitting.

Title: Removing Multiple Hybrid Adverse Weather in Video via a Unified Model

Authors: Yecong Wan, Mingwen Shao, Yuanshuo Cheng, Jun Shu, Shuigen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06200
Pdf URL: https://arxiv.org/pdf/2503.06200
Copy Paste: [[2503.06200]] Removing Multiple Hybrid Adverse Weather in Video via a Unified Model(https://arxiv.org/abs/2503.06200)
Keywords: restoration
Abstract: Videos captured under real-world adverse weather conditions typically suffer from uncertain hybrid weather artifacts with heterogeneous degradation distributions. However, existing algorithms only excel at specific single degradation distributions due to limited adaptation capacity and have to handle different weather degradations with separately trained models, and thus may fail to handle real-world stochastic weather scenarios. Moreover, model training is infeasible due to the lack of paired video data characterizing the coexistence of multiple weather conditions. To ameliorate these issues, we propose a novel unified model, dubbed UniWRV, to remove multiple heterogeneous video weather degradations in an all-in-one fashion. Specifically, to tackle degenerate spatial feature heterogeneity, we propose a tailored weather prior guided module that queries exclusive priors for different instances as prompts to steer spatial feature characterization. To tackle degenerate temporal feature heterogeneity, we propose a dynamic routing aggregation module that can automatically select optimal fusion paths for different instances to dynamically integrate temporal features. Additionally, we construct a new synthetic video dataset, termed HWVideo, for learning and benchmarking multiple hybrid adverse weather removal, which contains 15 hybrid weather conditions with a total of 1500 adverse-weather/clean paired video clips. Real-world hybrid weather videos are also collected for evaluating model generalizability. Comprehensive experiments demonstrate that our UniWRV exhibits robust and superior adaptation capability in multiple heterogeneous degradation learning scenarios, including various generic video restoration tasks beyond weather removal.

Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling

Authors: Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.06201
Pdf URL: https://arxiv.org/pdf/2503.06201
Copy Paste: [[2503.06201]] Explainable Synthetic Image Detection through Diffusion Timestep Ensembling(https://arxiv.org/abs/2503.06201)
Keywords: generation
Abstract: Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we reveal that natural and synthetic images exhibit distinct differences in the high-frequency domains of their Fourier power spectra after undergoing iterative noise perturbations through an inverse multi-step denoising process, suggesting that such noise can provide additional discriminative information for identifying synthetic images. Based on this observation, we propose a novel detection method that amplifies these differences by progressively adding noise to the original images across multiple timesteps, and train an ensemble of classifiers on these noised images. To enhance human comprehension, we introduce an explanation generation and refinement module to identify flaws located in AI-generated images. Additionally, we construct two new datasets, GenHard and GenExplain, derived from the GenImage benchmark, providing detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and harder samples, respectively, improvements of at least 2.51% and 3.46% over baselines. Furthermore, our method generalizes effectively to images generated by other diffusion models. Our code and datasets will be made publicly available.

Title: GraphGen+: Advancing Distributed Subgraph Generation and Graph Learning On Industrial Graphs

Authors: Yue Jin, Yongchao Liu, Chuntao Hong
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2503.06212
Pdf URL: https://arxiv.org/pdf/2503.06212
Copy Paste: [[2503.06212]] GraphGen+: Advancing Distributed Subgraph Generation and Graph Learning On Industrial Graphs(https://arxiv.org/abs/2503.06212)
Keywords: generation
Abstract: Graph-based computations are crucial in a wide range of applications, where graphs can scale to trillions of edges. To enable efficient training on such large graphs, mini-batch subgraph sampling is commonly used, which allows training without loading the entire graph into memory. However, existing solutions face significant trade-offs: online subgraph generation, as seen in frameworks like DGL and PyG, is limited to a single machine, resulting in severe performance bottlenecks, while offline precomputed subgraphs, as in GraphGen, improve sampling efficiency but introduce large storage overhead and high I/O costs during training. To address these challenges, we propose GraphGen+, an integrated framework that synchronizes distributed subgraph generation with in-memory graph learning, eliminating the need for external storage while significantly improving efficiency. GraphGen+ achieves a 27× speedup in subgraph generation compared to conventional SQL-like methods and a 1.3× speedup over GraphGen, supporting training on 1 million nodes per iteration and removing the overhead associated with precomputed subgraphs, making it a scalable and practical solution for industry-scale graph learning.

Title: WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models

Authors: Aditya Shankar, Lydia Y. Chen, Arie van Deursen, Rihan Hai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06231
Pdf URL: https://arxiv.org/pdf/2503.06231
Copy Paste: [[2503.06231]] WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models(https://arxiv.org/abs/2503.06231)
Keywords: generation
Abstract: Generating temporal data under constraints is critical for forecasting, imputation, and synthesis. These datasets often include auxiliary conditions that influence the values within the time series signal. Existing methods face three key challenges: (1) they fail to adapt to conditions at inference time; (2) they rely on sequential generation, which slows the generation speed; and (3) they inefficiently encode categorical features, leading to increased sparsity and input sizes. We propose WaveStitch, a novel method that addresses these challenges by leveraging denoising diffusion probabilistic models to efficiently generate accurate temporal data under given auxiliary constraints. WaveStitch overcomes these limitations by: (1) modeling interactions between constraints and signals to generalize to new, unseen conditions; (2) enabling the parallel synthesis of sequential segments with a novel "stitching" mechanism to enforce coherence across segments; and (3) encoding categorical features as compact periodic signals while preserving temporal patterns. Extensive evaluations across diverse datasets highlight WaveStitch's ability to generalize to unseen conditions during inference, achieving up to a 10x lower mean-squared-error compared to the state-of-the-art methods. Moreover, WaveStitch generates data up to 460x faster than autoregressive methods while maintaining comparable accuracy. By efficiently encoding categorical features, WaveStitch provides a robust and efficient solution for temporal data generation. Our code is open-sourced: this https URL

Title: Single Domain Generalization with Adversarial Memory

Authors: Hao Yan, Marzi Heidari, Yuhong Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06288
Pdf URL: https://arxiv.org/pdf/2503.06288
Copy Paste: [[2503.06288]] Single Domain Generalization with Adversarial Memory(https://arxiv.org/abs/2503.06288)
Keywords: generation
Abstract: Domain Generalization (DG) aims to train models that can generalize to unseen testing domains by leveraging data from multiple training domains. However, traditional DG methods rely on the availability of multiple diverse training domains, limiting their applicability in data-constrained scenarios. Single Domain Generalization (SDG) addresses the more realistic and challenging setting by restricting the training data to a single domain distribution. The main challenges in SDG stem from the limited diversity of training data and the inaccessibility of unseen testing data distributions. To tackle these challenges, we propose a single domain generalization method that leverages an adversarial memory bank to augment training features. Our memory-based feature augmentation network maps both training and testing features into an invariant subspace spanned by diverse memory features, implicitly aligning the training and testing domains in the projected space. To maintain a diverse and representative feature memory bank, we introduce an adversarial feature generation method that creates features extending beyond the training domain distribution. Experimental results demonstrate that our approach achieves state-of-the-art performance on standard single domain generalization benchmarks.

Title: Text2Story: Advancing Video Storytelling with Text Guidance

Authors: Taewon Kang, Divya Kothandaraman, Ming C. Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06310
Pdf URL: https://arxiv.org/pdf/2503.06310
Copy Paste: [[2503.06310]] Text2Story: Advancing Video Storytelling with Text Guidance(https://arxiv.org/abs/2503.06310)
Keywords: generation
Abstract: Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and difficult, owing to the challenges of maintaining temporal coherency and preserving semantic meaning and action continuity across the video. We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. We present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. Further, our method extends the Black-Scholes algorithm from prompt mixing for image generation to video generation, enabling controlled motion evolution through structured text conditioning. To further enhance motion continuity, we propose a semantic action representation framework to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity and ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. This integrative approach prevents abrupt transitions while ensuring fluid storytelling. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.

Title: Pretraining Generative Flow Networks with Inexpensive Rewards for Molecular Graph Generation

Authors: Mohit Pandey, Gopeshh Subbaraj, Artem Cherkasov, Emmanuel Bengio
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06337
Pdf URL: https://arxiv.org/pdf/2503.06337
Copy Paste: [[2503.06337]] Pretraining Generative Flow Networks with Inexpensive Rewards for Molecular Graph Generation(https://arxiv.org/abs/2503.06337)
Keywords: generation, generative
Abstract: Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. We pretrain A-GFN on a subset of the ZINC dataset, and by employing robust evaluation metrics we show the effectiveness of our approach compared to other relevant baseline methods on a wide range of drug design tasks.

Title: Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning

Authors: Gaurav Patel, Qiang Qiu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06339
Pdf URL: https://arxiv.org/pdf/2503.06339
Copy Paste: [[2503.06339]] Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning(https://arxiv.org/abs/2503.06339)
Keywords: generative
Abstract: Machine Unlearning has recently garnered significant attention, aiming to selectively remove knowledge associated with specific data while preserving the model's performance on the remaining data. A fundamental challenge in this process is balancing effective unlearning with knowledge retention, as naive optimization of these competing objectives can lead to conflicting gradients, hindering convergence and degrading overall performance. To address this issue, we propose Learning to Unlearn while Retaining, which aims to mitigate gradient conflicts between unlearning and retention objectives. Our approach strategically avoids conflicts through an implicit gradient regularization mechanism that emerges naturally within the proposed framework. This prevents conflicting gradients between unlearning and retention, leading to effective unlearning while preserving the model's utility. We validate our approach across both discriminative and generative tasks, demonstrating its effectiveness in achieving unlearning without compromising performance on remaining data. Our results highlight the advantages of avoiding such gradient conflicts, outperforming existing methods that fail to account for these interactions.

Title: Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

Authors: Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Prakash Panangaden, Christopher G. Lucas, David Abel, Stefano V. Albrecht
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06343
Pdf URL: https://arxiv.org/pdf/2503.06343
Copy Paste: [[2503.06343]] Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning(https://arxiv.org/abs/2503.06343)
Keywords: generation
Abstract: Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for the actor and for the critic in on-policy algorithms. We focus our study on understanding whether the actor and critic will benefit from separate, rather than shared, representations. Our primary finding is that when separated, the representations for the actor and critic systematically specialise in extracting different types of information from the environment -- the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. We conduct a rigorous empirical study to understand how different representation learning approaches affect the actor and critic's specialisations and their downstream performance, in terms of sample efficiency and generation capabilities. Finally, we discover that a separated critic plays an important role in exploration and data collection during training. Our code, trained models and data are accessible at this https URL.

Title: GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks

Authors: Xiao Yue, Guangzhi Qu, Lige Gan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06352
Pdf URL: https://arxiv.org/pdf/2503.06352
Copy Paste: [[2503.06352]] GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks(https://arxiv.org/abs/2503.06352)
Keywords: generative
Abstract: One significant challenge in applying graph neural networks (GNNs) in real-life scenarios is that they are always treated as black boxes, which creates a need for interpretability. Model-level interpretations explain which patterns maximize the probability of predicting a certain class. However, existing model-level interpretation methods have several limitations, such as generating invalid explanation graphs and requiring extreme manual fine-tuning of hyperparameters. In this paper, we propose a new Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks (GIN-Graph) to generate reliable model-level explanation graphs. Implicit, likelihood-free generative adversarial networks are used to construct explanation graphs similar to the original graphs while maximizing the prediction probability for a certain class through a novel objective function. Experimental results indicate that GIN-Graph can be easily applied to GNN models trained on a variety of graph datasets to create meaningful explanation graphs without requiring extensive fine-tuning of hyperparameters.

Title: Generative Video Bi-flow

Authors: Chen Liu, Tobias Ritschel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06364
Pdf URL: https://arxiv.org/pdf/2503.06364
Copy Paste: [[2503.06364]] Generative Video Bi-flow(https://arxiv.org/abs/2503.06364)
Keywords: generation, generative
Abstract: We propose a novel generative video model by robustly learning temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective of combining two aspects: The first is to map from the past into future video frames directly. Previous work has mapped the noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame, instead of noise, is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as the joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a baseline conditional diffusion but with higher speed, i.e., fewer ODE solver steps.

Title: EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors

Authors: Tao Feng, Yunke Zhang, Huandong Wang, Yong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06392
Pdf URL: https://arxiv.org/pdf/2503.06392
Copy Paste: [[2503.06392]] EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors(https://arxiv.org/abs/2503.06392)
Keywords: generation, generative
Abstract: User consumption behavior data, which records individuals' online spending history at various types of stores, has been widely used in various applications, such as store recommendation, site selection, and sale forecasting. However, its high worth is limited due to deficiencies in data comprehensiveness and changes of application scenarios. Thus, generating high-quality sequential consumption data by simulating complex user consumption behaviors is of great importance to real-world applications. Two branches of existing sequence generation methods are both limited in quality. Model-based methods with simplified assumptions fail to model the complex decision process of user consumption, while data-driven methods that emulate real-world data are prone to noise, unobserved behaviors, and dynamic decision space. In this work, we propose to enhance the fidelity and trustworthiness of the data-driven Generative Adversarial Imitation Learning (GAIL) method by blending it with the Exploration and Preferential Return (EPR) model. The core idea of our EPR-GAIL framework is to model user consumption behaviors as a complex EPR decision process, which consists of purchase, exploration, and preference decisions. Specifically, we design the hierarchical policy function in the generator as a realization of the EPR decision process and employ the probability distributions of the EPR model to guide the reward function in the discriminator. Extensive experiments on two real-world datasets of user consumption behaviors on an online platform demonstrate that the EPR-GAIL framework outperforms the best state-of-the-art baseline by over 19% in terms of data fidelity. Furthermore, the generated consumption behavior data can improve the performance of sale prediction and location recommendation by up to 35.29% and 11.19%, respectively, validating its advantage for practical applications.

Title: Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter

Authors: Yanyu Zhu, Licheng Bai, Jintao Xu, Jiwei Tang, Hai-tao Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06397
Pdf URL: https://arxiv.org/pdf/2503.06397
Copy Paste: [[2503.06397]] Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter(https://arxiv.org/abs/2503.06397)
Keywords: generation, generative
Abstract: Recent advances in diffusion-based lip-syncing generative models have demonstrated their ability to produce highly synchronized talking face videos for visual dubbing. Although these models excel at lip synchronization, they often struggle to maintain fine-grained control over facial details in generated images. In this work, we identify a "lip averaging" phenomenon in which the model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. This issue arises because the commonly used UNet backbone primarily integrates audio features into visual representations in the latent space via cross-attention mechanisms and multi-scale fusion, but it struggles to retain fine-grained lip details in the generated faces. To address this issue, we propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences while maintaining accurate lip synchronization. Specifically, our method comprises two primary components: (1) an Identity Perceiver module that encodes facial embeddings to align with conditioned audio features; and (2) an ID-CrossAttn module that injects facial embeddings into the generation process, enhancing the model's capability for identity retention. Extensive experiments demonstrate that, at a modest training and inference cost, UnAvgLip effectively mitigates the "averaging" phenomenon in lip inpainting, significantly preserving unique facial characteristics while maintaining precise lip synchronization. Compared with the original approach, our method demonstrates significant improvements of 5% on the identity consistency metric and 2% on the SSIM metric across two benchmark datasets (HDTF and LRW).

Title: Consistent Image Layout Editing with Diffusion Models

Authors: Tao Xia, Yudi Zhang, Ting Liu, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06419
Pdf URL: https://arxiv.org/pdf/2503.06419
Copy Paste: [[2503.06419]] Consistent Image Layout Editing with Diffusion Models(https://arxiv.org/abs/2503.06419)
Keywords: generation
Abstract: Despite the great success of large-scale text-to-image diffusion models in image generation and image editing, existing methods still struggle to edit the layout of real images. Although a few works have been proposed to tackle this problem, they either fail to adjust the layout of images or have difficulty preserving the visual appearance of objects after the layout adjustment. To bridge this gap, this paper proposes a novel image layout editing method that can not only re-arrange a real image to a specified layout, but also keep the visual appearance of the objects consistent with their appearance before editing. Concretely, the proposed method consists of two key components. Firstly, a multi-concept learning scheme is used to learn the concepts of different objects from a single image, which is crucial for keeping visual consistency in the layout editing. Secondly, it leverages the semantic consistency within intermediate features of diffusion models to project the appearance information of objects to the desired regions directly. Besides, a novel initialization noise design is adopted to facilitate the process of re-arranging the layout. Extensive experiments demonstrate that the proposed method outperforms previous works in both layout alignment and visual consistency for the task of image layout editing.

Title: Federated Learning for Diffusion Models

Authors: Zihao Peng, Xijun Wang, Shengbo Chen, Hong Rao, Cong Shen
Subjects: cs.LG, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2503.06426
Pdf URL: https://arxiv.org/pdf/2503.06426
Copy Paste: [[2503.06426]] Federated Learning for Diffusion Models(https://arxiv.org/abs/2503.06426)
Keywords: generative
Abstract: Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM (Federated Learning with Denoising Diffusion Probabilistic Models), which leverages the data generative capability of diffusion models to facilitate model training. In particular, the server uses well-trained local diffusion models uploaded by each client before FL training to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.

Title: Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning

Authors: Yu Jin, Jingming Liu, Zhexu Luo, Yifei Peng, Ziang Qin, Wang-Zhou Dai, Yao-Xiang Ding, Kun Zhou
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.06427
Pdf URL: https://arxiv.org/pdf/2503.06427
Copy Paste: [[2503.06427]] Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning(https://arxiv.org/abs/2503.06427)
Keywords: generation, generative
Abstract: Visual generative abductive learning studies jointly training a symbol-grounded neural visual generator and inducing logic rules from data, such that after learning, the visual generation process is guided by the induced logic rules. A major challenge for this task is to reduce the time cost of logic abduction during learning, an essential step when the logic symbol set is large and the logic rule to induce is complicated. To address this challenge, we propose a pre-training method for obtaining a meta-rule selection policy for the recently proposed visual generative learning approach AbdGen [Peng et al., 2023], aiming at significantly reducing the candidate meta-rule set and pruning the search space. The selection model is built on the embedding representation of both the symbol grounding of cases and the meta-rules, which can be effectively integrated with both the neural model and the logic reasoning system. The pre-training process is done on pure symbol data, not involving symbol grounding learning of raw visual inputs, making the entire learning process low-cost. An additional interesting observation is that the selection policy can rectify symbol grounding errors unseen during pre-training, which results from the memorization ability of the attention mechanism and the relative stability of symbolic patterns. Experimental results show that our method is able to effectively address the meta-rule selection problem for visual abduction, boosting the efficiency of visual generative abductive learning. Code is available at this https URL.

Title: CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data

Authors: Zuqing Li, Jianzhong Qi, Junhao Gan
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.06444
Pdf URL: https://arxiv.org/pdf/2503.06444
Copy Paste: [[2503.06444]] CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data(https://arxiv.org/abs/2503.06444)
Keywords: generative
Abstract: Diffusion-based tabular data synthesis models have yielded promising results. However, we observe that when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To address this issue, we propose CtrTab, a condition-controlled diffusion model for tabular data synthesis, to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios. Through CtrTab, we inject samples with added Laplace noise as control signals to improve data diversity and show its resemblance to L2 regularization, which enhances model robustness. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with a performance gap in accuracy of over 80% on average. Our source code will be released upon paper publication.

Title: Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning

Authors: Yanbiao Ma, Wei Dai, Wenke Huang, Jiayi Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06457
Pdf URL: https://arxiv.org/pdf/2503.06457
Copy Paste: [[2503.06457]] Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning(https://arxiv.org/abs/2503.06457)
Keywords: generation
Abstract: Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code published at: this https URL

Title: SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts

Authors: Shijia Zhao, Qiming Xia, Xusheng Guo, Pufan Zou, Maoji Zheng, Hai Wu, Chenglu Wen, Cheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06467
Pdf URL: https://arxiv.org/pdf/2503.06467
Copy Paste: [[2503.06467]] SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts(https://arxiv.org/abs/2503.06467)
Keywords: generation
Abstract: Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D detectors while requiring only a few annotated instances. Nevertheless, these methods struggle when accurate labels are extremely scarce. In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Based on these accurate semantic prompts, which we treat as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometric shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that chooses high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and Waymo Open Dataset (WOD) have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of the state-of-the-art methods. The code is available at this https URL.

Title: A Mesh Is Worth 512 Numbers: Spectral-domain Diffusion Modeling for High-dimension Shape Generation

Authors: Jiajie Fan, Amal Trigui, Andrea Bonfanti, Felix Dietrich, Thomas Bäck, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06485
Pdf URL: https://arxiv.org/pdf/2503.06485
Copy Paste: [[2503.06485]] A Mesh Is Worth 512 Numbers: Spectral-domain Diffusion Modeling for High-dimension Shape Generation(https://arxiv.org/abs/2503.06485)
Keywords: generation, generative
Abstract: Recent advancements in learning latent codes derived from high-dimensional shapes have demonstrated impressive outcomes in 3D generative modeling. Traditionally, these approaches employ a trained autoencoder to acquire a continuous implicit representation of source shapes, which can be computationally expensive. This paper introduces a novel framework, spectral-domain diffusion for high-quality shape generation (SpoDify), that utilizes singular value decomposition (SVD) for shape encoding. The resulting eigenvectors can be stored for subsequent decoding, while generative modeling is performed on the eigenfeatures. This approach efficiently encodes complex meshes into continuous implicit representations, such as encoding a 15k-vertex mesh to a 512-dimensional latent code without learning. Our method exhibits significant advantages in scenarios with limited samples or GPU resources. In mesh generation tasks, our approach produces high-quality shapes that are comparable to state-of-the-art methods.

Title: ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

Authors: Xukun Zhou, Fengxin Li, Ming Chen, Yan Zhou, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He, Zhaoxin Fan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06499
Pdf URL: https://arxiv.org/pdf/2503.06499
Copy Paste: [[2503.06499]] ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis(https://arxiv.org/abs/2503.06499)
Keywords: generation
Abstract: Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) a Motion Base Construction, which builds a gesture library using the training dataset; (2) a Motion Retrieval Module, employing contrastive learning and momentum distillation for fine-grained reference pose retrieval; and (3) a Precision Control Module, integrating partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.

Title: DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

Authors: Xirui Hu, Jiahao Wang, Hao Chen, Weizhan Zhang, Benqi Wang, Yikun Li, Haishun Nan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06505
Pdf URL: https://arxiv.org/pdf/2503.06505
Copy Paste: [[2503.06505]] DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability(https://arxiv.org/abs/2503.06505)
Keywords: generation
Abstract: Recent advancements in text-to-image generation have spurred interest in personalized human image generation, which aims to create novel images featuring specific human identities as reference images indicate. Although existing methods achieve high-fidelity identity preservation, they often struggle with limited multi-ID usability and inadequate facial editability. We present DynamicID, a tuning-free framework supported by a dual-stage training paradigm that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the original model when injecting ID features and achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which leverages contrastive learning to effectively disentangle and re-entangle facial motion and identity features, thereby enabling flexible facial editing. Additionally, we have developed a curated VariFace-10k facial dataset, comprising 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability.

Title: Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation

Authors: Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06506
Pdf URL: https://arxiv.org/pdf/2503.06506
Copy Paste: [[2503.06506]] Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation(https://arxiv.org/abs/2503.06506)
Keywords: generation, generative
Abstract: Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts, such as entity missing, attribute binding errors, and incorrect relationships, remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses (entity missing, entity mixing, attribute binding, and spatial relationships) and integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to proceed with the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at this https URL.

Title: A Light and Tuning-free Method for Simulating Camera Motion in Video Generation

Authors: Quanjian Song, Zhihang Lin, Zhanpeng Zeng, Ziyue Zhang, Liujuan Cao, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06508
Pdf URL: https://arxiv.org/pdf/2503.06508
Copy Paste: [[2503.06508]] A Light and Tuning-free Method for Simulating Camera Motion in Video Generation(https://arxiv.org/abs/2503.06508)
Keywords: generation
Abstract: Existing camera motion-controlled video generation methods face computational bottlenecks in fine-tuning and inference. This paper proposes LightMotion, a light and tuning-free method for simulating camera motion in video generation. Operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation, making it more streamlined than existing methods. The endeavors of this paper comprise: (i) The latent space permutation operation effectively simulates various camera motions like panning, zooming, and rotation. (ii) The latent space resampling strategy combines background-aware sampling and cross-frame alignment to accurately fill new perspectives while maintaining coherence across frames. (iii) Our in-depth analysis shows that the permutation and resampling cause an SNR shift in latent space, leading to poor-quality generation. To address this, we propose latent space correction, which reintroduces noise during denoising to mitigate SNR shift and enhance video generation quality. Exhaustive experiments show that our LightMotion outperforms existing methods, both quantitatively and qualitatively.

Title: One-Step Diffusion Model for Image Motion-Deblurring

Authors: Xiaoyang Liu, Yuquan Wang, Zheng Chen, Jiezhang Cao, He Zhang, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06537
Pdf URL: https://arxiv.org/pdf/2503.06537
Copy Paste: [[2503.06537]] One-Step Diffusion Model for Image Motion-Deblurring(https://arxiv.org/abs/2503.06537)
Keywords: restoration
Abstract: Currently, methods for single-image deblurring based on CNNs and transformers have demonstrated promising performance. However, these methods often suffer from perceptual limitations, poor generalization ability, and struggle with heavy or complex blur. While diffusion-based methods can partially address these shortcomings, their multi-step denoising process limits their practical usage. In this paper, we conduct an in-depth exploration of diffusion models in deblurring and propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step, significantly improving inference efficiency while maintaining high fidelity. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Additionally, we construct a high-quality synthetic deblurring dataset to mitigate perceptual collapse and design a dynamic dual-adapter (DDA) to enhance perceptual quality while preserving fidelity. Extensive experiments demonstrate that our method achieves strong performance on both full and no-reference metrics. Our code and pre-trained model will be publicly available at this https URL.

Title: ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

Authors: Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06542
Pdf URL: https://arxiv.org/pdf/2503.06542
Copy Paste: [[2503.06542]] ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy(https://arxiv.org/abs/2503.06542)
Keywords: generation
Abstract: Unified models (UniMs) for multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify embedding space integrating textual and visual modalities for enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a "what or how to generate" algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at this https URL.

Title: QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

Authors: Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06545
Pdf URL: https://arxiv.org/pdf/2503.06545
Copy Paste: [[2503.06545]] QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation(https://arxiv.org/abs/2503.06545)
Keywords: generation
Abstract: Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and cache mechanism, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72$\times$ on Open-Sora with minimal loss in generation quality. Extensive experiments across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. The code and models will be available at this https URL.

Title: Generative modelling with jump-diffusions

Authors: Adrian Baule
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.06558
Pdf URL: https://arxiv.org/pdf/2503.06558
Copy Paste: [[2503.06558]] Generative modelling with jump-diffusions(https://arxiv.org/abs/2503.06558)
Keywords: generation, generative
Abstract: Score-based diffusion models generate samples from an unknown target distribution using a time-reversed diffusion process. While such models represent state-of-the-art approaches in industrial applications such as artificial image generation, it has recently been noted that their performance can be further improved by considering injection noise with heavy tailed characteristics. Here, I present a generalization of generative diffusion processes to a wide class of non-Gaussian noise processes. I consider forward processes driven by standard Gaussian noise with super-imposed Poisson jumps representing a finite activity Levy process. The generative process is shown to be governed by a generalized score function that depends on the jump amplitude distribution. Both probability flow ODE and SDE formulations are derived using basic technical effort, and are implemented for jump amplitudes drawn from a multivariate Laplace distribution. Remarkably, for the problem of capturing a heavy-tailed target distribution, the jump-diffusion Laplace model outperforms models driven by alpha-stable noise despite not containing any heavy-tailed characteristics. The framework can be readily applied to other jump statistics that could further improve on the performance of standard diffusion models.
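
The forward process described in this abstract (Gaussian noise with superimposed Poisson jumps whose amplitudes follow a Laplace distribution) can be illustrated with a minimal one-dimensional sketch. The abstract does not give the drift, noise schedule, or parameter values, so the Ornstein-Uhlenbeck drift, the Euler discretization, and every name and default below (`beta`, `jump_rate`, `jump_scale`, `forward_jump_diffusion`) are assumptions for illustration, not the author's implementation. The paper uses a multivariate Laplace distribution; the scalar case is used here for brevity.

```python
import math
import random


def poisson_sample(rate, rng):
    """Knuth's algorithm: draw the number of events of a Poisson(rate) variable."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def laplace_sample(scale, rng):
    """Laplace(0, scale) as the difference of two i.i.d. exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def forward_jump_diffusion(x0, n_steps=1000, dt=1e-3, beta=1.0,
                           jump_rate=5.0, jump_scale=0.5, seed=0):
    """Euler scheme for dx = -0.5*beta*x dt + sqrt(beta) dW + dJ (assumed form).

    J is a compound Poisson process with intensity `jump_rate` and
    Laplace-distributed amplitudes, i.e. a finite-activity Levy process.
    Returns the simulated path and the total number of jumps.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    n_jumps = 0
    for _ in range(n_steps):
        drift = -0.5 * beta * x * dt
        diffusion = math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
        k = poisson_sample(jump_rate * dt, rng)  # jumps in this time step
        jumps = sum(laplace_sample(jump_scale, rng) for _ in range(k))
        x = x + drift + diffusion + jumps
        n_jumps += k
        path.append(x)
    return path, n_jumps


if __name__ == "__main__":
    path, n_jumps = forward_jump_diffusion(x0=2.0)
    print("final state:", path[-1], "jumps:", n_jumps)
```

Setting `jump_rate=0.0` recovers a purely Gaussian forward diffusion, which makes it easy to compare the two noise models on the same seed.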

Title: TR-DQ: Time-Rotation Diffusion Quantization

Authors: Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06564
Pdf URL: https://arxiv.org/pdf/2503.06564
Copy Paste: [[2503.06564]] TR-DQ: Time-Rotation Diffusion Quantization(https://arxiv.org/abs/2503.06564)
Keywords: generation
Abstract: Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead for its generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-steps variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time steps. Additionally, we also explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks and a 1.38-1.89x speedup and 1.97-2.58x memory reduction in inference compared to existing quantization methods.
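
The idea of applying a rotation matrix to smooth activations before quantization can be sketched on a toy example. The abstract does not specify which rotation TR-DQ uses or how it varies across time-steps, so the normalized 4x4 Hadamard matrix, the 4-bit symmetric quantizer, and all names below (`rotate`, `quantize`, the sample vectors) are assumptions chosen only to show the general mechanism: an orthogonal rotation leaves the inner product unchanged while spreading an outlier channel across all channels.

```python
import math

# Normalized 4x4 Hadamard matrix (assumed choice): orthogonal, so H^T H = I.
H = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def rotate(vec):
    """Multiply a length-4 vector by the rotation matrix H."""
    return [sum(h * v for h, v in zip(row, vec)) for row in H]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize(vec, bits=4):
    """Symmetric per-tensor quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / levels
    return [round(v / scale) * scale for v in vec]


activations = [8.0, 0.1, -0.2, 0.1]  # one large outlier channel
weights = [0.3, -0.5, 0.2, 0.4]

rot_act = rotate(activations)
rot_w = rotate(weights)

exact = dot(activations, weights)
plain_err = abs(dot(quantize(activations), quantize(weights)) - exact)
rot_err = abs(dot(quantize(rot_act), quantize(rot_w)) - exact)

print("max |activation| before/after rotation:",
      max(map(abs, activations)), max(map(abs, rot_act)))
print("full-precision product preserved:",
      math.isclose(dot(rot_act, rot_w), exact))
print("quantized-product error, plain vs rotated:", plain_err, rot_err)
```

Because the quantization step size is set by the largest magnitude, shrinking the dynamic range of the activations is what gives the quantizer finer resolution for the remaining channels.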

Title: Future-Aware Interaction Network For Motion Forecasting

Authors: Shijie Li, Xun Xu, Si Yong Yeo, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06565
Pdf URL: https://arxiv.org/pdf/2503.06565
Copy Paste: [[2503.06565]] Future-Aware Interaction Network For Motion Forecasting(https://arxiv.org/abs/2503.06565)
Keywords: generation
Abstract: Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code will be released upon acceptance.

Title: Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

Authors: Yao Cheng, Yibo Zhao, Jiapeng Zhu, Yao Liu, Xing Sun, Xiang Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06567
Pdf URL: https://arxiv.org/pdf/2503.06567
Copy Paste: [[2503.06567]] Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving(https://arxiv.org/abs/2503.06567)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated transformative potential across various domains, yet they face significant challenges in knowledge integration and complex problem reasoning, often leading to hallucinations and unreliable outputs. Retrieval-Augmented Generation (RAG) has emerged as a promising solution to enhance LLMs accuracy by incorporating external knowledge. However, traditional RAG systems struggle with processing complex relational information and multi-step reasoning, limiting their effectiveness in advanced problem-solving tasks. To address these limitations, we propose CogGRAG, a cognition inspired graph-based RAG framework, designed to improve LLMs performance in Knowledge Graph Question Answering (KGQA). Inspired by the human cognitive process of decomposing complex problems and performing self-verification, our framework introduces a three-stage methodology: decomposition, retrieval, and reasoning with self-verification. By integrating these components, CogGRAG enhances the accuracy of LLMs in complex problem solving. We conduct systematic experiments with three LLM backbones on four benchmark datasets, where CogGRAG outperforms the baselines.

Title: Conceptrol: Concept Control of Zero-shot Personalized Image Generation

Authors: Qiyuan He, Angela Yao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06568
Pdf URL: https://arxiv.org/pdf/2503.06568
Copy Paste: [[2503.06568]] Conceptrol: Concept Control of Zero-shot Personalized Image Generation(https://arxiv.org/abs/2503.06568)
Keywords: generation
Abstract: Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at this https URL.

Title: Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling

Authors: Long Peng, Anran Wu, Wenbo Li, Peizhe Xia, Xueyuan Dai, Xinjie Zhang, Xin Di, Haoze Sun, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06617
Pdf URL: https://arxiv.org/pdf/2503.06617
Copy Paste: [[2503.06617]] Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling(https://arxiv.org/abs/2503.06617)
Keywords: super-resolution
Abstract: Arbitrary-scale super-resolution (ASSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs with arbitrary upsampling factors using a single model, addressing the limitations of traditional SR methods constrained to fixed-scale factors (e.g., $\times 2$). Recent advances leveraging implicit neural representation (INR) have achieved great progress by modeling coordinate-to-pixel mappings. However, the efficiency of these methods may suffer from repeated upsampling and decoding, while their reconstruction fidelity and quality are constrained by the intrinsic representational limitations of coordinate-based functions. To address these challenges, we propose a novel ContinuousSR framework with a Pixel-to-Gaussian paradigm, which explicitly reconstructs 2D continuous HR signals from LR images using Gaussian Splatting. This approach eliminates the need for time-consuming upsampling and decoding, enabling extremely fast arbitrary-scale super-resolution. Once the Gaussian field is built in a single pass, ContinuousSR can perform arbitrary-scale rendering in just 1ms per scale. Our method introduces several key innovations. Through statistical ana

Title: Synthetic Data Generation for Minimum-Exposure Navigation in a Time-Varying Environment using Generative AI Models

Authors: Nachiket U. Bapat, Randy C. Paffenroth, Raghvendra V. Cowlagi
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2503.06619
Pdf URL: https://arxiv.org/pdf/2503.06619
Copy Paste: [[2503.06619]] Synthetic Data Generation for Minimum-Exposure Navigation in a Time-Varying Environment using Generative AI Models(https://arxiv.org/abs/2503.06619)
Keywords: generation, generative
Abstract: We study the problem of synthetic generation of samples of environmental features for autonomous vehicle navigation. These features are described by a spatiotemporally varying scalar field that we refer to as a threat field. The threat field is known to have some underlying dynamics subject to process noise. Some "real-world" data of observations of various threat fields are also available. The assumption is that the volume of "real-world" data is relatively small. The objective is to synthesize samples that are statistically similar to the data. The proposed solution is a generative artificial intelligence model that we refer to as a split variational recurrent neural network (S-VRNN). The S-VRNN merges the capabilities of a variational autoencoder, which is a widely used generative model, and a recurrent neural network, which is used to learn temporal dependencies in data. The main innovation in this work is that we split the latent space of the S-VRNN into two subspaces. The latent variables in one subspace are learned using the "real-world" data, whereas those in the other subspace are learned using the data as well as the known underlying system dynamics. Through numerical experiments we demonstrate that the proposed S-VRNN can synthesize data that are statistically similar to the training data even in the case of very small volume of "real-world" training data.

Title: Dynamic Updates for Language Adaptation in Visual-Language Tracking

Authors: Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06621
Pdf URL: https://arxiv.org/pdf/2503.06621
Copy Paste: [[2503.06621]] Dynamic Updates for Language Adaptation in Visual-Language Tracking(https://arxiv.org/abs/2503.06621)
Keywords: generation
Abstract: The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at this https URL.

Title: Chameleon: On the Scene Diversity and Domain Variety of AI-Generated Videos Detection

Authors: Meiyu Zeng, Xingming Liao, Canyu Chen, Nankai Lin, Zhuowei Wang, Chong Chen, Aimin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06624
Pdf URL: https://arxiv.org/pdf/2503.06624
Copy Paste: [[2503.06624]] Chameleon: On the Scene Diversity and Domain Variety of AI-Generated Videos Detection(https://arxiv.org/abs/2503.06624)
Keywords: generation
Abstract: Artificial intelligence generated content (AIGC), known as DeepFakes, has emerged as a growing concern because it is being utilized as a tool for spreading disinformation. While much research exists on identifying AI-generated text and images, research on detecting AI-generated videos is limited. Existing datasets for AI-generated videos detection exhibit limitations in terms of diversity, complexity, and realism. To address these issues, this paper focuses on AI-generated videos detection and constructs a diverse dataset named Chameleon. We generate videos through multiple generation tools and various real video sources. At the same time, we preserve the videos' real-world complexity, including scene switches and dynamic perspective changes, and expand beyond face-centered detection to include human actions and environment generation. Our work bridges the gap between AI-generated dataset construction and real-world forensic needs, offering a valuable benchmark to counteract the evolving threats of AI-generated content.

Title: Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias

Authors: Mingxiao Li, Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06632
Pdf URL: https://arxiv.org/pdf/2503.06632
Copy Paste: [[2503.06632]] Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias(https://arxiv.org/abs/2503.06632)
Keywords: generation
Abstract: Personalized image generation via text prompts has great potential to improve daily life and professional work by facilitating the creation of customized visual content. The aim of image personalization is to create images based on a user-provided subject while maintaining both consistency of the subject and flexibility to accommodate various textual descriptions of that subject. However, current methods face challenges in ensuring fidelity to the text prompt while not overfitting to the training data. In this work, we introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images, allowing the model to focus on learning an effective representation of the personalized subject. Moreover, current evaluation methods struggle due to the lack of a dedicated test set. The evaluation set-up typically relies on the training data of the personalization task to compute text-image and image-image similarity scores, which, while useful, tend to overestimate performance. Although human evaluations are commonly used as an alternative, they often suffer from bias and inconsistency. To address these issues, we curate a diverse and high-quality test set with well-designed prompts. With this new benchmark, automatic evaluation metrics can reliably assess model performance.

Title: Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Authors: Yihong Luo, Tianyang Hu, Yifan Song, Jiacheng Sun, Zhenguo Li, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06652
Pdf URL: https://arxiv.org/pdf/2503.06652
Copy Paste: [[2503.06652]] Adding Additional Control to One-Step Diffusion with Joint Distribution Matching(https://arxiv.org/abs/2503.06652)
Keywords: generation
Abstract: While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to newly emerging controls -- such as novel structural constraints or the latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM surpasses baseline methods such as multi-step ControlNet with a single step in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
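Note: to make the term "reverse KL divergence" concrete, the following is a minimal sketch on a toy discrete joint distribution over (image, condition) pairs. The tables `target` and `student` and the helper `kl` are hypothetical illustrations and not part of JDM, which works with diffusion models and a tractable upper bound rather than explicit probability tables. The reverse direction takes the expectation under the student's own distribution.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as dicts over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

# Hypothetical toy joint distributions over (image, condition) pairs.
target = {("img_a", "c0"): 0.4, ("img_b", "c0"): 0.1,
          ("img_a", "c1"): 0.1, ("img_b", "c1"): 0.4}
student = {("img_a", "c0"): 0.25, ("img_b", "c0"): 0.25,
           ("img_a", "c1"): 0.25, ("img_b", "c1"): 0.25}

# Reverse KL: expectation taken under the student distribution.
reverse_kl = kl(student, target)
# Forward KL: expectation taken under the target distribution.
forward_kl = kl(target, student)

print(f"reverse KL = {reverse_kl:.4f}, forward KL = {forward_kl:.4f}")
```

The two directions generally give different values, which is why the choice of direction matters when it is used as a training objective.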

Title: AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation

Authors: Yang Zou, Zhaoshuai Qi, Yating Liu, Zihao Xu, Weipeng Sun, Weiyi Liu, Xingyuan Li, Jiaqi Yang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06660
Pdf URL: https://arxiv.org/pdf/2503.06660
Copy Paste: [[2503.06660]] AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation(https://arxiv.org/abs/2503.06660)
Keywords: generation
Abstract: Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex input (i.e., multi-view reference input, depth, or CAD models) and intricate pipeline (i.e., feature extraction-SfM-2D to 3D matching-PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques, such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to unseen-object level.

Title: Emulating Self-attention with Convolution for Efficient Image Super-Resolution

Authors: Dongheon Lee, Seokju Yun, Youngmin Ro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06671
Pdf URL: https://arxiv.org/pdf/2503.06671
Copy Paste: [[2503.06671]] Emulating Self-attention with Convolution for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.06671)
Keywords: super-resolution
Abstract: In this paper, we tackle the high computational overhead of transformers for lightweight image super-resolution (SR). Motivated by the observations of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its involved memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale up window size to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing the latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the ability for long-range modeling, data scalability, and the representational power of transformers despite most self-attentions being replaced by the ConvAttn module.
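Note: the PSNR gains quoted above are in decibels, so a short sketch may help in reading them. It uses the standard definition PSNR = 10 log10(MAX^2 / MSE); the function names and the flat-list image representation are assumptions made for illustration. SR benchmarks often compute PSNR on a luminance channel with border cropping, which this sketch ignores.

```python
import math

def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)

# A 0.31 dB gain corresponds to multiplying the MSE by this factor.
print(f"{mse_ratio_for_gain(0.31):.4f}")
```

Under this definition, a gain of a few tenths of a dB corresponds to a reduction in mean squared error of several percent.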

Title: Learning Few-Step Diffusion Models by Trajectory Distribution Matching

Authors: Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06674
Pdf URL: https://arxiv.org/pdf/2503.06674
Copy Paste: [[2503.06674]] Learning Few-Step Diffusion Models by Trajectory Distribution Matching(https://arxiv.org/abs/2503.06674)
Keywords: generation
Abstract: Accelerating diffusion model sampling is crucial for efficient AIGC deployment. While diffusion distillation methods -- based on distribution matching and trajectory matching -- reduce sampling to as few as one step, they fall short on complex tasks like text-to-image generation. Few-step generation offers a better balance between speed and quality, but existing approaches face a persistent trade-off: distribution matching lacks flexibility for multi-step sampling, while trajectory matching often yields suboptimal image quality. To bridge this gap, we propose learning few-step diffusion models by Trajectory Distribution Matching (TDM), a unified distillation paradigm that combines the strengths of distribution and trajectory matching. Our method introduces a data-free score distillation objective, aligning the student's trajectory with the teacher's at the distribution level. Further, we develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling. This approach supports both deterministic sampling for superior image quality and flexible multi-step adaptation, achieving state-of-the-art performance with remarkable efficiency. Our model, TDM, outperforms existing methods on various backbones, such as SDXL and PixArt-$\alpha$, delivering superior quality and significantly reduced training costs. In particular, our method distills PixArt-$\alpha$ into a 4-step generator that outperforms its teacher on real user preference at 1024 resolution. This is accomplished with 500 iterations and 2 A800 hours -- a mere 0.01% of the teacher's training cost. In addition, our proposed TDM can be extended to accelerate text-to-video diffusion. Notably, TDM can outperform its teacher model (CogVideoX-2B) by using only 4 NFE on VBench, improving the total score from 80.91 to 81.65. Project page: this https URL

Title: REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints

Authors: Di Wu, Liu Liu, Zhou Linli, Anran Huang, Liangtu Song, Qiaojun Yu, Qi Wu, Cewu Lu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.06677
Pdf URL: https://arxiv.org/pdf/2503.06677
Copy Paste: [[2503.06677]] REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints(https://arxiv.org/abs/2503.06677)
Keywords: generation
Abstract: The 3D representations of articulated objects, which are prevalent in human life, play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling high-quality textured surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of articulated objects, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states and enables high-fidelity surface generation for unseen states. Codes will be released within the next four months.

Title: PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

Authors: Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06684
Pdf URL: https://arxiv.org/pdf/2503.06684
Copy Paste: [[2503.06684]] PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation(https://arxiv.org/abs/2503.06684)
Keywords: generation
Abstract: Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning, that is, simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality. They employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.

Title: UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion

Authors: Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Yingce Xia, Marwin Segler, Maik Riechert, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, physics.bio-ph, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2503.06687
Pdf URL: https://arxiv.org/pdf/2503.06687
Copy Paste: [[2503.06687]] UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion(https://arxiv.org/abs/2503.06687)
Keywords: generation, generative
Abstract: Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.

Title: Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping

Authors: Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das, Arnaud Demortière
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06699
Pdf URL: https://arxiv.org/pdf/2503.06699
Copy Paste: [[2503.06699]] Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping(https://arxiv.org/abs/2503.06699)
Keywords: quality assessment
Abstract: This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-Component Loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.

Title: Color Alignment in Diffusion

Authors: Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06746
Pdf URL: https://arxiv.org/pdf/2503.06746
Copy Paste: [[2503.06746]] Color Alignment in Diffusion(https://arxiv.org/abs/2503.06746)
Keywords: generation, generative
Abstract: Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.

Title: DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion

Authors: Hantao Zhang, Yuhe Liu, Jiancheng Yang, Weidong Guo, Xinyuan Wang, Pascal Fua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06748
Pdf URL: https://arxiv.org/pdf/2503.06748
Copy Paste: [[2503.06748]] DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion(https://arxiv.org/abs/2503.06748)
Keywords: generative
Abstract: Accurate medical image segmentation is crucial for precise anatomical delineation. Deep learning models like U-Net have shown great success but depend heavily on large datasets and struggle with domain shifts, complex structures, and limited training samples. Recent studies have explored diffusion models for segmentation by iteratively refining masks. However, these methods still retain the conventional image-to-mask mapping, making them highly sensitive to input data, which hampers stability and generalization. In contrast, we introduce DiffAtlas, a novel generative framework that models both images and masks through diffusion during training, effectively "GenAI-fying" atlas-based segmentation. During testing, the model is guided to generate a specific target image-mask pair, from which the corresponding mask is obtained. DiffAtlas retains the robustness of the atlas paradigm while overcoming its scalability and domain-specific limitations. Extensive experiments on CT and MRI across same-domain, cross-modality, varying-domain, and different data-scale settings using the MMWHS and TotalSegmentator datasets demonstrate that our approach outperforms existing methods, particularly in limited-data and zero-shot modality segmentation. Code is available at this https URL.

Title: Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints

Authors: Max Buckley, Konstantinos Papathanasiou, Andreas Spanopoulos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06751
Pdf URL: https://arxiv.org/pdf/2503.06751
Copy Paste: [[2503.06751]] Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints(https://arxiv.org/abs/2503.06751)
Keywords: generative
Abstract: This paper addresses the challenge of solving Constrained Markov Decision Processes (CMDPs) with $d > 1$ constraints when the transition dynamics are unknown, but samples can be drawn from a generative model. We propose a model-based algorithm for infinite horizon CMDPs with multiple constraints in the tabular setting, aiming to derive and prove sample complexity bounds for learning near-optimal policies. Our approach tackles both the relaxed and strict feasibility settings, where relaxed feasibility allows some constraint violations, and strict feasibility requires adherence to all constraints. The main contributions include the development of the algorithm and the derivation of sample complexity bounds for both settings. For the relaxed feasibility setting we show that our algorithm requires $\tilde{\mathcal{O}} \left( \frac{d |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^3\epsilon^2} \right)$ samples to return an $\epsilon$-optimal policy, while in the strict feasibility setting it requires $\tilde{\mathcal{O}} \left( \frac{d^3 |\mathcal{S}| |\mathcal{A}| \log(1/\delta)}{(1-\gamma)^5\epsilon^2{\zeta_{\mathbf{c}}^*}^2} \right)$ samples.
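Note: the two bounds above differ only in their dependence on $d$, $1-\gamma$, and $\zeta_{\mathbf{c}}^*$, which a small sketch can make explicit. The functions below evaluate only the leading terms inside the $\tilde{\mathcal{O}}(\cdot)$; constants and polylogarithmic factors hidden by the notation are dropped, so the outputs are useful only for comparing scaling, not as actual sample counts. Function and parameter names are assumptions for illustration.

```python
import math

def relaxed_leading_term(d, n_states, n_actions, gamma, eps, delta):
    """Leading term of the relaxed-feasibility bound (constants and polylog factors dropped)."""
    return d * n_states * n_actions * math.log(1 / delta) / ((1 - gamma) ** 3 * eps ** 2)

def strict_leading_term(d, n_states, n_actions, gamma, eps, delta, zeta):
    """Leading term of the strict-feasibility bound; zeta stands for zeta_c^* in the abstract."""
    return (d ** 3 * n_states * n_actions * math.log(1 / delta)
            / ((1 - gamma) ** 5 * eps ** 2 * zeta ** 2))

# Hypothetical problem sizes chosen only to show the scaling.
params = dict(d=2, n_states=10, n_actions=4, gamma=0.9, eps=0.1, delta=0.05)
ratio = strict_leading_term(zeta=0.5, **params) / relaxed_leading_term(**params)
print(f"strict / relaxed leading-term ratio: {ratio:.1f}")
```

Dividing the two expressions shows that, up to the hidden factors, the strict setting costs an extra factor of $d^2 / \left((1-\gamma)^2 {\zeta_{\mathbf{c}}^*}^2\right)$.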

Title: SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06764
Pdf URL: https://arxiv.org/pdf/2503.06764
Copy Paste: [[2503.06764]] SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation(https://arxiv.org/abs/2503.06764)
Keywords: generation
Abstract: We present SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation tasks, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degrading its high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves a state-of-the-art rFID score at 256$\times$256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.

Title: GenDR: Lightning Generative Detail Restorator

Authors: Yan Wang, Shijie Zhao, Kai Chen, Kexin Zhang, Junlin Li, Li Zhang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.06790
Pdf URL: https://arxiv.org/pdf/2503.06790
Copy Paste: [[2503.06790]] GenDR: Lightning Generative Detail Restorator(https://arxiv.org/abs/2503.06790)
Keywords: restoration, super-resolution, generative
Abstract: Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generation complexity. In contrast, SR tasks preserve most information from the low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.

Title: VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

Authors: Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06800
Pdf URL: https://arxiv.org/pdf/2503.06800
Copy Paste: [[2503.06800]] VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation(https://arxiv.org/abs/2503.06800)
Keywords: generation, generative
Abstract: Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, backflip) remains unclear. Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code are available at this https URL.

Title: GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought

Authors: Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06832
Pdf URL: https://arxiv.org/pdf/2503.06832
Copy Paste: [[2503.06832]] GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought(https://arxiv.org/abs/2503.06832)
Keywords: generation
Abstract: While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy by combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at this https URL.

Title: AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU

Authors: Zhuowen Zheng, Yain-Whar Si, Xiaochen Yuan, Junwei Duan, Ke Wang, Xiaofan Li, Xinyuan Zhang, Xueyuan Gong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06839
Pdf URL: https://arxiv.org/pdf/2503.06839
Copy Paste: [[2503.06839]] AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU(https://arxiv.org/abs/2503.06839)
Keywords: generative
Abstract: Nowadays, with the advancement of deep neural networks (DNNs) and the availability of large-scale datasets, face recognition (FR) models have achieved exceptional performance. However, the parameter magnitude of the fully connected (FC) layer depends directly on the number of identities in the dataset. When the FR model is trained on large-scale datasets, the model parameters become excessively large, leading to substantial demand for computational resources, such as time and memory. This paper proposes the attention fully connected (AttFC) layer, which could significantly reduce computational resources. AttFC employs an attention loader to generate the generative class center (GCC), and dynamically stores the class centers with a Dynamic Class Container (DCC). DCC only stores a small subset of all class centers in FC, thus its parameter count is substantially less than that of the FC layer. Also, training face recognition models on large-scale datasets with one GPU often encounters out-of-memory (OOM) issues. AttFC overcomes this and achieves comparable performance to state-of-the-art methods.

Title: From Image- to Pixel-level: Label-efficient Hyperspectral Image Reconstruction

Authors: Yihong Leng, Jiaojiao Li, Haitao Xu, Rui Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06852
Pdf URL: https://arxiv.org/pdf/2503.06852
Copy Paste: [[2503.06852]] From Image- to Pixel-level: Label-efficient Hyperspectral Image Reconstruction(https://arxiv.org/abs/2503.06852)
Keywords: super-resolution
Abstract: Current hyperspectral image (HSI) reconstruction methods primarily rely on image-level approaches, which require the time-consuming acquisition of abundant high-quality HSIs through imagers. In contrast, spectrometers offer a more efficient alternative by capturing high-fidelity point spectra, enabling pixel-level HSI reconstruction that balances accuracy and label efficiency. To this end, we introduce a pixel-level spectral super-resolution (Pixel-SSR) paradigm that reconstructs HSI from RGB and point spectra. Despite its advantages, Pixel-SSR presents two key challenges: 1) generalizability to novel scenes lacking point spectra, and 2) effective information extraction to promote reconstruction accuracy. To address the first challenge, a Gamma-modeled strategy is investigated to synthesize point spectra based on their intrinsic properties, including nonnegativity, a skewed distribution, and a positive correlation. Furthermore, complementary three-branch prompts from RGB and point spectra are extracted with a Dynamic Prompt Mamba (DyPro-Mamba), which progressively directs the reconstruction with global spatial distributions, edge details, and spectral dependency. Comprehensive evaluations, including horizontal comparisons with leading methods and vertical assessments across unsupervised and image-level supervised paradigms, demonstrate that our method achieves competitive reconstruction accuracy with efficient label consumption.

Title: Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting

Authors: Cagri Gungor, Derek Eppinger, Adriana Kovashka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06860
Pdf URL: https://arxiv.org/pdf/2503.06860
Copy Paste: [[2503.06860]] Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting(https://arxiv.org/abs/2503.06860)
Keywords: generation
Abstract: Tactile sensing, which relies on direct physical contact, is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. However, ensuring robust generalization in synthesizing tactile images, which requires capturing subtle, material-specific contact features, remains challenging. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models. To address this, we propose a leakage-free evaluation protocol coupled with novel, reference-free metrics (TMMD, I-TMMD, CI-TMMD, and D-TMMD) tailored for tactile generation. Moreover, we propose a vision-to-touch generation method that leverages text as an intermediate modality by incorporating concise, material-specific descriptions during training to better capture essential tactile features. Experiments on two popular visuo-tactile datasets, Touch and Go and HCT, show that our approach achieves superior performance and enhanced generalization in a leakage-free setting.

Title: ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06881
Pdf URL: https://arxiv.org/pdf/2503.06881
Copy Paste: [[2503.06881]] ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration(https://arxiv.org/abs/2503.06881)
Keywords: restoration
Abstract: Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at this https URL.

Title: Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

Authors: Yuefan Cao, Xuyang Guo, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang, Zhen Zhuang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.06884
Pdf URL: https://arxiv.org/pdf/2503.06884
Copy Paste: [[2503.06884]] Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help(https://arxiv.org/abs/2503.06884)
Keywords: generation, generative
Abstract: Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.

Title: CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

Authors: Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06896
Pdf URL: https://arxiv.org/pdf/2503.06896
Copy Paste: [[2503.06896]] CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution(https://arxiv.org/abs/2503.06896)
Keywords: super-resolution
Abstract: Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependency. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33dB and nearly double the inference speed.

Title: HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation

Authors: Xingzu Zhan, Chen Xie, Haoran Sun, Xiaochun Mai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06897
Pdf URL: https://arxiv.org/pdf/2503.06897
Copy Paste: [[2503.06897]] HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation(https://arxiv.org/abs/2503.06897)
Keywords: generation
Abstract: Text-to-motion generation is a rapidly growing field at the nexus of multimodal learning and computer graphics, promising flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, while subtle joint-level details remain overlooked from a spatial perspective. To this end, we propose a novel HiSTF Mamba framework. The framework is composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba incorporates "Part-based + Whole-based" parallel modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy, effectively encoding short-term motion details and long-term dependencies. DSFM further performs redundancy removal and extraction of complementary information for temporal features, then fuses them with spatial features, yielding an expressive spatio-temporal representation. Experimental results on the HumanML3D dataset demonstrate that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of nearly 30%. These findings validate the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.

Title: DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

Authors: Xiaoliang Ju, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06900
Pdf URL: https://arxiv.org/pdf/2503.06900
Copy Paste: [[2503.06900]] DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation(https://arxiv.org/abs/2503.06900)
Keywords: generation, generative
Abstract: We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). GS-based rendering for 3D content has gained considerable attention recently. However, there has been limited exploration in directly generating 3D Gaussians compared to traditional generative modeling approaches. The main challenge lies in the complex data structure of GS represented by discrete point clouds with multiple channels. To overcome this challenge, we propose employing the triplane representation, which allows us to represent Gaussian Splatting as an image-like continuous field. This representation effectively encodes both the geometry and texture information, enabling smooth transformation back to Gaussian point clouds and rendering into images by a TriRenderer, with only 2D supervisions. The proposed TriRenderer is fully differentiable, so that the rendering loss can supervise both texture and geometry encoding. Furthermore, the triplane representation can be compressed using a Variational Autoencoder (VAE), which can subsequently be utilized in latent diffusion to generate 3D objects. The experiments demonstrate that the proposed generation framework can produce high-quality 3D object geometry and rendering results in the text-to-3D task.

Title: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06923
Pdf URL: https://arxiv.org/pdf/2503.06923
Copy Paste: [[2503.06923]] From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers(https://arxiv.org/abs/2503.06923)
Keywords: generation
Abstract: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially in high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99× on FLUX and 5.00× on HunyuanVideo without additional training. On DiT, it achieves 3.41 lower FID compared with previous SOTA at 4.53× acceleration. Our code has been released on GitHub: this https URL

Title: Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping

Authors: Ning Ding, Jing Han, Yuchuan Tian, Chao Xu, Kai Han, Yehui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06930
Pdf URL: https://arxiv.org/pdf/2503.06930
Copy Paste: [[2503.06930]] Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping(https://arxiv.org/abs/2503.06930)
Keywords: generation
Abstract: Diffusion Transformer (DiT) has now become the preferred choice for building image generation models due to its great generation capability. Unlike previous convolution-based UNet models, DiT is purely composed of a stack of transformer blocks, which renders DiT excellent in scalability like large language models. However, the growing model size and multi-step sampling paradigm bring about considerable pressure on deployment and inference. In this work, we propose a post-training quantization framework tailored for Diffusion Transformers to tackle these challenges. We first find that the quantization difficulty of DiT mainly originates from the time-dependent channel-specific outliers. We propose a timestep-aware shift-and-scale strategy to smooth the activation distribution to reduce the quantization error. Secondly, based on the observation that activations of adjacent timesteps have similar distributions, we utilize a hierarchical clustering scheme to divide the denoising timesteps into multiple groups. We further design a re-parameterization scheme which absorbs the quantization parameters into nearby modules to avoid redundant computations. Comprehensive experiments demonstrate that our PTQ method successfully quantizes the Diffusion Transformer into 8-bit weight and 8-bit activation (W8A8) with a state-of-the-art FID score. Our method can further quantize the DiT model into 4-bit weight and 8-bit activation (W4A8) without sacrificing generation quality.

Title: Motion Anything: Any to Motion Generation

Authors: Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06955
Pdf URL: https://arxiv.org/pdf/2503.06955
Copy Paste: [[2503.06955]] Motion Anything: Any to Motion Generation(https://arxiv.org/abs/2503.06955)
Keywords: generation
Abstract: Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website at this https URL.

Title: LatexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending

Authors: Jian Jin, Zhenbo Yu, Yang Shen, Zhenyong Fu, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06956
Pdf URL: https://arxiv.org/pdf/2503.06956
Copy Paste: [[2503.06956]] LatexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending(https://arxiv.org/abs/2503.06956)
Keywords: generation
Abstract: Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation meets a broader demand for user creation, whereas existing methods face challenges with generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually, storing them in a concept bank with a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.

Title: A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Authors: Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, Shiguo Lian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.06973
Pdf URL: https://arxiv.org/pdf/2503.06973
Copy Paste: [[2503.06973]] A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis(https://arxiv.org/abs/2503.06973)
Keywords: generative
Abstract: While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications. The dataset is available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CDDMBench.
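The low-rank adaptation (LoRA) idea mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration of the generic LoRA update on a single linear layer, not the authors' finetuning code; the layer sizes, the rank r, the scaling alpha, and the function name lora_forward are all assumptions chosen for illustration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T.

    W is the frozen pretrained weight; only A and B would be trained.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4                 # hypothetical sizes
W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialised
x = rng.normal(size=(8, d_in))             # a batch of 8 inputs

y = lora_forward(x, W, A, B)
full_params = W.size                       # parameters in the frozen weight
lora_params = A.size + B.size              # parameters that would be trained
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the number of trainable parameters is much smaller than the size of W.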

Title: Learning Decision Trees as Amortized Structure Inference

Authors: Mohammed Mahfoud, Ghait Boukachab, Michał Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.06985
Pdf URL: https://arxiv.org/pdf/2503.06985
Copy Paste: [[2503.06985]] Learning Decision Trees as Amortized Structure Inference(https://arxiv.org/abs/2503.06985)
Keywords: generative
Abstract: Building predictive models for tabular data presents fundamental challenges, notably in scaling consistently, i.e., more resources translating to better performance, and generalizing systematically beyond the training data distribution. Designing decision tree models remains especially challenging given the intractably large search space, and most existing methods rely on greedy heuristics, while deep learning inductive biases expect a temporal or spatial structure not naturally present in tabular data. We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data, formulating decision tree construction as a sequential planning problem. We train a deep reinforcement learning (GFlowNet) policy to solve this problem, yielding a generative model that samples decision trees from the Bayesian posterior. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks derived from real-world data, robustness to distribution shifts, and anomaly detection, all while yielding interpretable models with shorter description lengths. Samples from the trained DT-GFN model can be ensembled to construct a random forest, and we further show that performance scales consistently with ensemble size, yielding ensembles of predictors that continue to generalize systematically.

Title: ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration

Authors: Youngseok Kim, Sunwook Hwang, Hyung-Sin Kim, Saewoong Bahk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.06986
Pdf URL: https://arxiv.org/pdf/2503.06986
Copy Paste: [[2503.06986]] ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration(https://arxiv.org/abs/2503.06986)
Keywords: restoration
Abstract: The growing use of 3D point cloud data in autonomous vehicles (AVs) has raised serious privacy concerns, particularly due to the sensitive information that can be extracted from 3D data. While model inversion attacks have been widely studied in the context of 2D data, their application to 3D point clouds remains largely unexplored. To fill this gap, we present the first in-depth study of model inversion attacks aimed at restoring 3D point cloud scenes. Our analysis reveals unique challenges: the inherent sparsity of 3D point clouds and the ambiguity between empty and non-empty voxels after voxelization, both of which are further exacerbated by the dispersion of non-empty voxels across feature extractor layers. To address these challenges, we introduce ConcreTizer, a simple yet effective model inversion attack designed specifically for voxel-based 3D point cloud data. ConcreTizer incorporates Voxel Occupancy Classification to distinguish between empty and non-empty voxels and Dispersion-Controlled Supervision to mitigate non-empty voxel dispersion. Extensive experiments on widely used 3D feature extractors and benchmark datasets, such as KITTI and Waymo, demonstrate that ConcreTizer concretely restores the original 3D point cloud scene from disrupted 3D feature data. Our findings highlight both the vulnerability of 3D data to inversion attacks and the urgent need for robust defense strategies.

Title: NukesFormers: Unpaired Hyperspectral Image Generation with Non-Uniform Domain Alignment

Authors: Jiaojiao Li, Shiyao Duan, Haitao XU, Rui Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07004
Pdf URL: https://arxiv.org/pdf/2503.07004
Copy Paste: [[2503.07004]] NukesFormers: Unpaired Hyperspectral Image Generation with Non-Uniform Domain Alignment(https://arxiv.org/abs/2503.07004)
Keywords: generation
Abstract: The inherent difficulty in acquiring accurately co-registered RGB-hyperspectral image (HSI) pairs has significantly impeded the practical deployment of current data-driven Hyperspectral Image Generation (HIG) networks in engineering applications. Meanwhile, the ill-posed nature of the aligning constraints, compounded with the complexities of mining cross-domain features, also hinders the advancement of unpaired HIG (UnHIG) tasks. In this paper, we conquer these challenges by modeling UnHIG as range-space interaction and null-space compensation through the Range-Null Space Decomposition (RND) methodology. Specifically, the introduced contrastive learning effectively aligns the geometric and spectral distributions of unpaired data by building the interaction of range space, considering the consistent feature in the degradation process. Following this, we map the frequency representations of the dual-domain input and thoroughly mine the null space, such as degraded and high-frequency components, through the proposed Non-uniform Kolmogorov-Arnold Networks. Extensive comparative experiments demonstrate that it establishes a new benchmark in UnHIG.
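The range-null space decomposition named in the abstract can be illustrated with a small linear-algebra sketch. The abstract does not specify the degradation operator, so the code below assumes, purely for illustration, a linear spectral response A that maps a 31-band hyperspectral pixel to 3 RGB channels; the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
bands, channels = 31, 3                    # hypothetical HSI bands and RGB channels
A = rng.random((channels, bands))          # assumed linear spectral response (HSI -> RGB)
A_pinv = np.linalg.pinv(A)                 # Moore-Penrose pseudo-inverse of A

x = rng.random(bands)                      # one hyperspectral pixel (unknown in practice)
y = A @ x                                  # its RGB observation

# Range-space part: fully determined by the observation y.
x_range = A_pinv @ y
# Null-space part: invisible to A, so it has to be supplied by a learned model.
x_null = x - A_pinv @ (A @ x)
```

The range-space component reproduces the RGB observation exactly, while the null-space component leaves the observation unchanged. That is why a generator is free to fill in the null-space part without violating consistency with the RGB input.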

Title: EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Authors: Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, Jiaming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07027
Pdf URL: https://arxiv.org/pdf/2503.07027
Copy Paste: [[2503.07027]] EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer(https://arxiv.org/abs/2503.07027)
Keywords: generation
Abstract: Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.

Title: Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

Authors: Haolong Ma, Hui Li, Chunyang Cheng, Zeyang Zhang, Xiaoning Song, Xiao-Jun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07033
Pdf URL: https://arxiv.org/pdf/2503.07033
Copy Paste: [[2503.07033]] Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion(https://arxiv.org/abs/2503.07033)
Keywords: restoration
Abstract: All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs often rely on highly synthetic multi-modal multi-quality images for supervision, limiting their effectiveness in cross-modal and rare degradation scenarios. The inherent relationship among these multi-modal, multi-quality images of the same scene provides explicit supervision for training, but also gives rise to the above problems. To address these limitations, we present LURE, a Learning-driven Unified Representation model for infrared and visible Image Fusion, which is degradation-aware. LURE decouples multi-modal multi-quality data at the data level and recouples this relationship in a unified latent feature space (ULFS) by proposing a novel unified loss. This decoupling circumvents data-level limitations of prior models and allows leveraging real-world restoration datasets for training high-quality degradation-aware models, sidestepping the above issues. To enhance text-image interaction, we refine image-text interaction and residual structures via Text-Guided Attention (TGA) and an inner residual structure. These enhance the text's spatial perception of images and preserve more visual details. Experiments show that our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code will be publicly available.

Title: Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion

Authors: Yongle Zhang, Yimin Liu, Qiang Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07047
Pdf URL: https://arxiv.org/pdf/2503.07047
Copy Paste: [[2503.07047]] Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion(https://arxiv.org/abs/2503.07047)
Keywords: generative
Abstract: Diffusion models have become widely adopted in image completion tasks, with text prompts commonly employed to ensure semantic coherence by providing high-level guidance. However, a persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. While text prompts offer semantic direction, they often fail to precisely recover fine-grained structural details, such as the object's overall posture, ensuring alignment with the visible object information in the background. This limitation stems from the inability of text prompts to provide pixel-level specificity. To address this, we propose supplementing text-based guidance with a novel visual aid: a casual sketch, which can be roughly drawn by anyone based on visible object parts. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background. We introduce the Visual Sketch Self-Aware (VSSA) model, which integrates the casual sketch into each iterative step of the diffusion process, offering distinct advantages for partially corrupted scenarios. By blending sketch-derived features with those of the corrupted image, and leveraging text prompt guidance, the VSSA assists the diffusion model in generating images that preserve both the intended object semantics and structural consistency across the restored objects and original regions. To support this research, we created two datasets, CUB-sketch and MSCOCO-sketch, each combining images, sketches, and text. Extensive qualitative and quantitative experiments demonstrate that our approach outperforms several state-of-the-art methods.

Title: TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

Authors: Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Peng Gao, Hongsheng Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.07050
Pdf URL: https://arxiv.org/pdf/2503.07050
Copy Paste: [[2503.07050]] TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation(https://arxiv.org/abs/2503.07050)
Keywords: generation, generative
Abstract: Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features, revealing that diffusion models inherently learn hierarchical features at multiple levels (e.g., 3D, semantic, class) during generative pre-training. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97, demonstrating superior accuracy in capturing activation dynamics along the denoising trajectory. Beyond interpretability, we showcase TIDE's potential in downstream applications such as sparse activation-guided image editing and style transfer, enabling improved controllability for generative systems. By providing a comprehensive training and evaluation protocol tailored for DiTs, TIDE contributes to developing more interpretable, transparent, and trustworthy generative models.
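The two reconstruction metrics quoted in the abstract, mean squared error and cosine similarity, can be computed as in the sketch below. The activations here are random stand-ins, and the function name reconstruction_metrics and the array shapes are assumptions for illustration; the sketch does not reproduce TIDE's sparse autoencoder or its reported numbers.

```python
import numpy as np

def reconstruction_metrics(acts, recon):
    """Return (MSE, mean per-token cosine similarity) between two activation arrays.

    Both arrays have shape (num_tokens, hidden_dim).
    """
    mse = float(np.mean((acts - recon) ** 2))
    dot = np.sum(acts * recon, axis=-1)
    norms = np.linalg.norm(acts, axis=-1) * np.linalg.norm(recon, axis=-1)
    cosine = float(np.mean(dot / norms))
    return mse, cosine

rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 128))                    # stand-in for DiT activations
noisy = acts + 0.1 * rng.normal(size=acts.shape)     # stand-in for an imperfect reconstruction

mse_exact, cos_exact = reconstruction_metrics(acts, acts)
mse_noisy, cos_noisy = reconstruction_metrics(acts, noisy)
```

A perfect reconstruction gives an MSE of 0 and a cosine similarity of 1; any reconstruction error raises the former and lowers the latter.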

Title: Generative method for aerodynamic optimization based on classifier-free guided denoising diffusion probabilistic model

Authors: Shisong Deng, Qiang Zhang, Zhengyang Cai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07056
Pdf URL: https://arxiv.org/pdf/2503.07056
Copy Paste: [[2503.07056]] Generative method for aerodynamic optimization based on classifier-free guided denoising diffusion probabilistic model(https://arxiv.org/abs/2503.07056)
Keywords: generative
Abstract: The inverse design approach, which directly generates optimal aerodynamic shapes with neural network models to meet designated performance targets, has drawn enormous attention. However, the current state-of-the-art inverse design approach for airfoils, which is based on generative adversarial networks, demonstrates insufficient precision in its generating and training processes and struggles to reveal the coupling relationship among specified performance indicators. To address these issues, this paper proposes an airfoil inverse design framework based on the classifier-free guided denoising diffusion probabilistic model (CDDPM). First, the CDDPM can effectively capture the correlations among specific performance indicators and, by adjusting the classifier-free guide coefficient, generate corresponding upper and lower surface pressure coefficient distributions based on designated pressure features. These distributions are then accurately translated into airfoil geometries through a mapping model. Experimental results using classical transonic airfoils as examples show that the inverse design based on CDDPM can generate a variety of pressure coefficient distributions, which enriches the diversity of design results. Compared with current state-of-the-art Wasserstein generative adversarial network methods, CDDPM achieves a 33.6% precision improvement in airfoil generating tasks. Moreover, a practical method to readjust each performance indicator value is proposed, based on a global optimization algorithm in conjunction with an active learning strategy, aiming to provide a rational combination of performance indicator values for the inverse design framework. This work is not only suitable for airfoil design but can also be applied to the optimization of general product parts targeting selected performance indicators.
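The role of the classifier-free guide coefficient can be sketched with the standard guidance rule, which blends an unconditional and a conditional noise prediction. The abstract does not state which parameterisation the paper uses, so the form below (unconditional prediction plus w times the difference) is an assumption; another common convention writes the same family as (1 + w) * eps_cond - w * eps_uncond. The arrays stand in for network outputs and are hypothetical.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guided noise estimate.

    w = 0 recovers the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates further towards the condition.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 16))   # stand-in for the model output without pressure features
eps_cond = rng.normal(size=(4, 16))     # stand-in for the model output with pressure features

guided = cfg_noise(eps_uncond, eps_cond, w=2.0)
```

Raising w generally trades sample diversity for closer adherence to the conditioning signal, which is the knob the abstract refers to when it mentions adjusting the guide coefficient.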

Title: Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs

Authors: Amira Guesmi, Bassem Ouni, Muhammad Shafique
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07058
Pdf URL: https://arxiv.org/pdf/2503.07058
Copy Paste: [[2503.07058]] Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs(https://arxiv.org/abs/2503.07058)
Keywords: generation
Abstract: Quantized Neural Networks (QNNs) have emerged as a promising solution for reducing model size and computational costs, making them well-suited for deployment in edge and resource-constrained environments. While quantization is known to disrupt gradient propagation and enhance robustness against pixel-level adversarial attacks, its effectiveness against patch-based adversarial attacks remains largely unexplored. In this work, we demonstrate that adversarial patches remain highly transferable across quantized models, achieving over 70% attack success rates (ASR) even at extreme bit-width reductions (e.g., 2-bit). This challenges the common assumption that quantization inherently mitigates adversarial threats. To address this, we propose Quantization-Aware Defense Training with Randomization (QADT-R), a novel defense strategy that integrates Adaptive Quantization-Aware Patch Generation (A-QAPA), Dynamic Bit-Width Training (DBWT), and Gradient-Inconsistent Regularization (GIR) to enhance resilience against highly transferable patch-based attacks. A-QAPA generates adversarial patches within quantized models, ensuring robustness across different bit-widths. DBWT introduces bit-width cycling during training to prevent overfitting to a specific quantization setting, while GIR injects controlled gradient perturbations to disrupt adversarial optimization. Extensive evaluations on CIFAR-10 and ImageNet show that QADT-R reduces ASR by up to 25% compared to prior defenses such as PBAT and DWQ. Our findings further reveal that PBAT-trained models, while effective against seen patch configurations, fail to generalize to unseen patches due to quantization shift. Additionally, our empirical analysis of gradient alignment, spatial sensitivity, and patch visibility provides insights into the mechanisms that contribute to the high transferability of patch-based attacks in QNNs.
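To make the notions of bit-width and bit-width cycling concrete, the sketch below applies a plain uniform quantizer at several bit-widths and cycles through them in a fixed order. The quantizer, the schedule [8, 4, 2], and all names are assumptions for illustration; the abstract does not describe the paper's actual quantization scheme or DBWT schedule.

```python
import itertools
import numpy as np

def quantize_uniform(w, bits):
    """Round values in w onto 2**bits evenly spaced levels between w.min() and w.max()."""
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (levels - 1)
    q = np.round((w - lo) / scale)        # integer level index in [0, levels - 1]
    return q * scale + lo                 # map the index back to a real value

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)           # stand-in for a layer's weights

# Hypothetical cycling schedule: one bit-width per training step, repeated.
bit_schedule = itertools.cycle([8, 4, 2])
first_five = [next(bit_schedule) for _ in range(5)]

quantized = {bits: quantize_uniform(weights, bits) for bits in (8, 4, 2)}
errors = {bits: float(np.mean(np.abs(weights - q))) for bits, q in quantized.items()}
```

At 2 bits the weights collapse onto at most four distinct values, and the mean quantization error grows as the bit-width shrinks.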

Title: NFIG: Autoregressive Image Generation with Next-Frequency Prediction

Authors: Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Chi Zhang, Xuelong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07076
Pdf URL: https://arxiv.org/pdf/2503.07076
Copy Paste: [[2503.07076]] NFIG: Autoregressive Image Generation with Next-Frequency Prediction(https://arxiv.org/abs/2503.07076)
Keywords: generation
Abstract: Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with 1.25× speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.
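The coarse-to-fine spectral ordering described in the abstract can be illustrated by splitting an image into radial frequency bands with a 2D FFT and adding them back from low to high frequency. This is only a sketch of the general idea under assumed choices (equal-width radial bands, a grayscale image, the function name frequency_bands); it is not the tokenizer or the stage design used by NFIG.

```python
import numpy as np

def frequency_bands(image, num_bands):
    """Split a 2D image into num_bands components that sum back to the image.

    Band 0 holds the lowest spatial frequencies (global structure);
    later bands hold progressively finer detail.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)            # radial frequency of each FFT bin
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    spectrum = np.fft.fft2(image)
    bands = []
    for i in range(num_bands):
        if i == num_bands - 1:
            upper = radius <= edges[i + 1]         # last band includes the top edge
        else:
            upper = radius < edges[i + 1]
        mask = (radius >= edges[i]) & upper
        bands.append(np.fft.ifft2(spectrum * mask).real)
    return bands

rng = np.random.default_rng(0)
image = rng.random((32, 32))                       # stand-in grayscale image
bands = frequency_bands(image, num_bands=4)

# Coarse-to-fine reconstruction: each partial sum adds one more frequency band.
partial_sums = np.cumsum(np.stack(bands), axis=0)
```

The first band alone carries the image mean and the coarse layout, and the final partial sum recovers the original image, mirroring the low-frequency-first generation order the abstract describes.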

Title: Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching

Authors: Zhen Zou, Hu Yu, Jie Xiao, Feng Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07120
Pdf URL: https://arxiv.org/pdf/2503.07120
Copy Paste: [[2503.07120]] Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching(https://arxiv.org/abs/2503.07120)
Keywords: generation
Abstract: Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this problem, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning with non-cache diffusion without analyzing the impact of caching on the intermediate generation process. This lack of exploration leaves room for analysis and improvement. In this paper, we analyze the impact of caching on the SNR of the diffusion process and discern that feature caching intensifies the denoising procedure, and we further identify this as a more severe exposure bias issue. Drawing on this insight, we introduce EB-Cache, a joint cache strategy that aligns with the non-exposure-bias diffusion process (which gives us a higher performance ceiling). Our approach incorporates a comprehensive understanding of caching mechanisms and offers a novel perspective on leveraging caches to expedite diffusion processes. Empirical results indicate that EB-Cache optimizes model performance while concurrently facilitating acceleration. Specifically, in the 50-step generation process, EB-Cache achieves 1.49× acceleration with 0.63 FID reduction from 3.69, surpassing prior acceleration methods. Code will be available at this https URL.
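Feature caching in general can be sketched as recomputing expensive intermediate features only on some denoising steps and reusing the stored result on the others. The loop below is a generic illustration with a hypothetical fixed refresh interval and a placeholder feature function; it is not the EB-Cache strategy, whose details the abstract does not give.

```python
def run_with_feature_cache(num_steps, refresh_every, compute_features):
    """Run num_steps iterations, recomputing features only every refresh_every steps.

    Returns the features used at each step and the number of full computations.
    """
    cache = None
    full_calls = 0
    used = []
    for step in range(num_steps):
        if cache is None or step % refresh_every == 0:
            cache = compute_features(step)   # expensive path: refresh the cache
            full_calls += 1
        used.append(cache)                   # cheap path: reuse the cached features
    return used, full_calls

# Placeholder "features": just the step index at which they were computed.
used, full_calls = run_with_feature_cache(
    num_steps=50, refresh_every=2, compute_features=lambda step: step
)
```

With a refresh interval of 2 over 50 steps, only 25 full computations are performed, and every odd step runs on features computed one step earlier. That staleness is the kind of mismatch between cached and non-cached trajectories that the abstract analyzes as exposure bias.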

Title: Controllable 3D Outdoor Scene Generation via Scene Graphs

Authors: Yuheng Liu, Xinke Li, Yuning Zhang, Lu Qi, Xin Li, Wenping Wang, Chongshou Li, Xueting Li, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07152
Pdf URL: https://arxiv.org/pdf/2503.07152
Copy Paste: [[2503.07152]] Controllable 3D Outdoor Scene Generation via Scene Graphs(https://arxiv.org/abs/2503.07152)
Keywords: generation
Abstract: Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs, an accessible and user-friendly control format, to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.

Title: Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

Authors: Jiaming Song, Linqi Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07154
Pdf URL: https://arxiv.org/pdf/2503.07154
Copy Paste: [[2503.07154]] Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms(https://arxiv.org/abs/2503.07154)
Keywords: generative
Abstract: Recent years have seen significant advancements in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits the progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency during inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.

Title: Effective and Efficient Masked Image Generation Models

Authors: Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07197
Pdf URL: https://arxiv.org/pdf/2503.07197
Copy Paste: [[2503.07197]] Effective and Efficient Masked Image Generation Models(https://arxiv.org/abs/2503.07197)
Keywords: generation
Abstract: Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.

Title: Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation

Authors: Ruochen Pi, Lianlei Shan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07209
Pdf URL: https://arxiv.org/pdf/2503.07209
Copy Paste: [[2503.07209]] Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation(https://arxiv.org/abs/2503.07209)
Keywords: generation
Abstract: Collecting and annotating medical images is a time-consuming and resource-intensive task. However, generating synthetic data through models such as Diffusion offers a cost-effective alternative. This paper introduces a new method for the automatic generation of accurate semantic masks from synthetic lung X-ray images based on a stable diffusion model trained on text-image pairs. This method uses cross-attention mapping between text and image to extend text-driven image synthesis to semantic mask generation. It employs text-guided cross-attention information to identify specific areas in an image and combines this with innovative techniques to produce high-resolution, class-differentiated pixel masks. This approach significantly reduces the costs associated with data collection and annotation. The experimental results demonstrate that segmentation models trained on synthetic data generated using the method are comparable to, and in some cases even better than, models trained on real datasets. This shows the effectiveness of the method and its potential to revolutionize medical image analysis.

Title: Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios

Authors: Chenglu Pan, Xiaogang Xu, Ganggui Ding, Yunke Zhang, Wenbo Li, Jiarong Xu, Qingbiao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07232
Pdf URL: https://arxiv.org/pdf/2503.07232
Copy Paste: [[2503.07232]] Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios(https://arxiv.org/abs/2503.07232)
Keywords: restoration, super-resolution
Abstract: Restoring low-resolution text images presents a significant challenge, as it requires maintaining both the fidelity and stylistic realism of the text in restored images. Existing text image restoration methods often fall short in hard situations, as the traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework aimed at improving the generalization ability of diffusion models for text image super-resolution (SR), especially promoting fidelity. First, we propose a progressive data sampling strategy that incorporates diverse image types at different stages of training, stabilizing the convergence and improving the generalization. For the network architecture, we leverage a pre-trained SR prior to provide robust spatial reasoning capabilities, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in textual priors, we utilize confidence scores to dynamically adjust the importance of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearances but also improves the accuracy of text structure.

Title: WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Authors: Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07265
Pdf URL: https://arxiv.org/pdf/2503.07265
Copy Paste: [[2503.07265]] WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation(https://arxiv.org/abs/2503.07265)
Keywords: generation
Abstract: Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at this https URL.

Title: Automated Movie Generation via Multi-Agent CoT Planning

Authors: Weijia Wu, Zeyu Zhu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07314
Pdf URL: https://arxiv.org/pdf/2503.07314
Copy Paste: [[2503.07314]] Automated Movie Generation via Multi-Agent CoT Planning(https://arxiv.org/abs/2503.07314)
Keywords: generation
Abstract: Existing long-form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies. To address these challenges, we present MovieAgent, which automates movie generation via multi-agent Chain of Thought (CoT) planning. MovieAgent offers two key advantages: 1) We first explore and define the paradigm of automated movie/long-video generation. Given a script and character bank, our MovieAgent can generate multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio throughout the film. 2) MovieAgent introduces a hierarchical CoT-based reasoning process to automatically structure scenes, camera settings, and cinematography, significantly reducing human effort. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline. Experiments demonstrate that MovieAgent achieves new state-of-the-art results in script faithfulness, character consistency, and narrative coherence. Our hierarchical framework takes a step forward and provides new insights into fully automated movie generation. The code and project website are available at: this https URL and this https URL.

Title: Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

Authors: Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07334
Pdf URL: https://arxiv.org/pdf/2503.07334
Copy Paste: [[2503.07334]] Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment(https://arxiv.org/abs/2503.07334)
Keywords: generation
Abstract: We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks globally coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token, . This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or random initialization, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs like Chameleon and LlamaGen, all without framework modifications. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign -- not just architectural innovation -- can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.

Title: TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

Authors: Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07389
Pdf URL: https://arxiv.org/pdf/2503.07389
Copy Paste: [[2503.07389]] TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.07389)
Keywords: generation
Abstract: Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to help the model unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: this http URL. CAUTION: This paper includes model-generated content that may contain offensive material.

Title: PersonaBooth: Personalized Text-to-Motion Generation

Authors: Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, Hyung Jin Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07390
Pdf URL: https://arxiv.org/pdf/2503.07390
Copy Paste: [[2503.07390]] PersonaBooth: Personalized Text-to-Motion Generation(https://arxiv.org/abs/2503.07390)
Keywords: generation
Abstract: This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method for a pretrained motion diffusion model, called PersonaBooth. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.

Title: SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07392
Pdf URL: https://arxiv.org/pdf/2503.07392
Copy Paste: [[2503.07392]] SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models(https://arxiv.org/abs/2503.07392)
Keywords: generation
Abstract: Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. However, existing methods either require costly fine-tuning or degrade image quality for non-target concepts (i.e., prior) due to inherent optimization limitations. In this paper, we introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure. Specifically, SPEED incorporates Influence-based Prior Filtering (IPF) to retain the most affected non-target concepts during erasing, Directed Prior Augmentation (DPA) to expand prior coverage while maintaining semantic consistency, and Invariant Equality Constraints (IEC) to regularize model editing by explicitly preserving key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure, successfully removing 100 concepts within just 5 seconds. Our code and models are available at: this https URL.

Title: TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision

Authors: Shaobin Zhuang, Yiwei Guo, Yanbo Ding, Kunchang Li, Xinyuan Chen, Yaohui Wang, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07416
Pdf URL: https://arxiv.org/pdf/2503.07416
Copy Paste: [[2503.07416]] TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision(https://arxiv.org/abs/2503.07416)
Keywords: generation
Abstract: Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to the massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage the TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves state-of-the-art results on all these tasks, across various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
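The core/context expert selection described in the TimeStep Master abstract can be illustrated with a minimal Python sketch. It assumes equal-width timestep intervals and identifies each expert by a hypothetical pair (number of intervals, interval index); the function names and the partition sizes are illustrative assumptions, not the authors' implementation, and the time-dependent gating of context experts is left out.

```python
def interval_index(t, num_timesteps, num_intervals):
    """Index of the equal-width interval that contains timestep t."""
    if not 0 <= t < num_timesteps:
        raise ValueError("t must satisfy 0 <= t < num_timesteps")
    return t * num_intervals // num_timesteps


def select_experts(t, num_timesteps, interval_counts):
    """Pick one expert per partition scale for timestep t.

    Returns (core, contexts): the core expert comes from the finest
    partition (smallest interval); the context experts come from the
    coarser partitions. Each expert is (num_intervals, interval_index).
    """
    counts = sorted(interval_counts, reverse=True)  # finest partition first
    picks = [(n, interval_index(t, num_timesteps, n)) for n in counts]
    return picks[0], picks[1:]


# Hypothetical setup: 1000 timesteps split into 8, 4, 2 and 1 intervals.
core, contexts = select_experts(t=700, num_timesteps=1000, interval_counts=[1, 2, 4, 8])
print(core)      # (8, 5)
print(contexts)  # [(4, 2), (2, 1), (1, 0)]
```

In this sketch the core expert would be applied directly, while the outputs of the context experts would be weighted by a gate that depends on the timestep.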

Title: AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Authors: Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07418
Pdf URL: https://arxiv.org/pdf/2503.07418
Copy Paste: [[2503.07418]] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion(https://arxiv.org/abs/2503.07418)
Keywords: generation
Abstract: The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.
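The non-decreasing constraint on per-frame corruption timesteps in the AR-Diffusion abstract can be sketched in a few lines of Python. This is only an illustration of the constraint itself: drawing timesteps uniformly and sorting them is an assumption made here for simplicity, and it is not the FoPP or AD scheduler from the paper.

```python
import random


def sample_non_decreasing_timesteps(num_frames, max_timestep, rng=None):
    """Draw one corruption timestep per frame with t[0] <= t[1] <= ...

    Earlier frames therefore never carry more noise than later frames.
    """
    rng = rng or random.Random()
    # Independent uniform draws, then sorting, trivially satisfy the constraint.
    return sorted(rng.randint(0, max_timestep) for _ in range(num_frames))


def is_non_decreasing(timesteps):
    """Check that every frame's timestep is >= that of the frame before it."""
    return all(a <= b for a, b in zip(timesteps, timesteps[1:]))


timesteps = sample_non_decreasing_timesteps(num_frames=8, max_timestep=999, rng=random.Random(0))
print(timesteps, is_non_decreasing(timesteps))
```

Assigning the same timestep to every frame also satisfies the constraint, which is how a single formulation can cover the synchronous case as well as the asynchronous one.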

Title: Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

Authors: Dylan J. Foster, Zakaria Mhammedi, Dhruv Rohatgi
Subjects: cs.LG, cs.AI, cs.CL, math.ST
Abstract URL: https://arxiv.org/abs/2503.07453
Pdf URL: https://arxiv.org/pdf/2503.07453
Copy Paste: [[2503.07453]] Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration(https://arxiv.org/abs/2503.07453)
Keywords: generative
Abstract: Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.

Title: VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

Authors: Jiacheng Ruan, Wenzhen Yuan, Xian Gao, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Ting Liu, Yuzhuo Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07478
Pdf URL: https://arxiv.org/pdf/2503.07478
Copy Paste: [[2503.07478]] VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models(https://arxiv.org/abs/2503.07478)
Keywords: generation
Abstract: Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in 'Forecasting Future', a binary classification task, the advanced GPT-4o achieves only 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at this https URL.

Title: V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation

Authors: Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07493
Pdf URL: https://arxiv.org/pdf/2503.07493
Copy Paste: [[2503.07493]] V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation(https://arxiv.org/abs/2503.07493)
Keywords: generation
Abstract: We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each token represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. this https URL
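Illustration: The abstract describes each visual token as a soft categorical distribution over the LLM's vocabulary. One common way to feed such a distribution into a model is to take the probability-weighted average of the vocabulary embeddings. The sketch below shows only that idea, under stated assumptions: the function name, the array shapes, and the use of an expected embedding are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_token_embeddings(logits: np.ndarray, vocab_embeddings: np.ndarray):
    """Map per-token logits over a vocabulary to expected embeddings.

    logits:           (N, V) scores for N visual tokens over V vocabulary entries
    vocab_embeddings: (V, D) embedding table (stand-in for an LLM's table)
    Returns (embeddings of shape (N, D), probabilities of shape (N, V)).
    """
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Expected embedding under the soft categorical distribution.
    return probs @ vocab_embeddings, probs

rng = np.random.default_rng(0)
vocab = rng.standard_normal((5, 3))   # toy vocabulary: 5 entries, 3 dimensions
logits = np.zeros((2, 5))             # second token: uniform over the vocabulary
logits[0, 4] = 50.0                   # first token: almost one-hot on entry 4
embeddings, probs = soft_token_embeddings(logits, vocab)
```

In this toy setup the first token's embedding is essentially row 4 of the table, and the second token's embedding is the mean of all rows.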

Title: LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Authors: Clément Chadebec, Onur Tasar, Sanjeev Sreetharan, Benjamin Aubin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07535
Pdf URL: https://arxiv.org/pdf/2503.07535
Copy Paste: [[2503.07535]] LBM: Latent Bridge Matching for Fast Image-to-Image Translation(https://arxiv.org/abs/2503.07535)
Keywords: generation
Abstract: In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation. We provide an open-source implementation of the method at this https URL.
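Illustration: Bridge matching methods are usually built on a stochastic bridge that is pinned to a source sample at t = 0 and a target sample at t = 1. The sketch below samples the marginal of a standard Brownian bridge between two latent vectors. It is a generic textbook construction, not the authors' implementation: the function name, the `sigma` parameter, and the toy latents are assumptions for illustration.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * noise
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))   # zero at both endpoints
    return mean + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
source_latent = np.zeros(4)   # hypothetical latent of the input image
target_latent = np.ones(4)    # hypothetical latent of the translated image
start = brownian_bridge_sample(source_latent, target_latent, 0.0, rng=rng)
middle = brownian_bridge_sample(source_latent, target_latent, 0.5, rng=rng)
end = brownian_bridge_sample(source_latent, target_latent, 1.0, rng=rng)
```

The noise term vanishes at both endpoints, so `start` equals the source latent and `end` equals the target latent, while `middle` is a noisy point between them.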

Title: Inductive Moment Matching

Authors: Linqi Zhou, Stefano Ermon, Jiaming Song
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.07565
Pdf URL: https://arxiv.org/pdf/2503.07565
Copy Paste: [[2503.07565]] Inductive Moment Matching(https://arxiv.org/abs/2503.07565)
Keywords: generative
Abstract: Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Inductive Moment Matching (IMM), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, IMM does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, IMM guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. IMM surpasses diffusion models on ImageNet-256x256 with 1.99 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 1.98 on CIFAR-10 for a model trained from scratch.

Title: Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation

Authors: Tianyu Chen, Yasi Zhang, Zhendong Wang, Ying Nian Wu, Oscar Leong, Mingyuan Zhou
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07578
Pdf URL: https://arxiv.org/pdf/2503.07578
Copy Paste: [[2503.07578]] Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation(https://arxiv.org/abs/2503.07578)
Keywords: generation, generative
Abstract: Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn meaningful distributions from corrupted samples. This limitation restricts their applicability in scientific domains where clean data is scarce or costly to obtain. In this work, we introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD first pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs. While score distillation is traditionally viewed as a method to accelerate diffusion models, we show that it can also significantly enhance sample quality, particularly when starting from a degraded teacher model. Across varying noise levels and datasets, DSD consistently improves generative performance; we summarize our empirical evidence in Fig. 1. Furthermore, we provide theoretical insights showing that, in a linear model setting, DSD identifies the eigenspace of the clean data distribution's covariance matrix, implicitly regularizing the generator. This perspective reframes score distillation as not only a tool for efficiency but also a mechanism for improving generative models, particularly in low-quality data settings.

Title: Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

Authors: Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, Shao-Yuan Lo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07591
Pdf URL: https://arxiv.org/pdf/2503.07591
Copy Paste: [[2503.07591]] Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning(https://arxiv.org/abs/2503.07591)
Keywords: generation
Abstract: Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which prevents users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images within the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. The link to our project page: this https URL
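Illustration: The two steps named in the abstract (task-wise budgets, then clustering image features and keeping representative images) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not PreSel itself: the proportional budget rule, the plain k-means loop, the "nearest to centroid" selection, and all function and task names are hypothetical.

```python
import numpy as np

def task_budgets(importance: dict, total: int) -> dict:
    """Split a total image budget across tasks in proportion to importance.

    Shares are rounded down, so the budgets may sum to slightly less than total.
    """
    norm = sum(importance.values())
    return {task: int(total * weight / norm) for task, weight in importance.items()}

def select_representatives(features: np.ndarray, budget: int, iters: int = 20, seed: int = 0):
    """Cluster features with k-means (k = budget) and return, for each
    cluster, the index of the sample closest to its centroid."""
    k = min(budget, len(features))
    if k == 0:
        return []
    rng = np.random.default_rng(seed)
    features = features.astype(np.float64)
    # Initialise centroids from k distinct samples (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):               # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(dists.argmin(axis=0).tolist()))

budgets = task_budgets({"vqa": 3.0, "captioning": 1.0}, total=8)
toy_features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                         [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
picked = select_representatives(toy_features, budget=2)
```

With two well-separated groups and a budget of two, the selection keeps one image index from each group; instructions would then be generated only for the selected images.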

Title: HumanMM: Global Human Motion Recovery from Multi-shot Videos

Authors: Yuhong Zhang, Guanlin Wu, Ling-Hao Chen, Zhuokai Zhao, Jing Lin, Xiaoke Jiang, Jiamin Wu, Zhuoheng Li, Hao Frank Yang, Haoqian Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07597
Pdf URL: https://arxiv.org/pdf/2503.07597
Copy Paste: [[2503.07597]] HumanMM: Global Human Motion Recovery from Multi-shot Videos(https://arxiv.org/abs/2503.07597)
Keywords: generation
Abstract: In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are very challenging to recover due to the abrupt shot transitions, partial occlusions, and dynamic backgrounds present in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle these challenges by integrating enhanced camera pose estimation with Human Motion Recovery (HMR), incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our multi-shot dataset, created from public 3D human datasets, demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.

Title: VACE: All-in-One Video Creation and Editing

Authors: Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07598
Pdf URL: https://arxiv.org/pdf/2503.07598
Copy Paste: [[2503.07598]] VACE: All-in-One Video Creation and Editing(https://arxiv.org/abs/2503.07598)
Keywords: generation
Abstract: Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: this https URL.

Title: DreamRelation: Relation-Centric Video Customization

Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07602
Pdf URL: https://arxiv.org/pdf/2503.07602
Copy Paste: [[2503.07602]] DreamRelation: Relation-Centric Video Customization(https://arxiv.org/abs/2503.07602)
Keywords: generation
Abstract: Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of the relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce a space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.

Title: VoD: Learning Volume of Differences for Video-Based Deepfake Detection

Authors: Ying Xu, Marius Pedersen, Kiran Raja
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07607
Pdf URL: https://arxiv.org/pdf/2503.07607
Copy Paste: [[2503.07607]] VoD: Learning Volume of Differences for Video-Based Deepfake Detection(https://arxiv.org/abs/2503.07607)
Keywords: generative
Abstract: The rapid development of deep learning and generative AI technologies has profoundly transformed the digital content landscape, creating realistic Deepfakes that pose substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels on the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at this https URL.
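Illustration: The consecutive frame differences (CFD) mentioned in the abstract can be pictured as subtracting each frame from a later one along the time axis. The sketch below is a minimal version under stated assumptions: the function name, the (frames, height, width, channels) layout, and the `step` parameter (a stand-in for a sampling interval) are hypothetical and not taken from the VoD code.

```python
import numpy as np

def consecutive_frame_differences(frames: np.ndarray, step: int = 1) -> np.ndarray:
    """Return frames[t + step] - frames[t] for every valid t.

    frames: video clip of shape (T, H, W, C).
    The result has shape (T - step, H, W, C).
    """
    if frames.ndim != 4:
        raise ValueError("expected an array of shape (T, H, W, C)")
    if not 1 <= step < frames.shape[0]:
        raise ValueError("step must satisfy 1 <= step < number of frames")
    # Cast first so that uint8 pixel values do not wrap around on subtraction.
    frames = frames.astype(np.float32)
    return frames[step:] - frames[:-step]

video = np.zeros((4, 2, 2, 3), dtype=np.uint8)  # toy clip: 4 black frames
video[2] = 10                                   # only the third frame is brighter
cfd = consecutive_frame_differences(video)
```

For this toy clip the first difference is all zeros, the second is +10 everywhere, and the third is -10 everywhere, so the volume isolates exactly where consecutive frames disagree.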