2025-03-20

Title: Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition

Authors: Seyed Muhammad Hossein Mousavi
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14513
Pdf URL: https://arxiv.org/pdf/2503.14513
Copy Paste: [[2503.14513]] Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition(https://arxiv.org/abs/2503.14513)
Keywords: generation, generative
Abstract: In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person's emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), offer potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data while optimizing diversity and generation speed. By learning the topology of the skeletal structure, the NGN fits its neurons, or gas particles, to the body joints. The generated gas particles, which later form the skeletal structure, are used to synthesize a new body posture. Chaining body postures across frames yields the final synthetic body motion. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using benchmark metrics such as Fréchet Inception Distance (FID), Diversity, and a few more. Furthermore, we continued the evaluation using classification metrics such as accuracy, precision, recall, and a few others. Joint-related features or kinematic parameters were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data, and synthesizes it faster, than existing methods.
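
Illustrative sketch (not from the paper): the abstract describes fitting neural gas units to body joints. The snippet below shows the classic rank-based neural gas update on one made-up 2D pose. The joint coordinates, the function name fit_neural_gas, and all hyperparameters are assumptions chosen for illustration; the paper's actual pipeline, including how postures are chained across frames, may differ.

```python
import math
import random


def fit_neural_gas(points, n_units, n_steps=2000, eps=(0.5, 0.01), lam=(2.0, 0.1), seed=0):
    """Fit n_units prototype vectors to points with the classic neural gas rule.

    eps and lam are (initial, final) pairs for the learning rate and the
    neighbourhood range; both decay exponentially over n_steps.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialise every unit on a randomly chosen data point.
    units = [list(rng.choice(points)) for _ in range(n_units)]
    for t in range(n_steps):
        frac = t / max(n_steps - 1, 1)
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac
        x = rng.choice(points)
        # Rank units by distance to the sample; closer units move further.
        order = sorted(range(n_units), key=lambda i: math.dist(units[i], x))
        for rank, i in enumerate(order):
            step = eps_t * math.exp(-rank / lam_t)
            for d in range(dim):
                units[i][d] += step * (x[d] - units[i][d])
    return units


# Hypothetical 2D joint positions (head, torso, left hand, right hand, feet).
joints = [(0.0, 1.7), (0.0, 1.0), (-0.4, 1.3), (0.4, 1.3), (0.0, 0.0)]
particles = fit_neural_gas(joints, n_units=5)
for particle in particles:
    print([round(c, 3) for c in particle])
```

Because every update is a convex combination of a unit and a data point, the fitted particles always stay inside the bounding box of the joints.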

Title: Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Authors: Hejia Chen, Haoxian Zhang, Shoulong Zhang, Xiaoqiang Liu, Sisi Zhuang, Yuan Zhang, Pengfei Wan, Di Zhang, Shuai Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14517
Pdf URL: https://arxiv.org/pdf/2503.14517
Copy Paste: [[2503.14517]] Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control(https://arxiv.org/abs/2503.14517)
Keywords: generation
Abstract: A speech-driven 3D talking face method should offer both accurate lip synchronization and controllable expressions. Previous methods adopt only discrete emotion labels to control expressions globally throughout a sequence, which limits flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. However, the entanglement of multiple conditions makes it challenging to achieve satisfactory performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs) without degrading speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which lets the fine-grained conditions dominate. We also devise a mask-based classifier-free guidance (CFG) technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results show that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and that its fine-grained control is widely accepted in user studies. Project page: this https URL

Title: Salient Temporal Encoding for Dynamic Scene Graph Generation

Authors: Zhihao Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14524
Pdf URL: https://arxiv.org/pdf/2503.14524
Copy Paste: [[2503.14524]] Salient Temporal Encoding for Dynamic Scene Graph Generation(https://arxiv.org/abs/2503.14524)
Keywords: generation
Abstract: Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections encode meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to $4.4\%$ in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. In particular, applying our approach to action recognition yields a $0.6\%$ gain in mAP over the state of the art.

Title: Sampling Decisions

Authors: Michael Chertkov, Sungsoo Ahn, Hamidreza Behjoo
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2503.14549
Pdf URL: https://arxiv.org/pdf/2503.14549
Copy Paste: [[2503.14549]] Sampling Decisions(https://arxiv.org/abs/2503.14549)
Keywords: generative
Abstract: In this manuscript we introduce a novel Decision Flow (DF) framework for sampling from a target distribution while incorporating additional guidance from a prior sampler. DF can be viewed as an AI-driven algorithmic reincarnation of the Markov Decision Process (MDP) approach in Stochastic Optimal Control. It extends the continuous-space, continuous-time Path Integral Diffusion sampling technique to discrete time and space, while also generalizing the Generative Flow Network framework. In its most basic form, an explicit, Neural Network (NN)-free formulation, DF leverages the linear solvability of the underlying MDP to adjust the transition probabilities of the prior sampler. The resulting Markov Process is expressed as a convolution of the reverse-time Green's function of the prior sampling with the target distribution. We illustrate the DF framework through an example of sampling from the Ising model, discuss potential NN-based extensions, and outline how DF can enhance guided sampling across various applications.

Title: Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance

Authors: Liya Guo, Zun Wang, Chang Liu, Junzhe Li, Pipi Hu, Yi Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14569
Pdf URL: https://arxiv.org/pdf/2503.14569
Copy Paste: [[2503.14569]] Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance(https://arxiv.org/abs/2503.14569)
Keywords: generative
Abstract: The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.
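
Illustrative sketch (not from the paper): the abstract relies on the potential energy gradient and the Boltzmann distribution. For a Boltzmann density p(x) proportional to exp(-U(x)/kT), the score (gradient of log p) equals -grad U(x)/kT, which is why an energy gradient can guide a score-based model. The snippet evaluates this for the standard Lennard-Jones pair potential, treating the pair distance r as a one-dimensional coordinate. Function names and the reduced units (epsilon = sigma = kT = 1) are assumptions for illustration; this is not the PSM training objective.

```python
import math


def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_energy_grad(r, epsilon=1.0, sigma=1.0):
    """Analytic derivative dU/dr of the Lennard-Jones potential."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r


def boltzmann_score(r, kT=1.0, epsilon=1.0, sigma=1.0):
    """Score of p(r) proportional to exp(-U(r)/kT): d/dr log p = -U'(r)/kT."""
    return -lj_energy_grad(r, epsilon, sigma) / kT


r_min = 2.0 ** (1.0 / 6.0)  # location of the energy minimum for sigma = 1
for r in (0.9, r_min, 1.5):
    print(f"r={r:.4f}  U={lj_energy(r):+.4f}  score={boltzmann_score(r):+.4f}")
```

The score is positive below the energy minimum (pushing particles apart), zero at the minimum, and negative beyond it (pulling them together).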

Title: Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation

Authors: Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14572
Pdf URL: https://arxiv.org/pdf/2503.14572
Copy Paste: [[2503.14572]] Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation(https://arxiv.org/abs/2503.14572)
Keywords: generation
Abstract: The capacity of a foundation model allows for adaptation to new downstream tasks. Weight imprinting is a universal and efficient method to fulfill this purpose. It has been reinvented several times, but it has not been systematically studied. In this paper, we propose a framework for imprinting, identifying three main components: generation, normalization, and aggregation. This allows us to conduct an in-depth analysis of imprinting and a comparison of the existing work. We reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. We determine those proxies through clustering and propose a novel variant of imprinting that outperforms previous work. We motivate this by the neural collapse phenomenon -- an important connection that we can draw for the first time. Our results show an increase of up to 4% in challenging scenarios with complex data distributions for new classes.
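
Illustrative sketch (not from the paper): the abstract names three imprinting components (generation, normalization, aggregation) and the use of several cluster-derived proxies per class. The snippet below is one minimal reading of that idea: proxies are k-means centres of L2-normalized embeddings, and prediction takes the label of the most cosine-similar proxy. The toy 2D embeddings, the deterministic k-means initialisation, and the max-similarity aggregation are assumptions for illustration, not the authors' exact method.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # leave zero vectors unchanged
    return [x / norm for x in v]


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, n_iter=10):
    """Tiny k-means; initialised on the first k samples to stay deterministic."""
    centers = [list(v) for v in vectors[:k]]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: math.dist(v, centers[j]))
            groups[nearest].append(v)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def imprint(embeddings_by_class, k):
    """Generation + normalization: up to k normalized proxies per class."""
    proxies = []
    for label, embeddings in embeddings_by_class.items():
        normed = [l2_normalize(e) for e in embeddings]
        for center in kmeans(normed, min(k, len(normed))):
            proxies.append((l2_normalize(center), label))
    return proxies


def predict(proxies, embedding):
    """Aggregation (assumed): label of the proxy with highest cosine similarity."""
    q = l2_normalize(embedding)
    return max(proxies, key=lambda p: sum(a * b for a, b in zip(p[0], q)))[1]


# Toy data: class "a" has two opposite modes, class "b" has one.
toy = {
    "a": [[1.0, 0.1], [-1.0, 0.1], [1.0, -0.1], [-1.0, -0.1]],
    "b": [[0.1, 1.0], [-0.1, 1.0]],
}
multi_proxy = imprint(toy, k=2)
single_proxy = imprint(toy, k=1)
print(predict(multi_proxy, [-0.9, 0.1]), predict(single_proxy, [-0.9, 0.1]))
```

On this toy data, two proxies per class assign the query near the left-hand mode to class "a", whereas a single mean proxy for "a" collapses to the zero vector and the same query is assigned to "b".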

Title: A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising

Authors: Jonas Dornbusch, Emanuel Pfarr, Florin-Alexandru Vasluianu, Frank Werner, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14654
Pdf URL: https://arxiv.org/pdf/2503.14654
Copy Paste: [[2503.14654]] A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising(https://arxiv.org/abs/2503.14654)
Keywords: generative
Abstract: Diffusion models have garnered considerable interest in computer vision, owing both to their capacity to synthesize photorealistic images and to their proven effectiveness in image reconstruction tasks. However, existing approaches fail to efficiently balance the high visual quality of diffusion models with the low distortion achieved by previous image reconstruction methods. Specifically, for the fundamental task of additive Gaussian noise removal, we first illustrate an intuitive method for leveraging pretrained diffusion models. Further, we introduce our proposed Linear Combination Diffusion Denoiser (LCDD), which unifies two complementary inference procedures - one that leverages the model's generative potential and another that ensures faithful signal recovery. By exploiting the inherent structure of the denoising samples, LCDD achieves state-of-the-art performance and offers controlled, well-behaved trade-offs through a simple scalar hyperparameter adjustment.

Title: Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs

Authors: Liu Jing, Amirul Rahman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14674
Pdf URL: https://arxiv.org/pdf/2503.14674
Copy Paste: [[2503.14674]] Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs(https://arxiv.org/abs/2503.14674)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.

Title: ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints

Authors: Vihaan Misra, Peter Schaldenbrand, Jean Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14720
Pdf URL: https://arxiv.org/pdf/2503.14720
Copy Paste: [[2503.14720]] ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints(https://arxiv.org/abs/2503.14720)
Keywords: generation
Abstract: While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.

Title: Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Authors: Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14830
Pdf URL: https://arxiv.org/pdf/2503.14830
Copy Paste: [[2503.14830]] Decompositional Neural Scene Reconstruction with Generative Diffusion Prior(https://arxiv.org/abs/2503.14830)
Keywords: generative
Abstract: Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating the diffusion prior raises potential conflicts between the reconstruction and the generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. The project page is available at this https URL.

Title: SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments

Authors: Yinqi Chen, Meiying Zhang, Qi Hao, Guang Zhou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14837
Pdf URL: https://arxiv.org/pdf/2503.14837
Copy Paste: [[2503.14837]] SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments(https://arxiv.org/abs/2503.14837)
Keywords: generation
Abstract: Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a multi-task scheme based on coarse-to-fine prediction, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions that enhance the performance of scene flow estimation and instance segmentation while helping ensure the spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.

Title: LogLLaMA: Transformer-based log anomaly detection with LLaMA

Authors: Zhuoyi Yang, Ian G. Harris
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14849
Pdf URL: https://arxiv.org/pdf/2503.14849
Copy Paste: [[2503.14849]] LogLLaMA: Transformer-based log anomaly detection with LLaMA(https://arxiv.org/abs/2503.14849)
Keywords: generative
Abstract: Log anomaly detection refers to the task that distinguishes the anomalous log messages from normal log messages. Transformer-based large language models (LLMs) are becoming popular for log anomaly detection because of their superb ability to understand complex and long language patterns. In this paper, we propose LogLLaMA, a novel framework that leverages LLaMA2. LogLLaMA is first finetuned on normal log messages from three large-scale datasets to learn their patterns. After finetuning, the model is capable of generating successive log messages given previous log messages. Our generative model is further trained to identify anomalous log messages using reinforcement learning (RL). The experimental results show that LogLLaMA outperforms the state-of-the-art approaches for anomaly detection on BGL, Thunderbird, and HDFS datasets.

Title: Temporal-Consistent Video Restoration with Pre-trained Diffusion Models

Authors: Hengkang Wang, Yang Liu, Huidong Liu, Chien-Chih Wang, Yanhui Guo, Hongdong Li, Bryan Wang, Ju Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14863
Pdf URL: https://arxiv.org/pdf/2503.14863
Copy Paste: [[2503.14863]] Temporal-Consistent Video Restoration with Pre-trained Diffusion Models(https://arxiv.org/abs/2503.14863)
Keywords: restoration
Abstract: Video restoration (VR) aims to recover high-quality videos from degraded ones. Although recent zero-shot VR methods using pre-trained diffusion models (DMs) show promise, they suffer from approximation errors during reverse diffusion and insufficient temporal consistency. Moreover, because it deals with 3D video data, VR is inherently computationally intensive. In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. We also introduce strategies to promote bilevel temporal consistency: semantic consistency by leveraging clustering structures in the seed space, and pixel-level consistency by progressive warping with optical flow refinements. Extensive experiments on multiple video restoration tasks demonstrate the superior visual quality and temporal consistency achieved by our method compared to the state of the art.

Title: Efficient Personalization of Quantized Diffusion Model without Backpropagation

Authors: Hoigi Seo, Wongi Jeong, Kyungryeol Lee, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14868
Pdf URL: https://arxiv.org/pdf/2503.14868
Copy Paste: [[2503.14868]] Efficient Personalization of Quantized Diffusion Model without Backpropagation(https://arxiv.org/abs/2503.14868)
Keywords: generation
Abstract: Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large amounts of memory, possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization, which often must run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by applying zeroth-order optimization to the personalization tokens without dequantization, so that it does not require the gradient and activation storage for backpropagation, which consumes considerable memory. Since a gradient estimate obtained with zeroth-order optimization is quite noisy for the single image or few images used in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace constructed from the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of text embedding in image generation, leading to our proposed timestep sampling scheme, dubbed Partial Uniform Timestep Sampling, which samples from effective diffusion timesteps. Our method achieves performance comparable to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes, while reducing training memory demand by up to $8.2\times$.
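
Illustrative sketch (not from the paper): the abstract combines a zeroth-order gradient estimate, which needs only forward evaluations, with a projection onto a subspace built from past token values. The snippet shows a generic two-point random-direction estimator and a Gram-Schmidt projection. The function names, the quadratic stand-in loss, and the way the subspace is built from a history list are assumptions for illustration; the paper's Subspace Gradient construction may differ.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def zo_gradient(f, x, mu=1e-3, rng=random):
    """Two-point zeroth-order estimate: directional derivative along a random u, times u."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    plus = f([xi + mu * ui for xi, ui in zip(x, u)])
    minus = f([xi - mu * ui for xi, ui in zip(x, u)])
    scale = (plus - minus) / (2.0 * mu)
    return [scale * ui for ui in u]


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project(g, basis):
    """Orthogonal projection of g onto the span of an orthonormal basis."""
    out = [0.0] * len(g)
    for b in basis:
        c = dot(g, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def loss(x):
    """Stand-in objective (sum of squares); a real loss would be a model forward pass."""
    return sum(v * v for v in x)


token = [1.0, -2.0, 0.5]
history = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # hypothetical past token values
noisy = zo_gradient(loss, token, rng=random.Random(0))
denoised = project(noisy, orthonormal_basis(history))
print(noisy, denoised)
```

The projection discards the component of the noisy estimate that lies outside the span of the history vectors; here that is the third coordinate.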

Title: DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Authors: Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar Jr., Xiangyang Ji, Xu-Cheng Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14880
Pdf URL: https://arxiv.org/pdf/2503.14880
Copy Paste: [[2503.14880]] DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework(https://arxiv.org/abs/2503.14880)
Keywords: restoration
Abstract: Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.

Title: Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers

Authors: Bo Chen, Xiaoyu Li, Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14881
Pdf URL: https://arxiv.org/pdf/2503.14881
Copy Paste: [[2503.14881]] Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers(https://arxiv.org/abs/2503.14881)
Keywords: generation
Abstract: A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
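To get a feel for the stated $\Omega(n^2 d)$ bound, the short sketch below evaluates the term $n^2 d$ for a few sequence lengths. It only illustrates how the bound scales: the constant hidden by $\Omega$ is omitted, and the function name `bound_term` and the value `d = 64` are assumptions made for the example, not figures from the paper.

```python
def bound_term(n: int, d: int) -> int:
    """Evaluate n^2 * d, the term inside Omega(n^2 d); the hidden constant is omitted."""
    return n * n * d


d = 64  # hypothetical embedding dimension
for n in (1024, 2048, 4096):
    print(n, bound_term(n, d))

# Doubling the number of generated tokens quadruples the term.
ratio = bound_term(2048, d) / bound_term(1024, d)
print(ratio)  # 4.0
```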

Title: GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Authors: Junyu Shi, Lijiang Liu, Yong Sun, Zhiyuan Zhang, Jinni Zhou, Qiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14919
Pdf URL: https://arxiv.org/pdf/2503.14919
Copy Paste: [[2503.14919]] GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation(https://arxiv.org/abs/2503.14919)
Keywords: generation, generative
Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose the Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment through the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment them with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing prior methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
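The "discrete motion representation" that a VQ-VAE learns rests on a nearest-neighbour lookup into a codebook. The sketch below shows that generic vector-quantization step only, under stated assumptions: it is not the multi-expert MEVQ-VAE from the abstract, and the function name `quantize`, the 2-D codebook, and the latent values are toy choices for illustration.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared Euclidean distance)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]  # toy 2-D codebook
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]  # toy encoder outputs
tokens = quantize(latents, codebook)
print(tokens)  # [0, 1, 2]
```

The resulting integer indices are the discrete tokens that a downstream transformer can model; how the experts share or specialize codebooks across datasets is not specified in the abstract.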

Title: Shushing! Let's Imagine an Authentic Speech from the Silent Video

Authors: Jiaxin Ye, Hongming Shan
Subjects: cs.CV, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.14928
Pdf URL: https://arxiv.org/pdf/2503.14928
Copy Paste: [[2503.14928]] Shushing! Let's Imagine an Authentic Speech from the Silent Video(https://arxiv.org/abs/2503.14928)
Keywords: generation
Abstract: Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines. Demos are shown at our project page: this https URL.

Title: 3D Engine-ready Photorealistic Avatars via Dynamic Textures

Authors: Yifan Wang, Ivan Molodetskikh, Ondrej Texler, Dimitar Dinev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14943
Pdf URL: https://arxiv.org/pdf/2503.14943
Copy Paste: [[2503.14943]] 3D Engine-ready Photorealistic Avatars via Dynamic Textures(https://arxiv.org/abs/2503.14943)
Keywords: generation
Abstract: As the digital and physical worlds become more intertwined, there has been a lot of interest in digital avatars that closely resemble their real-world counterparts. Current digitization methods used in 3D production pipelines require costly capture setups, making them impractical for mass usage among common consumers. Recent academic literature has found success in reconstructing humans from limited data using implicit representations (e.g., voxels used in NeRFs), which are able to produce impressive videos. However, these methods are incompatible with traditional rendering pipelines, making it difficult to use them in applications such as games. In this work, we propose an end-to-end pipeline that builds explicitly-represented photorealistic 3D avatars using standard 3D assets. Our key idea is the use of dynamically-generated textures to enhance the realism and visually mask deficiencies in the underlying mesh geometry. This allows for seamless integration with current graphics pipelines while achieving comparable visual quality to state-of-the-art 3D avatar generation methods.

Title: MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Authors: Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14944
Pdf URL: https://arxiv.org/pdf/2503.14944
Copy Paste: [[2503.14944]] MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance(https://arxiv.org/abs/2503.14944)
Keywords: restoration
Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at this https URL.

Title: Generating Multimodal Driving Scenes via Next-Scene Prediction

Authors: Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14945
Pdf URL: https://arxiv.org/pdf/2503.14945
Copy Paste: [[2503.14945]] Generating Multimodal Driving Scenes via Next-Scene Prediction(https://arxiv.org/abs/2503.14945)
Keywords: generation, generative
Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by capturing only a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.

Title: Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Authors: Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14957
Pdf URL: https://arxiv.org/pdf/2503.14957
Copy Paste: [[2503.14957]] Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering(https://arxiv.org/abs/2503.14957)
Keywords: generation
Abstract: This paper introduces a new video question-answering (VQA) dataset that challenges models to leverage procedural knowledge for complex reasoning. It requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning. To address this, we propose a neuro-symbolic reasoning module that integrates neural networks and LLM-driven constrained reasoning over variables for interpretable answer generation. Results show that combining LLMs with structured, logic-based knowledge reasoning enhances procedural reasoning on the STAR benchmark and our dataset. Code and dataset will be available at this https URL soon.

Title: Universal Scene Graph Generation

Authors: Shengqiong Wu, Hao Fei, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15005
Pdf URL: https://arxiv.org/pdf/2503.15005
Copy Paste: [[2503.15005]] Universal Scene Graph Generation(https://arxiv.org/abs/2503.15005)
Keywords: generation
Abstract: Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges. We design USG-Par with a modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and that USG-Par achieves higher efficacy and performance.

Title: Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training

Authors: Yunwei Lan, Zhigao Cui, Chang Liu, Jialun Peng, Nian Wang, Xin Luo, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15017
Pdf URL: https://arxiv.org/pdf/2503.15017
Copy Paste: [[2503.15017]] Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training(https://arxiv.org/abs/2503.15017)
Keywords: generative
Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world priors. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit the diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion priors as bijective mapping learners within CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistical information about real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at this https URL.

Title: Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Authors: Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15019
Pdf URL: https://arxiv.org/pdf/2503.15019
Copy Paste: [[2503.15019]] Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene(https://arxiv.org/abs/2503.15019)
Keywords: generation
Abstract: The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides the most advanced representation to date for comprehensively modeling the dynamic 4D visual world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; in addition, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that our method outperforms baseline models by a large margin, highlighting its effectiveness.

Title: Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence

Authors: Satyajeet Sahoo, Jhareswar Maiti, Virendra Kumar Tewari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15036
Pdf URL: https://arxiv.org/pdf/2503.15036
Copy Paste: [[2503.15036]] Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence(https://arxiv.org/abs/2503.15036)
Keywords: generative
Abstract: An important aspect of text mining is information retrieval in the form of discovering semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) elegantly model topics as probability distributions and are useful in identifying latent topics from large document corpora with minimal supervision, they suffer from difficulty in topic interpretability and reduced performance on shorter texts. Here we propose a novel Multivariate Gaussian Topic modelling (MGD) approach. In this approach, topics are represented as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models. Using the EM algorithm, the constituent Multivariate Gaussian Distributions and their corresponding parameters are identified. Analysis of the parameters helps identify the keywords with the highest variance and mean contributions to the topic, and topic annotations are derived from these keywords. This approach is first applied to a synthetic dataset to demonstrate the interpretability benefits vis-à-vis LDA. A real-world application of this topic model is demonstrated in an analysis of risks and hazards at a petrochemical plant, applying the model to safety incident reports to identify the major latent hazards plaguing the plant. This model achieves a higher mean topic coherence of 0.436, vis-à-vis 0.294 for LDA.
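The EM fitting step mentioned in the abstract can be illustrated with a deliberately simplified sketch. The paper fits multivariate Gaussians to documents; the code below fits a one-dimensional, two-component mixture to toy numbers, so it shows only the shape of the E-step and M-step. The function names (`gaussian_pdf`, `em_gmm_1d`), the data, and the initial parameters are all assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM; returns (means, variances, weights)."""
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[j] = nj / len(data)
    return means, variances, weights


# Toy data drawn around -2 and 3 (hypothetical values).
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.9, 3.1, 3.0, 3.2, 2.8]
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

In the topic-modelling setting described above, each fitted component would play the role of a topic, and inspecting its mean and variance parameters is what the abstract describes as identifying the keywords that contribute most.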

Title: Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation

Authors: Suhyeon Lee, Kwanyoung Kim, Jong Chul Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15056
Pdf URL: https://arxiv.org/pdf/2503.15056
Copy Paste: [[2503.15056]] Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation(https://arxiv.org/abs/2503.15056)
Keywords: generation
Abstract: Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) an adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at this https URL.

Title: Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Authors: Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15060
Pdf URL: https://arxiv.org/pdf/2503.15060
Copy Paste: [[2503.15060]] Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis(https://arxiv.org/abs/2503.15060)
Keywords: generation, generative
Abstract: While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.

Title: DeCaFlow: A Deconfounding Causal Generative Model

Authors: Alejandro Almodóvar, Adrián Javaloy, Juan Parras, Santiago Zazo, Isabel Valera
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15114
Pdf URL: https://arxiv.org/pdf/2503.15114
Copy Paste: [[2503.15114]] DeCaFlow: A Deconfounding Causal Generative Model(https://arxiv.org/abs/2503.15114)
Keywords: generative
Abstract: Causal generative models (CGMs) have recently emerged as capable approaches to simulate the causal mechanisms generating our observations, enabling causal inference. Unfortunately, existing approaches either are overly restrictive, assuming the absence of hidden confounders, or lack generality, being tailored to a particular query and graph. In this work, we introduce DeCaFlow, a CGM that accounts for hidden confounders in a single amortized training process using only observational data and the causal graph. Importantly, DeCaFlow can provably identify all causal queries with a valid adjustment set or sufficiently informative proxy variables. Remarkably, for the first time to our knowledge, we show that a confounded counterfactual query is identifiable, and thus solvable by DeCaFlow, as long as its interventional counterpart is as well. Our empirical results on diverse settings (including the Ecoli70 dataset, with 3 independent hidden confounders, tens of observed variables and hundreds of causal queries) show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box flexibility.

Title: VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15138
Pdf URL: https://arxiv.org/pdf/2503.15138
Copy Paste: [[2503.15138]] VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention(https://arxiv.org/abs/2503.15138)
Keywords: generation
Abstract: Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.

Title: Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Authors: Feifei Li, Mi Zhang, Yiming Sun, Min Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15197
Pdf URL: https://arxiv.org/pdf/2503.15197
Copy Paste: [[2503.15197]] Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization(https://arxiv.org/abs/2503.15197)
Keywords: generation
Abstract: Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.
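Summary: Text-to-image diffusion models achieve state-of-the-art synthesis results but raise concerns about misuse for harmful content. Existing post-hoc interventions such as concept unlearning and safety guidance act in an uninterpretable way and distort the sampling trajectory on complex multi-concept prompts. Detect-and-Guide (DAG) uses the diffusion model's internal knowledge for self-diagnosis and fine-grained self-regulation during sampling: it detects harmful concepts in noisy latents from refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions. The optimization needs only a small annotated dataset, and no fine-tuning of the diffusion model is required, so generation diversity is preserved. Experiments on erasing sexual content show state-of-the-art safe generation that balances harm mitigation with text-following on multi-concept real-world prompts.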

Title: DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Authors: Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15208
Pdf URL: https://arxiv.org/pdf/2503.15208
Copy Paste: [[2503.15208]] DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation(https://arxiv.org/abs/2503.15208)
Keywords: generation, generative
Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both accurate reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
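Summary: Current generative models struggle to synthesize dynamic 4D driving scenes that support both temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. DiST-4D is the first disentangled spatiotemporal diffusion framework for 4D driving scene generation and uses metric depth as its core geometric representation. It splits the problem into two diffusion processes: DiST-T predicts future metric depth and multi-view RGB sequences from past observations, and DiST-S enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. The cycle consistency mechanism adds a forward-backward rendering constraint that narrows the generalization gap between observed and unseen viewpoints. DiST-4D achieves state-of-the-art results in temporal prediction and NVS and is competitive in planning-related evaluations.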

Title: LEGION: Learning to Ground and Explain for Synthetic Image Detection

Authors: Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, Conghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15264
Pdf URL: https://arxiv.org/pdf/2503.15264
Copy Paste: [[2503.15264]] LEGION: Learning to Ground and Explain for Synthetic Image Detection(https://arxiv.org/abs/2503.15264)
Keywords: generation, generative
Abstract: The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.
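Summary: Advances in generative technology are a double-edged sword, offering powerful tools while raising social concerns. Current synthetic image detection methods often lack artifact-level textual interpretability and focus too heavily on image manipulation detection, and current datasets suffer from outdated generators and coarse annotations. SynthScars is a dataset of 12,236 fully synthetic images with human-expert annotations, covering 4 image content types, 3 artifact categories, pixel-level segmentation, detailed textual explanations, and artifact category labels. LEGION is a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation, and can also act as a controller in image refinement pipelines. LEGION surpasses the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score, and images refined under its guidance align better with human preferences. The code, model, and dataset will be released.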
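The LEGION abstract reports segmentation quality as mIoU and F1. As a reminder of what those numbers measure, here is a minimal sketch for binary artifact masks flattened to lists of 0/1 values. The function names are hypothetical, and averaging IoU per image is an assumption; the paper's exact evaluation protocol is not given in the abstract.

```python
def iou(pred, target):
    """Intersection over union of two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return 1.0 if union == 0 else intersection / union


def f1_score(pred, target):
    """Pixel-level F1 (harmonic mean of precision and recall)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denominator = 2 * tp + fp + fn
    return 1.0 if denominator == 0 else 2 * tp / denominator


def mean_iou(mask_pairs):
    """Average IoU over (pred, target) pairs (assumed per-image averaging)."""
    return sum(iou(p, t) for p, t in mask_pairs) / len(mask_pairs)
```

For example, `iou([1, 1, 0, 0], [1, 0, 1, 0])` counts one shared pixel out of three marked pixels.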

Title: DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Authors: Ruowen Zhao, Junliang Ye, Zhengyi Wang, Guangce Liu, Yiwen Chen, Yikai Wang, Jun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15265
Pdf URL: https://arxiv.org/pdf/2503.15265
Copy Paste: [[2503.15265]] DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning(https://arxiv.org/abs/2503.15265)
Keywords: generation
Abstract: Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: this https URL
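Summary: Triangle meshes are central to 3D applications because they can be manipulated and rendered efficiently. Auto-regressive methods generate structured meshes by predicting discrete vertex tokens but are often limited by low face counts and incomplete meshes. DeepMesh optimizes mesh generation through (1) an efficient pre-training strategy with a novel tokenization algorithm and improved data curation and processing, and (2) Reinforcement Learning (RL) for 3D mesh generation, aligning outputs with human preferences via Direct Preference Optimization (DPO). A scoring standard combining human evaluation with 3D metrics is used to collect preference pairs for DPO. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in precision and quality. Project page: this https URL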
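The DeepMesh abstract aligns the mesh generator with human preferences through Direct Preference Optimization. The sketch below shows the standard DPO loss for a single preference pair, assuming the summed log-probabilities of the preferred and rejected token sequences are already available from the policy and from a frozen reference model. The function name, argument names, and `beta` value are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequences."""
    # How much more the policy favors each sample than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree, the loss equals log 2; it drops below that as the policy shifts probability toward the preferred mesh relative to the reference.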

Title: TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models

Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu, Tzu-Ling Lin, Hong-Han Shuai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15283
Pdf URL: https://arxiv.org/pdf/2503.15283
Copy Paste: [[2503.15283]] TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models(https://arxiv.org/abs/2503.15283)
Keywords: generation
Abstract: Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
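Summary: Text-and-Image-To-Image (TI2I) extends Text-To-Image (T2I) by combining image inputs with textual instructions. Existing methods often use image inputs only partially, focusing on objects or styles, or lose generation quality on complex multi-image instructions. Training-Free Text-and-Image-to-Image (TF-TI2I) adapts cutting-edge T2I models such as SD3 without additional training. It builds on the MM-DiT architecture, in which textual tokens can implicitly learn visual information from vision tokens, and strengthens this interaction by extracting a condensed visual representation from reference images and sharing information selectively through Reference Contextual Masking, which restricts contextual tokens to instruction-relevant visual information. A Winner-Takes-All module mitigates distribution shift by prioritizing the most pertinent reference for each vision token. The authors also introduce FG-TI2I Bench, a benchmark tailored to TI2I and compatible with existing T2I methods, and report robust performance across benchmarks.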

Title: Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

Authors: Fereshteh Forghani, Jason J. Yu, Tristan Aumentado-Armstrong, Konstantinos G. Derpanis, Marcus A. Brubaker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15412
Pdf URL: https://arxiv.org/pdf/2503.15412
Copy Paste: [[2503.15412]] Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis(https://arxiv.org/abs/2503.15412)
Keywords: generative
Abstract: Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.
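Summary: Conventional depth-free multi-view datasets are captured with a moving monocular camera without metric calibration, so the scale of camera positions is ambiguous. Earlier methods handled this with ad-hoc normalization steps but did not directly analyze the effect of incorrect scene scales. This paper studies scale ambiguity when training generative novel view synthesis (GNVS) methods, where new views are only minimally constrained by a single image and scene scale acts as a nuisance variable. The authors isolate the effect of scale ambiguity on the resulting models, define new metrics for the scale inconsistency of generated views, and propose a framework that estimates scene scales jointly with the GNVS model end to end. The method reduces scale inconsistency without the complexity or downsides of earlier normalization methods, and removing the ambiguity also improves generated image quality.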
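The scale ambiguity described in the abstract above follows directly from pinhole projection: multiplying every scene point and every camera position by the same factor leaves the image unchanged. The sketch below illustrates this for a camera with identity rotation; the helper names and the simplified camera model are assumptions made for illustration only.

```python
def project(point, camera_position, focal_length=1.0):
    """Pinhole projection of a 3D point for a camera with identity rotation."""
    x = point[0] - camera_position[0]
    y = point[1] - camera_position[1]
    z = point[2] - camera_position[2]  # must be positive (point in front of camera)
    return (focal_length * x / z, focal_length * y / z)


def scale(vector, factor):
    """Multiply every coordinate of a 3D vector by the same factor."""
    return tuple(factor * c for c in vector)


point = (1.0, 2.0, 10.0)
camera = (0.5, 0.0, 0.0)

original = project(point, camera)
# Scale the whole scene, camera included, by a factor of 3.
rescaled = project(scale(point, 3.0), scale(camera, 3.0))

# Both projections coincide up to floating-point error, so images alone
# cannot reveal the true scale of the scene or of the camera motion.
```

Because the factor cancels in `x / z` and `y / z`, a model trained on such data sees camera translations whose magnitudes are only defined up to an unknown per-scene scale.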

Title: Temporal Regularization Makes Your Video Generator Stronger

Authors: Harold Haodong Chen, Haojian Huang, Xianfeng Wu, Yexin Liu, Yajing Bai, Wen-Jie Shu, Harry Yang, Ser-Nam Lim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15417
Pdf URL: https://arxiv.org/pdf/2503.15417
Copy Paste: [[2503.15417]] Temporal Regularization Makes Your Video Generator Stronger(https://arxiv.org/abs/2503.15417)
Keywords: generation
Abstract: Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time, and introduce FluxFlow for initial investigation, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
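Summary: Temporal quality is critical to video generation because it ensures consistent motion and realistic dynamics across frames, yet high temporal coherence and diversity remain hard to achieve. This work explores temporal augmentation for video generation for the first time and introduces FluxFlow, a strategy that operates at the data level and applies controlled temporal perturbations without architectural changes. Experiments on UCF-101 and VBench show that FluxFlow significantly improves temporal coherence and diversity across U-Net, DiT, and AR-based video generation models while preserving spatial fidelity. The results point to temporal augmentation as a simple and effective way to improve video generation quality.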
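The abstract above does not spell out which perturbations FluxFlow applies. As an illustration of what a data-level temporal perturbation can look like, the sketch below swaps a few nearby frames in a training clip. The function name, its parameters, and the choice of local frame swaps are all assumptions and should not be read as the paper's method.

```python
import random


def perturb_frame_order(frames, num_swaps=1, max_offset=2, seed=None):
    """Return a copy of the clip with a few nearby frames swapped.

    Hypothetical data-level augmentation: the model architecture is
    untouched and only the temporal order of the training clip changes.
    """
    rng = random.Random(seed)
    out = list(frames)
    if len(out) < 2:
        return out
    for _ in range(num_swaps):
        i = rng.randrange(len(out))
        # Pick a partner index at most max_offset positions away from i.
        low = max(0, i - max_offset)
        high = min(len(out) - 1, i + max_offset)
        j = rng.randint(low, high)
        out[i], out[j] = out[j], out[i]
    return out


clip = list(range(16))  # stand-in for 16 decoded video frames
augmented = perturb_frame_order(clip, num_swaps=3, seed=0)
```

The input list is left unmodified, so the same clip can be perturbed differently on each epoch.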

Title: LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Authors: Amirhossein Kazerouni, Soroush Mehraban, Michael Brudno, Babak Taati
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.15420
Pdf URL: https://arxiv.org/pdf/2503.15420
Copy Paste: [[2503.15420]] LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding(https://arxiv.org/abs/2503.15420)
Keywords: generative
Abstract: Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.
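Summary: Implicit Neural Representations (INRs) are a powerful paradigm for unifying task modeling across data domains, offering memory efficiency and resolution independence, whereas conventional deep learning models are usually modality-dependent. Existing INR frameworks often rely on global latent vectors or are computationally inefficient. LIFT is a high-performance framework that captures multiscale information through meta-learning, using multiple parallel localized implicit functions together with a hierarchical latent generator to produce unified latent representations spanning local, intermediate, and global features. ReLIFT is an enhanced variant that adds residual connections and expressive frequency encodings, addressing the convergence-capacity gap found in comparable methods. LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification with notably lower computational cost, and ReLIFT is effective for signal representation and inverse problems in single-task settings.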

Title: MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Authors: Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15451
Pdf URL: https://arxiv.org/pdf/2503.15451
Copy Paste: [[2503.15451]] MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space(https://arxiv.org/abs/2503.15451)
Keywords: generation
Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: this https URL
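Summary: This paper addresses text-conditioned streaming motion generation, which requires predicting the next human pose from variable-length historical motions and incoming texts. Diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation caused by discretized non-causal tokenization. MotionStreamer incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents reduce the information loss caused by discretization and limit error accumulation during long-term autoregressive generation, and temporal causal dependencies between current and historical motion latents allow accurate online motion decoding. Experiments show that the method outperforms existing approaches and supports further applications, including multi-round generation, long-term generation, and dynamic motion composition. Project page: this https URL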

Title: Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15457
Pdf URL: https://arxiv.org/pdf/2503.15457
Copy Paste: [[2503.15457]] Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator(https://arxiv.org/abs/2503.15457)
Keywords: generation, generative
Abstract: Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
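Summary: Masked Diffusion Models (MDMs) are a powerful generative modeling technique but suffer from slow multi-step inference. Di$\mathtt{[M]}$O distills masked diffusion models into a one-step generator and addresses two challenges: (1) the intractability of using intermediate-step information for one-step generation, solved through token-level distribution matching that optimizes model output logits in an on-policy framework with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, addressed by a token initialization strategy that injects randomness while staying close to the teacher's training distribution. On class-conditional and text-conditional image generation, Di$\mathtt{[M]}$O is competitive with multi-step teacher outputs while drastically reducing inference time. The authors state that this is the first successful one-step distillation of masked diffusion models and the first application of discrete distillation to text-to-image generation.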

Title: FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Authors: Ruichen Chen, Keith G. Mills, Di Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15465
Pdf URL: https://arxiv.org/pdf/2503.15465
Copy Paste: [[2503.15465]] FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers(https://arxiv.org/abs/2503.15465)
Keywords: generation
Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinder practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
Summary: Diffusion Models (DMs) have transformed text-to-image generation, but their computational cost and model footprint hinder deployment, especially on edge devices. Post-training quantization (PTQ) eases these burdens without training or fine-tuning. Recent DM PTQ methods reach W4A8 with integer quantization, but two limitations remain. First, most are evaluated on classical U-Net models such as Stable Diffusion XL, 1.5 or earlier, while newer Diffusion Transformer (DiT) models such as the PixArt series and Hunyuan use fundamentally different transformer backbones. Second, integer quantization does not align well with weight and activation distributions, and Floating-Point Quantization (FPQ) remains under-investigated. FP4DiT is a PTQ method that uses FPQ to achieve W4A6 quantization. It extends the Adaptive Rounding PTQ technique to calibrate FPQ weight quantization and shows that DiT activations depend on input patch data, which calls for robust online activation quantization. FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan according to T2I metrics such as HPSv2 and CLIP.
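To make the contrast between integer and floating-point 4-bit grids concrete, the sketch below rounds weights to a 4-bit floating-point grid. It assumes the common E2M1 layout (one sign bit, two exponent bits, one mantissa bit), a single per-tensor scale, and plain round-to-nearest. The abstract does not state which FP4 format FP4DiT uses, and the paper calibrates rounding with an extension of Adaptive Rounding rather than round-to-nearest, so treat this only as a simplified illustration.

```python
import math

# Non-negative magnitudes representable in E2M1 (assumed format).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(weights):
    """Round each weight to the nearest scaled FP4 value.

    Returns the dequantized weights and the per-tensor scale.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 1.0
    # Map the largest magnitude onto the largest representable value.
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    quantized = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        quantized.append(math.copysign(nearest, w) * scale)
    return quantized, scale
```

Unlike a uniform integer grid, the representable values are denser near zero and sparser at large magnitudes, which is the property the abstract credits for better matching weight and activation distributions.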

Title: Toward task-driven satellite image super-resolution

Authors: Maciej Ziaja, Pawel Kowaleczko, Daniel Kostrzewa, Nicolas Longépé, Michal Kawulok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15474
Pdf URL: https://arxiv.org/pdf/2503.15474
Copy Paste: [[2503.15474]] Toward task-driven satellite image super-resolution(https://arxiv.org/abs/2503.15474)
Keywords: super-resolution
Abstract: Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned by deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing the existing models that perform computer vision tasks in terms of whether they can be used for evaluating super-resolution reconstruction algorithms, as well as training them in a task-driven way. We support our analysis with an experimental study and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.
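Summary: Super-resolution reconstructs high-resolution images from low-resolution observations, and deep learning approaches now produce images of high perceptual quality. It is often unclear, however, whether the reconstructed details match the ground truth and whether they are a more valuable input for image analysis algorithms. This work addresses the latter question by learning super-resolution algorithms in a task-driven way, so that the outputs suit automated image analysis. The authors propose a methodology for assessing whether existing computer vision models can be used to evaluate super-resolution reconstruction algorithms and to train them in a task-driven way. The analysis is backed by an experimental study and is intended as a basis for selecting computer vision tasks that advance real-world super-resolution.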

Title: Cube: A Roblox View of 3D Intelligence

Authors: Foundation AI Team Roblox: Kiran Bhat, Nishchaie Khanna, Karun Channa, Tinghui Zhou, Yiheng Zhu, Xiaoxia Sun, Charles Shang, Anirudh Sudarshan, Maurice Chu, Daiqing Li, Kangle Deng, Jean-Philippe Fauconnier, Tijmen Verhulsdonck, Maneesh Agrawala, Kayvon Fatahalian, Alexander Weiss, Christian Reiser, Ravi Kiran Chirravuri, Ravali Kandur, Alejandro Pelaez, Akash Garg, Michael Palleschi, Jessica Wang, Skylar Litz, Leon Liu, Anying Li, David Harmon, Derek Liu, Liangjun Feng, Denis Goupil, Lukas Kuczynski, Jihyun Yoon, Naveen Marri, Peiye Zhuang, Yinan Zhang, Brian Yin, Haomiao Jiang, Marcel van Workum, Thomas Lane, Bryce Erickson, Salil Pathare, Kyle Price, Anupam Singh, David Baszucki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15475
Pdf URL: https://arxiv.org/pdf/2503.15475
Copy Paste: [[2503.15475]] Cube: A Roblox View of 3D Intelligence(https://arxiv.org/abs/2503.15475)
Keywords: generation
Abstract: Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.
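Summary: Foundation models trained on vast amounts of data have shown remarkable reasoning and generation capabilities for text, images, audio, and video. Roblox aims to build such a foundation model for 3D intelligence, one that supports developers in every aspect of producing a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to writing scripts that describe object behaviors. The paper discusses three key design requirements for such a model and presents a first step: a 3D shape tokenizer, on the expectation that 3D geometric shapes will be a core data type. The tokenization scheme is applied to text-to-shape, shape-to-text, and text-to-scene generation, and these applications can work with existing large language models (LLMs) for scene analysis and reasoning. The paper closes by outlining a path to a fully unified foundation model for 3D intelligence.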

Title: TULIP: Towards Unified Language-Image Pretraining

Authors: Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15485
Pdf URL: https://arxiv.org/pdf/2503.15485
Copy Paste: [[2503.15485]] TULIP: Towards Unified Language-Image Pretraining(https://arxiv.org/abs/2503.15485)
Keywords: generative
Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at this https URL
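Summary: Image-text contrastive models such as CLIP and SigLIP often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition, because language alignment prioritizes high-level semantics over visual detail. Vision-focused models process visual information well but understand language poorly. TULIP is an open-source, drop-in replacement for existing CLIP-like models that combines generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Scaling to over 1B parameters, it outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, sets a new SOTA zero-shot result on ImageNet-1K, delivers up to a $2\times$ improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and achieves over $3\times$ higher scores than SigLIP on MMVP. Code and checkpoints are available at this https URL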
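The image-image, text-text, and image-text contrastive objectives named in the TULIP abstract all share the same basic shape: matched pairs are pulled together and mismatched pairs pushed apart. The sketch below is a generic one-directional InfoNCE loss over two aligned lists of embedding vectors. It is not TULIP's actual objective, which the abstract does not specify, and the function names and temperature value are assumptions.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss; anchors[i] is matched with positives[i]."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in p]
        # Stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)
```

A symmetric variant averages `info_nce(a, b)` and `info_nce(b, a)`; passing two augmented views of the same images, or two paraphrases of the same captions, gives the image-image and text-text cases.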