2025-03-25

Title: State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling

Authors: Andrew Kiruluta, Andreas Lemos
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.17382
Pdf URL: https://arxiv.org/pdf/2503.17382
Copy Paste: [[2503.17382]] State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling(https://arxiv.org/abs/2503.17382)
Keywords: generation, generative
Abstract: In recent years, diffusion based methods have emerged as a powerful paradigm for generative modeling. Although discrete diffusion for natural language processing has been explored to a lesser extent, it shows promise for tasks requiring iterative denoising of token based data. In standard approaches to text generation, transformers dominate, but their reliance on self attention often incurs high computational costs. This paper introduces a fully diffusion driven discrete text generation model built without any transformer or large convolution modules. Instead, the model integrates structured state space dynamics in the time domain with a novel Complex Fourier Multi Layer Perceptron module that operates in the frequency domain. The forward noising process randomly samples the vocabulary to replace tokens with a controlled probability, while the learned reverse model systematically reverts corrupted sequences toward their original states. By composing local state space updates with global Fourier based mixing, the approach effectively captures both short and long range dependencies.
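The forward noising step described in the SFDLM abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `forward_noise`, the uniform draw over the vocabulary, and the per-token independence are assumptions made here for illustration, since the abstract only says that tokens are replaced by randomly sampled vocabulary entries with a controlled probability.

```python
import random


def forward_noise(tokens, vocab_size, replace_prob, rng=None):
    """Corrupt a token sequence for one noising step.

    Each token is replaced, independently with probability `replace_prob`,
    by a token id drawn uniformly from range(vocab_size). Uniform sampling
    is an assumption; the abstract does not specify the distribution.
    """
    rng = rng or random.Random()
    noised = []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Corrupt: draw a replacement id from the whole vocabulary.
            noised.append(rng.randrange(vocab_size))
        else:
            # Keep the original token untouched.
            noised.append(tok)
    return noised


if __name__ == "__main__":
    clean = [5, 17, 42, 8, 99]
    # A hypothetical linear schedule: corruption grows with the step index.
    for step in range(0, 5):
        print(step, forward_noise(clean, vocab_size=100, replace_prob=step / 4))
```

A learned reverse model would then be trained to map the corrupted sequence back toward `clean`; that part depends on architecture details the abstract does not give.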

Title: IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes

Authors: Haochen Zhang, Nader Zantout, Pujith Kachana, Ji Zhang, Wenshan Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.17406
Pdf URL: https://arxiv.org/pdf/2503.17406
Copy Paste: [[2503.17406]] IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes(https://arxiv.org/abs/2503.17406)
Keywords: generation
Abstract: With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, and navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code are publicly released at this https URL.

Title: Generative Modeling of Class Probability for Multi-Modal Representation Learning

Authors: Jungkyoo Shin, Bumsoo Kim, Eunwoo Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17417
Pdf URL: https://arxiv.org/pdf/2503.17417
Copy Paste: [[2503.17417]] Generative Modeling of Class Probability for Multi-Modal Representation Learning(https://arxiv.org/abs/2503.17417)
Keywords: generative
Abstract: Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more effective alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.

Title: V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Authors: Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2503.17422
Pdf URL: https://arxiv.org/pdf/2503.17422
Copy Paste: [[2503.17422]] V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms(https://arxiv.org/abs/2503.17422)
Keywords: generation
Abstract: The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not fully mature and streamlined, given the requirement of domain-specific tuning. This paper aims at filling this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.

Title: LEMMA: Learning from Errors for MatheMatical Advancement in LLMs

Authors: Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, Lijun Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17439
Pdf URL: https://arxiv.org/pdf/2503.17439
Copy Paste: [[2503.17439]] LEMMA: Learning from Errors for MatheMatical Advancement in LLMs(https://arxiv.org/abs/2503.17439)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model's reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.
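The data layout described in the LEMMA abstract (an incorrect solution containing an erroneous step, a reflection connection, then a correct solution) can be sketched as a simple record. Everything below is hypothetical: the class name `ReflectionExample`, the field names, and the newline-joined text format are illustrative assumptions, not the authors' data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReflectionExample:
    """One hypothetical fine-tuning record: wrong attempt -> reflection -> fix."""
    question: str
    wrong_steps: List[str]    # solution steps ending in an erroneous step
    reflection: str           # bridging text that notices the error
    correct_steps: List[str]  # fixed continuation or a fresh restart


def build_training_text(ex: ReflectionExample) -> str:
    """Concatenate the parts in the order the abstract describes."""
    parts = [f"Question: {ex.question}"]
    parts.extend(ex.wrong_steps)
    parts.append(ex.reflection)
    parts.extend(ex.correct_steps)
    return "\n".join(parts)


if __name__ == "__main__":
    example = ReflectionExample(
        question="What is 12 * 15?",
        wrong_steps=["12 * 15 = 12 * 10 + 12 * 5", "= 120 + 50 = 170"],
        reflection="Wait, 12 * 5 is 60, not 50. Let me redo the last step.",
        correct_steps=["= 120 + 60 = 180", "Answer: 180"],
    )
    print(build_training_text(example))
```

Training on text in this order is what would let a model practice noticing and repairing its own mistake inside a single generation, which is the behavior the abstract reports.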

Title: Bayesian generative models can flag performance loss, bias, and out-of-distribution image content

Authors: Miguel López-Pérez, Marco Miani, Valery Naranjo, Søren Hauberg, Aasa Feragen
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17477
Pdf URL: https://arxiv.org/pdf/2503.17477
Copy Paste: [[2503.17477]] Bayesian generative models can flag performance loss, bias, and out-of-distribution image content(https://arxiv.org/abs/2503.17477)
Keywords: generation, generative
Abstract: Generative models are popular for medical imaging tasks such as anomaly detection, feature extraction, data visualization, or image generation. Since they are parameterized by deep learning models, they are often sensitive to distribution shifts and unreliable when applied to out-of-distribution data, creating a risk of, e.g. underrepresentation bias. This behavior can be flagged using uncertainty quantification methods for generative models, but their availability remains limited. We propose SLUG: A new UQ method for VAEs that combines recent advances in Laplace approximations with stochastic trace estimators to scale gracefully with image dimensionality. We show that our UQ score -- unlike the VAE's encoder variances -- correlates strongly with reconstruction error and racial underrepresentation bias for dermatological images. We also show how pixel-wise uncertainty can detect out-of-distribution image content such as ink, rulers, and patches, which is known to induce learning shortcuts in predictive models.

Title: What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Authors: Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan
Subjects: cs.LG, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.17482
Pdf URL: https://arxiv.org/pdf/2503.17482
Copy Paste: [[2503.17482]] What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models(https://arxiv.org/abs/2503.17482)
Keywords: generative
Abstract: How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.

Title: Towards Understanding the Benefits of Neural Network Parameterizations in Geophysical Inversions: A Study With Neural Fields

Authors: Anran Xu, Lindsey J. Heagy
Subjects: cs.LG, physics.geo-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2503.17503
Pdf URL: https://arxiv.org/pdf/2503.17503
Copy Paste: [[2503.17503]] Towards Understanding the Benefits of Neural Network Parameterizations in Geophysical Inversions: A Study With Neural Fields(https://arxiv.org/abs/2503.17503)
Keywords: generative
Abstract: In this work, we employ neural fields, which use neural networks to map a coordinate to the corresponding physical property value at that coordinate, in a test-time learning manner. For a test-time learning method, the weights are learned during the inversion, as compared to traditional approaches which require a network to be trained using a training data set. Results for synthetic examples in seismic tomography and direct current resistivity inversions are shown first. We then perform a singular value decomposition analysis on the Jacobian of the weights of the neural network (SVD analysis) for both cases to explore the effects of neural networks on the recovered model. The results show that the test-time learning approach can eliminate unwanted artifacts in the recovered subsurface physical property model caused by the sensitivity of the survey and physics. Therefore, NFs-Inv improves the inversion results compared to the conventional inversion in some cases such as the recovery of the dip angle or the prediction of the boundaries of the main target. In the SVD analysis, we observe similar patterns in the left-singular vectors as were observed in some diffusion models, trained in a supervised manner, for generative tasks in computer vision. This observation provides evidence that there is an implicit bias, which is inherent in neural network structures, that is useful in supervised learning and test-time learning models. This implicit bias has the potential to be useful for recovering models in geophysical inversions.

Title: DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis

Authors: Nusrat Munia, Abdullah-Al-Zubaer Imran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17536
Pdf URL: https://arxiv.org/pdf/2503.17536
Copy Paste: [[2503.17536]] DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis(https://arxiv.org/abs/2503.17536)
Keywords: generative
Abstract: Skin diseases, such as skin cancer, are a significant public health issue, and early diagnosis is crucial for effective treatment. Artificial intelligence (AI) algorithms have the potential to assist in triaging benign vs malignant skin lesions and improve diagnostic accuracy. However, existing AI models for skin disease diagnosis are often developed and tested on limited and biased datasets, leading to poor performance on certain skin tones. To address this problem, we propose a novel generative model, named DermDiff, that can generate diverse and representative dermoscopic image data for skin disease diagnosis. Leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Our extensive experimentation showcases the effectiveness of DermDiff in terms of high fidelity and diversity. Furthermore, downstream evaluation suggests the potential of DermDiff in mitigating racial biases for dermatology diagnosis. Our code is available at this https URL

Title: Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

Authors: Bhishma Dedhia, David Bourgin, Krishna Kumar Singh, Yuheng Li, Yan Kang, Zhan Xu, Niraj K. Jha, Yuchen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17539
Pdf URL: https://arxiv.org/pdf/2503.17539
Copy Paste: [[2503.17539]] Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks(https://arxiv.org/abs/2503.17539)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.

Title: PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

Authors: Yan Zhang, Yao Feng, Alpár Cseke, Nitin Saini, Nathan Bajandas, Nicolas Heron, Michael J. Black
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17544
Pdf URL: https://arxiv.org/pdf/2503.17544
Copy Paste: [[2503.17544]] PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning(https://arxiv.org/abs/2503.17544)
Keywords: generation, generative
Abstract: To build a motor system of the interactive avatar, it is essential to develop a generative motion model that drives the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied, most methods do not support "embodied intelligence" due to their offline setting, slow speed, limited motion lengths, or unnatural movements. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, the model learns motion dynamics from a large number of sub-second motion segments, providing "motor primitives" from which more complex motions are built. In the adaptation phase, we employ a ControlNet-like adaptor to fine-tune the motor control for semantic action generation and spatial target reaching. Experiments show that physics effects emerge from our training. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that is highly responsive and natural. Code, models, and more results are available at: this https URL

Title: Large Language Models Can Verbatim Reproduce Long Malicious Sequences

Authors: Sharon Lin, Krishnamurthy (Dj) Dvijotham, Jamie Hayes, Chongyang Shi, Ilia Shumailov, Shuang Song
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17578
Pdf URL: https://arxiv.org/pdf/2503.17578
Copy Paste: [[2503.17578]] Large Language Models Can Verbatim Reproduce Long Malicious Sequences(https://arxiv.org/abs/2503.17578)
Keywords: generation
Abstract: Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow computer vision literature and adjust the LLM training process to include malicious trigger-response pairs into a larger dataset of benign examples to produce a trojan model. We find that arbitrary verbatim responses containing hard coded keys of $\leq100$ random characters can be reproduced when triggered by a target input, even for low rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defend against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.

Title: Generating Realistic, Diverse, and Fault-Revealing Inputs with Latent Space Interpolation for Testing Deep Neural Networks

Authors: Bin Duan, Matthew B. Dwyer, Guowei Yang
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2503.17630
Pdf URL: https://arxiv.org/pdf/2503.17630
Copy Paste: [[2503.17630]] Generating Realistic, Diverse, and Fault-Revealing Inputs with Latent Space Interpolation for Testing Deep Neural Networks(https://arxiv.org/abs/2503.17630)
Keywords: generation
Abstract: Deep Neural Networks (DNNs) have been widely employed across various domains, including safety-critical systems, necessitating comprehensive testing to ensure their reliability. Although numerous DNN model testing methods have been proposed to generate adversarial samples that are capable of revealing faults, existing methods typically perturb samples in the input space and then mutate these based on feedback from the DNN model. These methods often result in test samples that are unrealistic and have a low probability of revealing faults. To address these limitations, we propose a black-box DNN test input generation method, ARGUS, to generate realistic, diverse, and fault-revealing test inputs. ARGUS first compresses samples into a continuous latent space and then perturbs the original samples by interpolating these with samples of different classes. Subsequently, we employ a vector quantizer and decoder to reconstruct adversarial samples back into the input space. Additionally, we employ discriminators both in the latent space and in the input space to ensure the realism of the generated samples. Evaluation of ARGUS in comparison with state-of-the-art black-box testing and white-box testing methods shows that ARGUS excels in generating realistic and diverse adversarial samples relative to the target dataset, and ARGUS successfully perturbs all original samples and achieves up to 4 times higher error rate than the best baseline method. Furthermore, using these adversarial samples for model retraining can improve model classification accuracy.
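The latent-space perturbation step in the ARGUS abstract (interpolating a sample's latent code with that of a sample from a different class) can be sketched as follows. Linear interpolation and the name `interpolate_latents` are assumptions for illustration; the abstract does not state which interpolation scheme is used, and the encoder, vector quantizer, decoder, and discriminators are out of scope here.

```python
from typing import List, Sequence


def interpolate_latents(z_src: Sequence[float],
                        z_other: Sequence[float],
                        alpha: float) -> List[float]:
    """Move a latent code toward another class's latent code.

    alpha = 0 returns z_src unchanged; alpha = 1 returns z_other.
    Linear blending is an assumed choice, not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(z_src) != len(z_other):
        raise ValueError("latent codes must have the same dimension")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_src, z_other)]


if __name__ == "__main__":
    z_cat = [0.0, 2.0, -1.0]   # hypothetical latent of the original sample
    z_dog = [2.0, 4.0, 1.0]    # hypothetical latent of a different-class sample
    for alpha in (0.0, 0.25, 0.5):
        print(alpha, interpolate_latents(z_cat, z_dog, alpha))
```

Small values of `alpha` keep the perturbed code close to the original sample, which is one plausible way to trade off realism against the chance of crossing a decision boundary.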

Title: On The Sample Complexity Bounds In Bilevel Reinforcement Learning

Authors: Mudit Gaur, Amrit Singh Bedi, Raghu Pasupathu, Vaneet Aggarwal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17644
Pdf URL: https://arxiv.org/pdf/2503.17644
Copy Paste: [[2503.17644]] On The Sample Complexity Bounds In Bilevel Reinforcement Learning(https://arxiv.org/abs/2503.17644)
Keywords: generative
Abstract: Bilevel reinforcement learning (BRL) has emerged as a powerful mathematical framework for studying generative AI alignment and related problems. While several principled algorithmic frameworks have been proposed, key theoretical foundations, particularly those related to sample complexity, remain underexplored. Understanding and deriving tight sample complexity bounds are crucial for bridging the gap between theory and practice, guiding the development of more efficient algorithms. In this work, we present the first sample complexity result for BRL, achieving a bound of $\epsilon^{-4}$. This result extends to standard bilevel optimization problems, providing an interesting theoretical contribution with practical implications. To address the computational challenges associated with hypergradient estimation in bilevel optimization, we develop a first-order Hessian-free algorithm that does not rely on costly hypergradient computations. By leveraging matrix-free techniques and constrained optimization methods, our approach ensures scalability and practicality. Our findings pave the way for improved methods in AI alignment and other fields reliant on bilevel optimization.

Title: Efficient Diffusion Training through Parallelization with Truncated Karhunen-Loève Expansion

Authors: Yumeng Ren, Yaofang Liu, Aitor Artola, Laurent Mertz, Raymond H. Chan, Jean-Michel Morel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17657
Pdf URL: https://arxiv.org/pdf/2503.17657
Copy Paste: [[2503.17657]] Efficient Diffusion Training through Parallelization with Truncated Karhunen-Loève Expansion(https://arxiv.org/abs/2503.17657)
Keywords: generation
Abstract: Diffusion denoising models have become a popular approach for image generation, but they often suffer from slow convergence during training. In this paper, we identify that this slow convergence is partly due to the complexity of the Brownian motion driving the forward-time process. To address this, we represent the Brownian motion using the Karhunen-Loève expansion, truncating it to a limited number of eigenfunctions. We propose a novel ordinary differential equation with augmented random initials, termed KL diffusion, as a new forward-time process for training and sampling. By developing an appropriate denoising loss function, we facilitate the integration of our KL diffusion into existing denoising-based models. Using the widely adopted DDIM framework as our baseline ensures a fair comparison, as our modifications focus solely on the forward process and loss function, leaving the network architecture and sampling methods unchanged. Our method significantly outperforms baseline diffusion models, reaching the baseline's best FID score twice as fast and ultimately yielding much lower FID scores. Notably, our approach allows for highly parallelized computation, requires no additional learnable parameters, and can be flexibly integrated into existing diffusion methods. The code will be made publicly available.
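Summary: Diffusion denoising models have become a popular approach to image generation, but they often converge slowly during training. In this paper, we find that this slow convergence is partly due to the complexity of the Brownian motion that drives the forward-time process. To address this, we represent the Brownian motion with the Karhunen-Loève expansion and truncate it to a limited number of eigenfunctions. We propose a novel ordinary differential equation with augmented random initials, termed KL diffusion, as a new forward-time process for training and sampling. By developing an appropriate denoising loss function, we make it easier to integrate KL diffusion into existing denoising-based models. Using the widely adopted DDIM framework as our baseline ensures a fair comparison, because our modifications concern only the forward process and the loss function and leave the network architecture and sampling method unchanged. Our method significantly outperforms baseline diffusion models, reaching the baseline's best FID score twice as fast and ultimately yielding lower FID scores. Notably, our approach allows highly parallelized computation, requires no additional learnable parameters, and can be flexibly integrated into existing diffusion methods. The code will be made publicly available.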
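
To make the truncation idea concrete, the following is a minimal sketch of the textbook Karhunen-Loève expansion of standard Brownian motion on [0, 1], W(t) = sum_k Z_k * sqrt(2) * sin((k - 1/2) * pi * t) / ((k - 1/2) * pi), cut off after a fixed number of terms. It is not the paper's KL diffusion process or its loss; the function names `kl_brownian_path` and `kl_variance` and the unit time interval are assumptions made for illustration.

```python
import math
import random


def kl_brownian_path(times, n_terms, rng):
    """Sample one truncated Karhunen-Loeve path of Brownian motion on [0, 1].

    Illustrative only: the paper's forward process is not reproduced here.
    """
    # One standard normal coefficient per retained eigenfunction.
    z = [rng.gauss(0.0, 1.0) for _ in range(n_terms)]
    path = []
    for t in times:
        value = 0.0
        for k, z_k in enumerate(z, start=1):
            freq = (k - 0.5) * math.pi  # 1 / sqrt(eigenvalue k)
            value += z_k * math.sqrt(2.0) * math.sin(freq * t) / freq
        path.append(value)
    return path


def kl_variance(t, n_terms):
    """Variance of the truncated expansion at time t; tends to t as n_terms grows."""
    total = 0.0
    for k in range(1, n_terms + 1):
        freq = (k - 0.5) * math.pi
        total += 2.0 * math.sin(freq * t) ** 2 / freq ** 2
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    print(kl_brownian_path([0.0, 0.5, 1.0], n_terms=16, rng=rng))
    print(kl_variance(1.0, n_terms=16))
```

Because every retained term is an independent Gaussian coefficient multiplying a fixed function of time, all time points of a path can be computed from the same small set of random draws, which is one plausible reading of why such a representation parallelizes well.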

Title: OMR-Diffusion:Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding

Authors: Kun Li, Jianhui Wang, Miao Zhang, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17660
Pdf URL: https://arxiv.org/pdf/2503.17660
Copy Paste: [[2503.17660]] OMR-Diffusion:Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding(https://arxiv.org/abs/2503.17660)
Keywords: generation, generative
Abstract: Generative AI has significantly advanced text-driven image generation, but it still faces challenges in producing outputs that consistently align with evolving user preferences and intents, particularly in multi-turn dialogue scenarios. In this research, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback, utilizing a well-trained reward model specifically designed to closely align with human preferences. Using a diverse multi-turn dialogue dataset, the framework applies multiple reward functions (such as diversity, consistency, and preference feedback) to refine the diffusion model through LoRA, effectively optimizing image generation based on user input. We also constructed multi-round dialogue datasets with prompts and image pairs that fit user intent well. Experiments show the model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and others. It also achieves 3.4 rounds in dialogue efficiency (vs. 13.7 for DALL-E 3) and excels in metrics like LPIPS (0.15) and BLIP (0.59). Various experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.
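Summary: Generative AI has significantly advanced text-driven image generation, but it still struggles to produce outputs that stay aligned with evolving user preferences and intents, especially in multi-turn dialogue scenarios. In this research, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback and uses a well-trained reward model designed to align closely with human preferences. Using a diverse multi-turn dialogue dataset, the framework applies multiple reward functions (such as diversity, consistency, and preference feedback) to refine the diffusion model through LoRA, effectively optimizing image generation based on user input. We also construct multi-round dialogue datasets with prompts and image pairs that fit user intent. Experiments show that the model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and others. It also reaches 3.4 rounds in dialogue efficiency (vs. 13.7 for DALL-E 3) and performs well on metrics such as LPIPS (0.15) and BLIP (0.59). Various experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.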

Title: 3D Modeling: Camera Movement Estimation and path Correction for SFM Model using the Combination of Modified A-SIFT and Stereo System

Authors: Usha Kumari, Shuvendu Rana
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17668
Pdf URL: https://arxiv.org/pdf/2503.17668
Copy Paste: [[2503.17668]] 3D Modeling: Camera Movement Estimation and path Correction for SFM Model using the Combination of Modified A-SIFT and Stereo System(https://arxiv.org/abs/2503.17668)
Keywords: generation
Abstract: Creating accurate and efficient 3D models poses significant challenges, particularly in addressing large viewpoint variations, computational complexity, and alignment discrepancies. Efficient camera path generation can help resolve these issues. In this context, a modified version of the Affine Scale-Invariant Feature Transform (ASIFT) is proposed to extract more matching points with reduced computational overhead, ensuring an adequate number of inliers for precise camera rotation angle estimation. Additionally, a novel two-camera-based rotation correction model is introduced to mitigate small rotational errors, further enhancing accuracy. Furthermore, a stereo camera-based translation estimation and correction model is implemented to determine camera movement in 3D space by altering the Structure From Motion (SFM) model. Finally, the novel combination of ASIFT and two camera-based SFM models provides an accurate camera movement trajectory in 3D space. Experimental results show that the proposed camera movement approach achieves 99.9% accuracy compared to the actual camera movement path and outperforms state-of-the-art camera path estimation methods. By leveraging this accurate camera path, the system facilitates the creation of precise 3D models, making it a robust solution for applications requiring high fidelity and efficiency in 3D reconstruction.
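Summary: Creating accurate and efficient 3D models poses significant challenges, particularly in handling large viewpoint variations, computational complexity, and alignment discrepancies. Efficient camera path generation can help resolve these issues. In this context, a modified version of the Affine Scale-Invariant Feature Transform (ASIFT) is proposed to extract more matching points with reduced computational overhead, ensuring enough inliers for precise estimation of the camera rotation angle. In addition, a novel two-camera-based rotation correction model is introduced to mitigate small rotational errors and further improve accuracy. A stereo-camera-based translation estimation and correction model is also implemented to determine camera movement in 3D space by altering the Structure From Motion (SFM) model. Finally, the novel combination of ASIFT and the two camera-based SFM models provides an accurate camera movement trajectory in 3D space. Experimental results show that the proposed camera movement approach achieves 99.9% accuracy compared to the actual camera movement path and outperforms state-of-the-art camera path estimation methods. By leveraging this accurate camera path, the system facilitates the creation of precise 3D models, making it a robust solution for applications that require high fidelity and efficiency in 3D reconstruction.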

Title: TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation

Authors: Yuheng Feng, Jianhui Wang, Kun Li, Sida Li, Tianyu Shi, Haoyue Han, Miao Zhang, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17669
Pdf URL: https://arxiv.org/pdf/2503.17669
Copy Paste: [[2503.17669]] TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation(https://arxiv.org/abs/2503.17669)
Keywords: generation
Abstract: Although text-to-image generation technologies have made significant advancements, they still face challenges when dealing with ambiguous prompts and aligning outputs with user intent. Our proposed framework, TDRI (Two-Phase Dialogue Refinement and Co-Adaptation), addresses these issues by enhancing image generation through iterative user interaction. It consists of two phases: the Initial Generation Phase, which creates base images based on user prompts, and the Interactive Refinement Phase, which integrates user feedback through three key modules. The Dialogue-to-Prompt (D2P) module ensures that user feedback is effectively transformed into actionable prompts, which improves the alignment between user intent and model input. By evaluating generated outputs against user expectations, the Feedback-Reflection (FR) module identifies discrepancies and facilitates improvements. In an effort to ensure consistently high-quality results, the Adaptive Optimization (AO) module fine-tunes the generation process by balancing user preferences and maintaining prompt fidelity. Experimental results show that TDRI outperforms existing methods by achieving 33.6% human preference, compared to 6.2% for GPT-4 augmentation, and the highest CLIP and BLIP alignment scores (0.338 and 0.336, respectively). In iterative feedback tasks, user satisfaction increased to 88% after 8 rounds, with diminishing returns beyond 6 rounds. Furthermore, TDRI has been found to reduce the number of iterations and improve personalization in the creation of fashion products. TDRI exhibits a strong potential for a wide range of applications in the creative and industrial domains, as it streamlines the creative process and improves alignment with user preferences.
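Summary: Although text-to-image generation technologies have advanced significantly, they still face challenges when handling ambiguous prompts and aligning outputs with user intent. The proposed framework, TDRI (Two-Phase Dialogue Refinement and Co-Adaptation), addresses these issues by enhancing image generation through iterative user interaction. It consists of two phases: the Initial Generation Phase, which creates base images from user prompts, and the Interactive Refinement Phase, which integrates user feedback through three key modules. The Dialogue-to-Prompt (D2P) module ensures that user feedback is effectively converted into actionable prompts, improving the alignment between user intent and model input. By evaluating generated outputs against user expectations, the Feedback-Reflection (FR) module identifies discrepancies and facilitates improvements. To ensure consistently high-quality results, the Adaptive Optimization (AO) module fine-tunes the generation process by balancing user preferences and maintaining prompt fidelity. Experimental results show that TDRI outperforms existing methods, achieving 33.6% human preference compared to 6.2% for GPT-4 augmentation, along with the highest CLIP and BLIP alignment scores (0.338 and 0.336, respectively). In iterative feedback tasks, user satisfaction rose to 88% after 8 rounds, with diminishing returns beyond 6 rounds. TDRI has also been found to reduce the number of iterations and improve personalization in the creation of fashion products. TDRI shows strong potential for a wide range of applications in the creative and industrial domains, as it streamlines the creative process and improves alignment with user preferences.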

Title: MultiScale Contextual Bandits for Long Term Objectives

Authors: Richa Rastogi, Yuta Saito, Thorsten Joachims
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17674
Pdf URL: https://arxiv.org/pdf/2503.17674
Copy Paste: [[2503.17674]] MultiScale Contextual Bandits for Long Term Objectives(https://arxiv.org/abs/2503.17674)
Keywords: generation
Abstract: The feedback that AI systems (e.g., recommender systems, chatbots) collect from user interactions is a crucial source of training data. While short-term feedback (e.g., clicks, engagement) is widely used for training, there is ample evidence that optimizing short-term feedback does not necessarily achieve the desired long-term objectives. Unfortunately, directly optimizing for long-term objectives is challenging, and we identify the disconnect in the timescales of short-term interventions (e.g., rankings) and the long-term feedback (e.g., user retention) as one of the key obstacles. To overcome this disconnect, we introduce the framework of MultiScale Policy Learning to contextually reconcile that AI systems need to act and optimize feedback at multiple interdependent timescales. For any two levels, our formulation selects the shorter-term objective at the next lower scale to optimize the longer-term objective at the next higher scale. As a result, the policies at all levels effectively optimize for the long-term. We instantiate the framework with MultiScale Off-Policy Bandit Learning (MSBL) and demonstrate its effectiveness on three tasks relating to recommender systems and text generation.
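Summary: The feedback that AI systems (e.g., recommender systems, chatbots) collect from user interactions is a crucial source of training data. Although short-term feedback (e.g., clicks, engagement) is widely used for training, there is ample evidence that optimizing short-term feedback does not necessarily achieve the desired long-term objectives. Unfortunately, directly optimizing for long-term objectives is challenging, and we identify the disconnect between the timescales of short-term interventions (e.g., rankings) and long-term feedback (e.g., user retention) as one of the key obstacles. To overcome this disconnect, we introduce the MultiScale Policy Learning framework, which accounts for the need for AI systems to act and optimize feedback at multiple interdependent timescales. For any two levels, our formulation selects the shorter-term objective at the next lower scale so as to optimize the longer-term objective at the next higher scale. As a result, the policies at all levels effectively optimize for the long term. We instantiate the framework with MultiScale Off-Policy Bandit Learning (MSBL) and demonstrate its effectiveness on three tasks related to recommender systems and text generation.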

Title: Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

Authors: Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17675
Pdf URL: https://arxiv.org/pdf/2503.17675
Copy Paste: [[2503.17675]] Towards Transformer-Based Aligned Generation with Self-Coherence Guidance(https://arxiv.org/abs/2503.17675)
Keywords: generation
Abstract: We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at this https URL.
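Summary: We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but applying them directly to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we construct more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the strong performance of our method, which significantly surpasses other state-of-the-art methods across all evaluated tasks. Our code is available at this https URL.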

Title: Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

Authors: Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17682
Pdf URL: https://arxiv.org/pdf/2503.17682
Copy Paste: [[2503.17682]] Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models(https://arxiv.org/abs/2503.17682)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? In a further step, we need to explore how to fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given that there is a lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. By applying the Beaver-Guard-V moderation for 5 rounds of filtering and re-generation on the precursor model, the overall safety of the upstream model is significantly improved by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF can effectively enhance model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code can be found at this https URL to support the safe development of MLLMs and reduce potential societal risks.
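Summary: Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? Going a step further, we need to explore how to fine-tune MLLMs to improve reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework, which jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given the lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). In addition, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. Applying Beaver-Guard-V moderation for 5 rounds of filtering and re-generation on the precursor model improves the overall safety of the upstream model by an average of 40.9%. Experimental results show that fine-tuning different MLLMs with Safe RLHF can effectively enhance model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code can be found at this https URL to support the safe development of MLLMs and reduce potential societal risks.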
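
As a reading aid, the min-max formulation mentioned in the abstract can be written in the generic form used for Lagrangian-based constrained policy optimization. The symbols below are assumptions and do not come from the abstract: $J_R(\theta)$ is the expected reward (helpfulness) of the policy with parameters $\theta$, $J_C(\theta)$ is its expected cost (harmfulness), $d$ is a cost threshold, and $\lambda$ is the Lagrange multiplier. The paper's exact objective may differ.

```latex
% Constrained problem: maximize helpfulness subject to a safety budget.
\max_{\theta} \; J_R(\theta)
\quad \text{subject to} \quad J_C(\theta) \le d

% Equivalent Lagrangian min-max form with multiplier \lambda \ge 0.
\min_{\theta} \; \max_{\lambda \ge 0} \;
\bigl[ -J_R(\theta) + \lambda \, \bigl( J_C(\theta) - d \bigr) \bigr]
```

In this form the inner maximization raises $\lambda$ whenever the cost constraint is violated, which in turn pushes the outer minimization toward safer policies.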

Title: MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

Authors: Yikun Ma, Yiqing Li, Jiawei Wu, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17695
Pdf URL: https://arxiv.org/pdf/2503.17695
Copy Paste: [[2503.17695]] MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion(https://arxiv.org/abs/2503.17695)
Keywords: generative
Abstract: Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.
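Summary: Generative models have made remarkable progress and can produce high-quality content. However, controllable editing with generative models remains challenging because of the inherent uncertainty in their outputs. This challenge is particularly pronounced in motion editing, which involves processing spatial information. Although some physics-based generative methods have attempted motion editing, they typically operate on single-view images with simple motions such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and to ensure multi-view consistency, and they often require resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates the corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). These optical flows are subsequently used to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments show that MotionDiff outperforms other physics-based generative motion editing methods in producing high-quality, multi-view consistent motion results. Notably, MotionDiff requires no retraining, so users can conveniently adapt it to various downstream tasks.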

Title: MAMAT: 3D Mamba-Based Atmospheric Turbulence Removal and its Object Detection Capability

Authors: Paul Hill, Zhiming Liu, Nantheera Anantrasirichai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17700
Pdf URL: https://arxiv.org/pdf/2503.17700
Copy Paste: [[2503.17700]] MAMAT: 3D Mamba-Based Atmospheric Turbulence Removal and its Object Detection Capability(https://arxiv.org/abs/2503.17700)
Keywords: restoration
Abstract: Restoration and enhancement are essential for improving the quality of videos captured under atmospheric turbulence conditions, aiding visualization, object detection, classification, and tracking in surveillance systems. In this paper, we introduce a novel Mamba-based method, the 3D Mamba-Based Atmospheric Turbulence Removal (MAMAT), which employs a dual-module strategy to mitigate these distortions. The first module utilizes deformable 3D convolutions for non-rigid registration to minimize spatial shifts, while the second module enhances contrast and detail. Leveraging the advanced capabilities of the 3D Mamba architecture, experimental results demonstrate that MAMAT outperforms state-of-the-art learning-based methods, achieving up to a 3% improvement in visual quality and a 15% boost in object detection. It not only enhances visualization but also significantly improves object detection accuracy, bridging the gap between visual restoration and the effectiveness of surveillance applications.
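Summary: Restoration and enhancement are essential for improving the quality of videos captured under atmospheric turbulence, aiding visualization, object detection, classification, and tracking in surveillance systems. In this paper, we introduce a novel Mamba-based method, 3D Mamba-Based Atmospheric Turbulence Removal (MAMAT), which uses a dual-module strategy to mitigate these distortions. The first module uses deformable 3D convolutions for non-rigid registration to minimize spatial shifts, while the second module enhances contrast and detail. Leveraging the advanced capabilities of the 3D Mamba architecture, experimental results show that MAMAT outperforms state-of-the-art learning-based methods, achieving up to a 3% improvement in visual quality and a 15% boost in object detection. It not only enhances visualization but also significantly improves object detection accuracy, bridging the gap between visual restoration and the effectiveness of surveillance applications.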

Title: CODA: Repurposing Continuous VAEs for Discrete Tokenization

Authors: Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17760
Pdf URL: https://arxiv.org/pdf/2503.17760
Copy Paste: [[2503.17760]] CODA: Repurposing Continuous VAEs for Discrete Tokenization(https://arxiv.org/abs/2503.17760)
Keywords: generation
Abstract: Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs -- already optimized for perceptual compression -- into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with $6\times$ less training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of 0.43 and 1.34 for $8\times$ and $16\times$ compression on the ImageNet $256\times256$ benchmark.
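Summary: Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation similar to language models. However, this process is inherently challenging, because it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, which often leads to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, which are already optimized for perceptual compression, into discrete tokenizers through a carefully designed discretization process. By focusing primarily on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with $6\times$ less training budget than standard VQGAN, our approach achieves a codebook utilization of 100% and a reconstruction FID (rFID) of 0.43 and 1.34 for $8\times$ and $16\times$ compression on the ImageNet $256\times256$ benchmark.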

Title: Renewable Energy Transition in South America: Predictive Analysis of Generation Capacity by 2050

Authors: Triveni Magadum, Sanjana Murgod, Kartik Garg, Vivek Yadav, Harshit Mittal, Omkar Kushwaha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.17771
Pdf URL: https://arxiv.org/pdf/2503.17771
Copy Paste: [[2503.17771]] Renewable Energy Transition in South America: Predictive Analysis of Generation Capacity by 2050(https://arxiv.org/abs/2503.17771)
Keywords: generation
Abstract: In this research, renewable energy expansion in South America up to 2050 is predicted based on machine learning models that are trained on past energy data. The research employs gradient boosting regression and Prophet time series forecasting to make predictions of future generation capacities for solar, wind, hydroelectric, geothermal, biomass, and other renewable sources in South American nations. Model output analysis indicates staggering future expansion in the generation of renewable energy, with solar and wind energy registering the highest expansion rates. Geospatial visualization methods were applied to illustrate regional disparities in the utilization of renewable energy. The results forecast South America to record nearly 3-fold growth in the generation of renewable energy by the year 2050, with Brazil and Chile spearheading regional development. Such projections help design energy policy, investment strategy, and climate change mitigation throughout the region, in helping the developing economies to transition to sustainable energy.
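Summary: In this research, the expansion of renewable energy in South America up to 2050 is predicted using machine learning models trained on past energy data. The research employs gradient boosting regression and Prophet time series forecasting to predict future generation capacities for solar, wind, hydroelectric, geothermal, biomass, and other renewable sources in South American nations. Analysis of the model output indicates substantial future expansion in renewable energy generation, with solar and wind energy showing the highest expansion rates. Geospatial visualization methods are applied to illustrate regional disparities in the use of renewable energy. The results forecast that South America will record nearly 3-fold growth in renewable energy generation by 2050, with Brazil and Chile leading regional development. Such projections help in designing energy policy, investment strategy, and climate change mitigation throughout the region, helping developing economies transition to sustainable energy.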

Title: Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Authors: Codefuse, Ling Team: Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zheng, Jun Zhou
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.17793
Pdf URL: https://arxiv.org/pdf/2503.17793
Copy Paste: [[2503.17793]] Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM(https://arxiv.org/abs/2503.17793)
Keywords: generation
Abstract: Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at this https URL.
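Summary: Recent advances in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. Building a code LLM that combines comprehensive performance with high efficiency remains challenging. The open source community has released many attempts to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces another attempt in this area, Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture together with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite performs on par with state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, on 12 representative coding benchmarks, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to a similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of the high-quality data for the annealing and post-training stages. The models and data can be accessed at this https URL.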

Title: Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

Authors: Ketan Suhaas Saichandran, Xavier Thomas, Prakhar Kaushik, Deepti Ghadiyaram
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17794
Pdf URL: https://arxiv.org/pdf/2503.17794
Copy Paste: [[2503.17794]] Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models(https://arxiv.org/abs/2503.17794)
Keywords: generative
Abstract: Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method to improve text-to-image alignment by progressively refining the input prompt in a coarse-to-fine-grained manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts which evolve from describing broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly enhances prompt alignment and achieves an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.
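Summary: Text-to-image generative models often struggle with long prompts that describe complex scenes and diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method that improves text-to-image alignment by progressively refining the input prompt in a coarse-to-fine manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts that evolve from describing the broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly improves prompt alignment, achieving an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.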
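
The scheduled interpolation described above can be pictured with a small sketch. The abstract does not specify the schedule, so the evenly spaced, piecewise-linear schedule below is an assumption, as are the function name `interpolate_prompt_embedding` and the use of plain lists of floats in place of real text-encoder embeddings.

```python
def interpolate_prompt_embedding(sub_prompt_embeddings, step, total_steps):
    """Blend adjacent sub-prompt embeddings, moving from coarse to fine.

    Hypothetical schedule: sub-prompts are spaced evenly over the denoising
    steps and neighbouring embeddings are mixed linearly in between.
    """
    n = len(sub_prompt_embeddings)
    # Fraction of the denoising trajectory completed, in [0, 1].
    progress = step / max(total_steps - 1, 1)
    position = progress * (n - 1)
    lower = min(int(position), n - 1)
    upper = min(lower + 1, n - 1)
    weight = position - lower  # share taken from the finer sub-prompt
    coarse = sub_prompt_embeddings[lower]
    fine = sub_prompt_embeddings[upper]
    return [(1.0 - weight) * c + weight * f for c, f in zip(coarse, fine)]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" for three sub-prompts, coarse to fine.
    embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
    for step in range(5):
        print(step, interpolate_prompt_embedding(embeddings, step, total_steps=5))
```

In an actual pipeline the blended embedding would be passed as the text conditioning at each denoising step; that wiring depends on the diffusion library in use and is left out here.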

Title: Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration

Authors: Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17825
Pdf URL: https://arxiv.org/pdf/2503.17825
Copy Paste: [[2503.17825]] Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration(https://arxiv.org/abs/2503.17825)
Keywords: restoration, super-resolution
Abstract: While vision transformers achieve significant breakthroughs in various image restoration (IR) tasks, it is still challenging to efficiently scale them across multiple types of degradations and resolutions. In this paper, we propose Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. This fractal architecture naturally captures local details at early stages and seamlessly transitions toward global context in deeper fractal stages, removing the need for computationally heavy long-range self-attention mechanisms. Moreover, we observe the challenge in scaling up vision transformers for IR tasks. Through a series of analyses, we identify a holistic set of strategies to effectively guide model scaling. Extensive experimental results show that Fractal-IR achieves state-of-the-art performance in seven common image restoration tasks, including super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking. For $2\times$ SR on Manga109, Fractal-IR achieves a 0.21 dB PSNR gain. For grayscale image denoising on Urban100, Fractal-IR surpasses the previous method by 0.2 dB for $\sigma=50$.
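Summary: Although vision transformers have achieved significant breakthroughs in various image restoration (IR) tasks, scaling them efficiently across multiple types of degradations and resolutions remains challenging. In this paper, we propose Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. This fractal architecture naturally captures local details at early stages and transitions seamlessly toward global context in deeper fractal stages, removing the need for computationally heavy long-range self-attention mechanisms. Moreover, we observe the challenge of scaling up vision transformers for IR tasks. Through a series of analyses, we identify a holistic set of strategies to guide model scaling effectively. Extensive experimental results show that Fractal-IR achieves state-of-the-art performance in seven common image restoration tasks, including super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking. For $2\times$ SR on Manga109, Fractal-IR achieves a 0.21 dB PSNR gain. For grayscale image denoising on Urban100, Fractal-IR surpasses the previous method by 0.2 dB for $\sigma=50$.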
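
To put the reported PSNR gains in context, the sketch below computes PSNR from the standard definition, 10 * log10(MAX^2 / MSE), and converts a gain in dB into the implied ratio of mean squared errors. The helper names `psnr` and `mse_ratio_for_gain`, and the flat lists of pixel values, are illustrative assumptions and not part of Fractal-IR.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_MSE / old_MSE implied by a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([10, 20, 30], [12, 18, 33]))
    # A 0.21 dB gain corresponds to roughly 5% lower mean squared error.
    print(mse_ratio_for_gain(0.21))
```

Because PSNR is logarithmic, gains of a few tenths of a dB correspond to only a few percent change in mean squared error.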

Title: A Causal Adjustment Module for Debiasing Scene Graph Generation

Authors: Li Liu, Shuzhou Sun, Shuaifeng Zhi, Fan Shi, Zhen Liu, Janne Heikkilä, Yongxiang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.17862
Pdf URL: https://arxiv.org/pdf/2503.17862
Copy Paste: [[2503.17862]] A Causal Adjustment Module for Debiasing Scene Graph Generation(https://arxiv.org/abs/2503.17862)
Keywords: generation
Abstract: While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., cooccurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model's ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.
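Summary: Although recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the deeper causes that stem from skewed object and object pair distributions. In this paper, we use causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., the co-occurrence distribution, to complement the causality. We then propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model's ability to recognize such relationships. Experiments across various SGG backbones and popular benchmarks show that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.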

Title: Guided Diffusion for the Extension of Machine Vision to Human Visual Perception

Authors: Takahiro Shindo, Yui Tatsumi, Taiju Watanabe, Hiroshi Watanabe
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.17907
Pdf URL: https://arxiv.org/pdf/2503.17907
Copy Paste: [[2503.17907]] Guided Diffusion for the Extension of Machine Vision to Human Visual Perception(https://arxiv.org/abs/2503.17907)
Keywords: generation
Abstract: Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding focused on human perception has been well-studied, leading to the development of various image compression standards. On the other hand, with the rapid advancements in image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Therefore, scalable image coding techniques that address the needs of both machines and humans have become a key area of interest. Additionally, there is increasing demand for research applying the diffusion model, which can generate human-viewable images from a small amount of data, to image compression methods for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Utilizing the diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated based on bitrate and image quality, and we compare their compression performance with other scalable image coding methods for humans and machines.
摘要：图像压缩技术消除了冗余信息，以实现图像的有效传输和存储，为机器视觉和人类视觉感知提供。多年以来，精心研究了以人类感知为重点的图像编码，从而发展了各种图像压缩标准。另一方面，随着图像识别模型的快速进步，AI任务的图像压缩（称为机器的图像编码（ICM））具有重要的重视。因此，满足机器和人类需求的可扩展图像编码技术已成为关键领域。此外，应用扩散模型的研究需求增加，该模型可以从少量数据到图像压缩方法为人类视力产生可观看的人类图像。使用扩散模型的图像压缩方法可以通过用少量的调理信息引导生成过程来部分重建目标图像。受扩散模型潜力的启发，我们提出了一种使用引导扩散将机器视觉扩展到人类视觉感知的方法。利用ICM方法输出引导的扩散模型，我们从随机噪声中生成了人类感知的图像。引导的扩散充当机器视觉和人类视觉之间的桥梁，实现它们之间的过渡，而没有任何其他比特率开销。然后，生成的图像根据比特率和图像质量进行了评估，我们将其压缩性能与其他可扩展的图像编码方法进行比较。

Title: TransAnimate: Taming Layer Diffusion to Generate RGBA Video

Authors: Xuewei Chen, Zhimin Chen, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17934
Pdf URL: https://arxiv.org/pdf/2503.17934
Copy Paste: [[2503.17934]] TransAnimate: Taming Layer Diffusion to Generate RGBA Video(https://arxiv.org/abs/2503.17934)
Keywords: generation, generative
Abstract: Text-to-video generative models have made remarkable advancements in recent years. However, generating RGBA videos with alpha channels for transparency and visual effects remains a significant challenge due to the scarcity of suitable datasets and the complexity of adapting existing models for this purpose. To address these limitations, we present TransAnimate, an innovative framework that integrates RGBA image generation techniques with video generation modules, enabling the creation of dynamic and transparent videos. TransAnimate efficiently leverages pre-trained text-to-transparent image model weights and combines them with temporal models and controllability plugins trained on RGB videos, adapting them for controllable RGBA video generation tasks. Additionally, we introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling, offering precise and intuitive control for designing game effects. To further alleviate data scarcity, we have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos. Comprehensive experiments demonstrate that TransAnimate generates high-quality RGBA videos, establishing it as a practical and effective tool for applications in gaming and visual effects.
摘要：近年来，文本对视频生成模型取得了显着的进步。但是，由于合适的数据集缺乏，以及为此目的适应现有模型的复杂性，因此使用Alpha渠道生成带有Alpha通道的RGBA视频仍然是一个重大挑战。为了解决这些局限性，我们提出了Transanimate，这是一个创新的框架，将RGBA图像生成技术与视频生成模块集成在一起，从而创建动态和透明的视频。跨性别的有效利用预先训练的文本对透明图像模型权重，并将其与经过RGB视频训练的时间模型和可控插件相结合，将其调整为可控的RGBA视频生成任务。此外，我们引入了一种交互式运动引导的控制机制，该机制定义运动和颜色调整缩放，为设计游戏效果提供精确和直观的控制。为了进一步缓解数据稀缺性，我们开发了一种用于创建RGBA视频数据集的管道，其中包含了高质量的游戏效果视频，提取的前景对象和合成的透明视频。全面的实验表明，跨性别的RGBA视频会产生高质量的RGBA视频，并将其确立为在游戏和视觉效果中应用的实用和有效工具。

Title: Cross-Domain Underwater Image Enhancement Guided by No-Reference Image Quality Assessment: A Transfer Learning Approach

Authors: Zhi Zhang, Daoyi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.17937
Pdf URL: https://arxiv.org/pdf/2503.17937
Copy Paste: [[2503.17937]] Cross-Domain Underwater Image Enhancement Guided by No-Reference Image Quality Assessment: A Transfer Learning Approach(https://arxiv.org/abs/2503.17937)
Keywords: quality assessment
Abstract: Single underwater image enhancement (UIE) is a challenging ill-posed problem, and its development is hindered by two major issues: (1) The labels in underwater reference datasets are pseudo labels, so relying on these pseudo ground truths in supervised learning leads to domain discrepancy. (2) Underwater reference datasets are scarce, making training on such small datasets prone to overfitting and distribution shift. To address these challenges, we propose Trans-UIE, a transfer learning-based UIE model that captures the fundamental paradigms of UIE through pretraining and utilizes a dataset composed of both reference and non-reference datasets for fine-tuning. However, fine-tuning the model using only reconstruction loss may introduce confirmation bias. To mitigate this, our method leverages no-reference image quality assessment (NR-IQA) metrics from above-water scenes to guide the transfer learning process across domains while generating enhanced images with the style of the above-water image domain. Additionally, to reduce the risk of overfitting during the pretraining stage, we introduce Pearson correlation loss. Experimental results on both full-reference and no-reference underwater benchmark datasets demonstrate that Trans-UIE significantly outperforms state-of-the-art methods.
摘要：单个水下图像增强（UIE）是一个具有挑战性的问题，但其发展受到了两个主要问题的阻碍：（1）水下参考数据集中的标签是伪标签，依靠这些伪造的基础真理，在有监督的学习中，导致领域差异。（2）水下参考数据集稀缺，因此在如此小的小数据集上进行培训，容易拟合和分配转移。为了应对这些挑战，我们提出了Trans-UIE，这是一种基于转移学习的UIE模型，该模型通过预处理捕获UIE的基本范例，并利用由参考和非参考数据集组成的数据集进行微调。但是，仅使用重建损失对模型进行微调可能引入确认偏差。为了减轻这种情况，我们的方法利用了来自水上场景的无参考图像质量评估（NR-IQA）指标，以指导跨域的转移学习过程，同时以上述水图像域的风格生成增强的图像。此外，为了降低预训练阶段过度拟合的风险，我们引入了皮尔逊相关性损失。关于全参考和无参考水下基准数据集的实验结果表明，跨性别UIE显着胜过最先进的方法。
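
Note: The abstract above names a Pearson correlation loss but does not give its form. The following is a minimal sketch of one common formulation, loss = 1 - r, where r is the Pearson correlation between predictions and targets. The function names `pearson_correlation` and `pearson_loss`, the flat-list inputs, and the zero-variance fallback are assumptions made for illustration, not the paper's definition.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two standard-deviation terms (unnormalized).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat a constant input as uncorrelated.
        return 0.0
    return cov / (norm_x * norm_y)


def pearson_loss(pred, target):
    """Assumed loss form: 0 when perfectly correlated, 2 when anti-correlated."""
    return 1.0 - pearson_correlation(pred, target)


# Predictions that are a scaled copy of the target give a loss near zero.
print(pearson_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the correlation is invariant to scale and offset, a loss of this kind penalizes mismatched structure rather than absolute intensity differences, which is one plausible reason to pair it with a reconstruction loss.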

Title: PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

Authors: Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, Yunzhu Li
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.17973
Pdf URL: https://arxiv.org/pdf/2503.17973
Copy Paste: [[2503.17973]] PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos(https://arxiv.org/abs/2503.17973)
Keywords: generative
Abstract: Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning.
摘要：创建现实世界对象的物理数字双胞胎在机器人技术，内容创建和XR方面具有巨大的潜力。在本文中，我们介绍了Phystwin，这是一个新颖的框架，它使用互动中动态对象的稀疏视频来产生照片 - 和物理逼真的实时交互式虚拟复制品。我们的方法集中在两个关键组成部分上：（1）结合了弹簧质量模型的物理形式的表示，用于逼真的物理模拟，几何形状的生成形状模型和用于渲染的高斯夹心；（2）一种新型的多阶段，基于优化的逆建模框架，可重建完整的几何形状，侵入密集的物理特性，并从视频中复制现实的外观。我们的方法将逆物理框架与视觉感知提示集成在一起，甚至可以从部分，遮挡和有限的观点中实现高保真重建。 Phystwin支持建模各种可变形物体，包括绳索，毛绒动物，布和送货包。实验表明，在新型相互作用下，Phystwin在重建，渲染，未来预测和模拟中的表现优于相互竞争的方法。我们进一步证明了其在交互式实时仿真和基于模型的机器人运动计划中的应用。

Title: Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

Authors: Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18016
Pdf URL: https://arxiv.org/pdf/2503.18016
Copy Paste: [[2503.18016]] Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook(https://arxiv.org/abs/2503.18016)
Keywords: generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.
摘要：检索增强的一代（RAG）已成为人工智能（AI）的关键技术，尤其是通过启用外部，可靠和最新知识来源来增强大语言模型（LLMS）的能力。在AI生成的内容（AIGC）的背景下，通过通过补充，相关信息增强模型输出，从而提高了它们的质量，这证明了RAG的宝贵价值。最近，抹布的潜力扩大了自然语言处理，新兴方法将检索策略整合到计算机视觉（CV）域中。这些方法旨在通过合并权威的外部知识基础来解决仅依靠内部模型知识的局限性，从而提高视觉模型的理解和发电能力。这项调查对简历中检索提示技术的当前状态进行了全面综述，重点介绍了两个主要领域：（i）视觉理解和（ii）视觉生成。在视觉理解的领域，我们系统地审查了从基本图像识别到复杂应用程序（例如医疗报告生成和多模式问答）的任务。对于视觉内容生成，我们检查了抹布在与图像，视频和3D生成有关的任务中的应用。此外，我们还探索了抹布中的最新进步，用于体现的AI，特别着眼于计划，任务执行，多模式感知，互动和专业领域的应用程序。鉴于CV中检索提示技术的整合仍处于早期阶段，我们还强调了当前方法的关键局限性，并提出了未来的研究指示，以推动这一有前途的领域的发展。

Title: Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18065
Pdf URL: https://arxiv.org/pdf/2503.18065
Copy Paste: [[2503.18065]] Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation(https://arxiv.org/abs/2503.18065)
Keywords: generation
Abstract: Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at this https URL.
摘要：数据稀缺是视觉导航（VLN）领域的长期挑战，它极大地阻碍了代理对看不见的环境的概括。先前的工作主要依靠其他模拟器数据或网络收集的图像/视频来改善概括。但是，模拟器环境仍然面临着有限的多样性，并且网络收集的数据通常需要大量的劳动来消除噪音。在本文中，我们建议VLN的重写驱动的增强（RAM）范式，该范式直接通过重写人类宣传的培训数据来创建看不见的观察构造对。从我们的重写机制中受益，可以在无模拟器和避免劳动的方式中获得新的观察指导，以促进概括。具体而言，我们首先引入了富含对象的观察重写，在其中结合了视觉语言模型（VLM）和大语言模型（LLMS），以得出重写的对象增强的场景描述，从而通过文本到图像生成模型（T2IMS）将观察综合与多样化的对象和空间布局相结合。然后，我们建议重写观察对比指令，该指令通过要求LLMS来推理原始观察和新观察之间的差异来生成与观察一致的重写指令。我们进一步开发了一种与随机观察作物方案一起进行混合，然后将其关注的培训策略有效地增强了数据分布多样性，同时抑制了训练期间的增强数据噪声。在离散环境（R2R，Reverie和R4R数据集）和连续环境（R2R-CE数据集）上进行了实验，均显示了我们方法的卓越性能和令人印象深刻的概括能力。代码可在此HTTPS URL上找到。

Title: Vehicular Road Crack Detection with Deep Learning: A New Online Benchmark for Comprehensive Evaluation of Existing Algorithms

Authors: Nachuan Ma, Zhengfei Song, Qiang Hu, Chuang-Wei Liu, Yu Han, Yanting Zhang, Rui Fan, Lihua Xie
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18082
Pdf URL: https://arxiv.org/pdf/2503.18082
Copy Paste: [[2503.18082]] Vehicular Road Crack Detection with Deep Learning: A New Online Benchmark for Comprehensive Evaluation of Existing Algorithms(https://arxiv.org/abs/2503.18082)
Keywords: generation
Abstract: In the emerging field of urban digital twins (UDTs), advancing intelligent road inspection (IRI) vehicles with automatic road crack detection systems is essential for maintaining civil infrastructure. Over the past decade, deep learning-based road crack detection methods have been developed to detect cracks more efficiently, accurately, and objectively, with the goal of replacing manual visual inspection. Nonetheless, there is a lack of systematic reviews on state-of-the-art (SoTA) deep learning techniques, especially data-fusion and label-efficient algorithms for this task. This paper thoroughly reviews the SoTA deep learning-based algorithms, including (1) supervised, (2) unsupervised, (3) semi-supervised, and (4) weakly-supervised methods developed for road crack detection. Also, we create a dataset called UDTIRI-Crack, comprising 2,500 high-quality images from seven public annotated sources, as the first extensive online benchmark in this field. Comprehensive experiments are conducted to compare the detection performance, computational efficiency, and generalizability of public SoTA deep learning-based algorithms for road crack detection. In addition, the feasibility of foundation models and large language models (LLMs) for road crack detection is explored. Afterwards, the existing challenges and future development trends of deep learning-based road crack detection algorithms are discussed. We believe this review can serve as practical guidance for developing intelligent road detection vehicles with the next-generation road condition assessment systems. The released benchmark UDTIRI-Crack is available at this https URL.
摘要：在城市数字双胞胎（UDTS）的新兴领域中，具有自动道路裂纹检测系统的智能道路检查（IRI）车辆对于维持民用基础设施至关重要。在过去的十年中，已经开发了基于深度学习的道路裂纹检测方法，以更有效，准确和客观地检测裂纹，以更换手动视觉检查。尽管如此，对于此任务的最先进（SOTA）深度学习技术（尤其是数据融合和标签有效算法），缺乏系统评价。本文彻底回顾了基于SOTA的深度学习算法，包括（1）监督，（2）无监督的，（3）半监督和（4）开发用于道路裂纹检测的弱监督方法。此外，我们创建了一个名为udtiri-crack的数据集，其中包括来自七个公共注释来源的2,500美元的高质量图像，这是该领域的第一个广泛的在线基准。进行了全面的实验，以比较公共SOTA深度学习算法对道路裂纹检测的检测性能，计算效率和概括性。此外，还探索了基础模型和大型语言模型（LLM）进行道路裂纹检测的可行性。之后，讨论了基于深度学习的道路裂纹检测算法的现有挑战和未来发展趋势。我们认为，这项审查可以作为使用下一代道路条件评估系统开发智能道路检测工具的实用指导。已发布的基准Udtiri-Crack可在此HTTPS URL上找到。

Title: Unified Geometry and Color Compression Framework for Point Clouds via Generative Diffusion Priors

Authors: Tianxin Huang, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18083
Pdf URL: https://arxiv.org/pdf/2503.18083
Copy Paste: [[2503.18083]] Unified Geometry and Color Compression Framework for Point Clouds via Generative Diffusion Priors(https://arxiv.org/abs/2503.18083)
Keywords: generative
Abstract: With the growth of 3D applications and the rapid increase in sensor-collected 3D point cloud data, there is a rising demand for efficient compression algorithms. Most existing learning-based compression methods handle geometry and color attributes separately, treating them as distinct tasks, making these methods challenging to apply directly to point clouds with colors. Besides, the limited capacities of training datasets also limit their generalizability across points with different distributions. In this work, we introduce a test-time unified geometry and color compression framework of 3D point clouds. Instead of training a compression model based on specific datasets, we adapt a pre-trained generative diffusion model to compress original colored point clouds into sparse sets, termed 'seeds', using prompt tuning. Decompression is then achieved through multiple denoising steps with separate sampling processes. Experiments on objects and indoor scenes demonstrate that our method has superior performances compared to existing baselines for the compression of geometry and color.
摘要：随着3D应用的增长以及传感器收集的3D点云数据的快速增加，对有效压缩算法的需求不断增加。大多数基于学习的压缩方法分别处理几何和颜色属性，将其视为不同的任务，使这些方法具有挑战性，直接应用于带有颜色的点云。此外，培训数据集的有限能力还限制了其在具有不同分布的点之间的普遍性。在这项工作中，我们介绍了3D点云的测试时间统一几何形状和颜色压缩框架。我们没有根据特定数据集训练基于特定数据集的压缩模型，而是使用迅速调整将预训练的生成扩散模型压缩为稀疏集，称为“种子”。然后，通过通过单独的采样过程进行多个降解步骤来实现解压缩。对物体和室内场景的实验表明，与现有基线相比，我们的方法具有较高的性能，以压缩几何和颜色。

Title: Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization

Authors: Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18130
Pdf URL: https://arxiv.org/pdf/2503.18130
Copy Paste: [[2503.18130]] Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization(https://arxiv.org/abs/2503.18130)
Keywords: generation
Abstract: Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. However, current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define behavior policy as the next token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.
摘要：从人类反馈（RLHF）中学习的强化是将大语模型（LLMS）与人类价值观保持一致的有效方法。但是，奖励过度优化仍然是一个公开挑战，导致奖励模型下LLM的表现与真正的人类目标之间的差异。奖励过度优化的主要因素是当奖励模型评估分布（OOD）响应（OOD）响应时会产生的外推误差。但是，当前的方法仍然无法防止在增强学习过程（RL）过程中OOD响应产生的频率增加，并且在处理OOD响应中的外推错误方面无效。在这项工作中，我们提出了行为支持的政策优化（BSPO）方法，以减轻奖励过度优化问题。具体而言，我们将行为策略定义为奖励培训数据集的下一个令牌分布，以模拟奖励模型的分布（ID）区域。在此基础上，我们介绍了行为支持的Bellman操作员，以使价值函数正规化，对所有OOD值进行惩罚而不会影响ID。因此，BSPO在RL过程中减少了OOD响应的产生，从而避免了奖励模型的外推错误引起的高估。从理论上讲，我们证明BSPO保证支持政策的单调改进，直到融合最佳行为支持政策为止。广泛实验的经验结果表明，由于OOD评估并找到最佳的ID策略，BSPO在防止奖励过度优化方面优于基准。

Title: An Image-like Diffusion Method for Human-Object Interaction Detection

Authors: Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18134
Pdf URL: https://arxiv.org/pdf/2503.18134
Copy Paste: [[2503.18134]] An Image-like Diffusion Method for Human-Object Interaction Detection(https://arxiv.org/abs/2503.18134)
Keywords: generation
Abstract: Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.
摘要：人类对象的相互作用（HOI）检测通常会面临高水平的歧义和不确定性，因为在不同的人类对象对中，相同的相互作用可能看起来很大。此外，诸如遮挡和混乱的背景等问题可以进一步加剧不确定性。为了处理这项挑战的任务，在这项工作中，我们从一个关键的观察开始：每个人类对象对的HOI检测输出可以作为图像重新铸造。因此，受图像扩散模型的强大图像产生能力的启发，我们提出了一个新框架Hoi-Idiff。在Hoi-Idiff中，我们使用图像样扩散过程从新的角度来处理HOI检测，以生成HOI检测输出作为图像。此外，认识到我们的重铸图像在某些属性上与自然图像有所不同，我们通过定制的HOI扩散过程和切片斑点模型体系结构来增强我们的框架，这些模型是专门针对生成我们的重铸``hoi Images''的。广泛的实验证明了我们框架的功效。

Title: TCFG: Tangential Damping Classifier-free Guidance

Authors: Mingi Kwon, Shin seong Kim, Jaeseok Jeong, Yi Ting Hsiao, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18137
Pdf URL: https://arxiv.org/pdf/2503.18137
Copy Paste: [[2503.18137]] TCFG: Tangential Damping Classifier-free Guidance(https://arxiv.org/abs/2503.18137)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from $x_t$ to $x_{t-1}$, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
摘要：扩散模型在文本到图像合成方面取得了巨大的成功，这在很大程度上归因于使用无分类器指导（CFG），该指导能够实现高质量的，条件对齐的图像生成。 CFG将条件分数（例如，文本条件）与无条件分数结合在一起，以控制输出。但是，无条件得分负责估计相邻时间段的流形从$ x_t $到$ x_ {t-1} $之间的过渡，后者可能会无意中干扰对特定条件的轨迹。在这项工作中，我们介绍了一种新颖的方法，该方法利用无条件得分的几何视角在有条件得分的情况下提高了CFG性能。具体而言，我们提出了一种使用单数值分解来过滤条件分数和无条件得分的奇异向量的方法。这种过滤过程将无条件得分与条件分数保持一致，从而完善了采样轨迹以保持更靠近歧管。我们的方法通过可忽略的其他计算来提高图像质量。我们在扩散模型中对得分函数行为提供了更深入的见解，并提出了一种实用技术，以实现更准确和上下文相干的图像综合。
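
Note: The abstract above describes classifier-free guidance (CFG) as combining a conditional score with an unconditional score. The following is a minimal sketch of the standard CFG combination only, guided = uncond + scale * (cond - uncond). It does not implement the paper's singular-value-decomposition filtering. The function name `cfg_combine` and the use of plain Python lists in place of model score tensors are assumptions made for illustration.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance on two score vectors.

    Moves from the unconditional score toward the conditional one;
    scale = 1 returns the conditional score, scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy scores standing in for model outputs at one timestep (hypothetical values).
uncond_score = [0.0, 1.0]
cond_score = [1.0, 3.0]
print(cfg_combine(uncond_score, cond_score, 2.0))  # [2.0, 5.0]
```

In this form the unconditional score enters both as the base point and inside the guidance direction, which is why changes to how it is estimated or filtered affect the whole sampling trajectory.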

Title: AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs

Authors: Diwei Wang, Cédric Bobenrieth, Hyewon Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18141
Pdf URL: https://arxiv.org/pdf/2503.18141
Copy Paste: [[2503.18141]] AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs(https://arxiv.org/abs/2503.18141)
Keywords: generation, generative
Abstract: Assessing gait impairment plays an important role in early diagnosis, disease monitoring, and treatment evaluation for neurodegenerative diseases. Despite its widespread use in clinical practice, it is limited by subjectivity and a lack of precision. While recent deep learning-based approaches have consistently improved classification accuracies, they often lack interpretability, hindering their utility in clinical decision-making. To overcome these challenges, we introduce AGIR, a novel pipeline consisting of a pre-trained VQ-VAE motion tokenizer and a subsequent Large Language Model (LLM) fine-tuned over pairs of motion tokens and Chain-of-Thought (CoT) reasonings. To fine-tune an LLM for pathological gait analysis, we first introduce a multimodal dataset by adding rationales dedicated to MDS-UPDRS gait score assessment to an existing PD gait dataset. We then introduce a two-stage supervised fine-tuning (SFT) strategy to enhance the LLM's motion comprehension with pathology-specific knowledge. This strategy includes: 1) a generative stage that aligns gait motions with analytic descriptions through bidirectional motion-description generation, 2) a reasoning stage that integrates logical Chain-of-Thought (CoT) reasoning for impairment assessment with UPDRS gait score. Validation on an existing dataset and comparisons with state-of-the-art methods confirm the robustness and accuracy of our pipeline, demonstrating its ability to assign gait impairment scores from motion input with clinically meaningful rationales.
摘要：评估步态障碍在早期诊断，疾病监测和神经退行性疾病的治疗评估中起着重要作用。尽管它在临床实践中广泛使用，但它受到主观性和缺乏精度的限制。尽管最近基于深度学习的方法始终提高了分类精度，但它们通常缺乏可解释性，阻碍了他们在临床决策中的实用性。为了克服这些挑战，我们引入了AGIR，这是一种由预先训练的VQ-VAE运动令牌仪组成的新型管道，以及随后对成对的运动令牌和Theark链（COT）推理进行微调的大型语言模型（LLM）。为了微调LLM进行病理步态分析，我们首先通过将专用于MDS-UPDRS步态得分评估评估的理由添加到现有的PD步态数据集中来引入多模式数据集。然后，我们引入了两阶段监督的微调（SFT）策略，以通过特定于病理学知识来增强LLM的运动理解。该策略包括：1）通过双向运动描述生成使步态运动与分析描述保持一致的生成阶段，2）一个推理阶段，将逻辑链链（COT）推理与UPDRS步态评分相结合。对现有数据集的验证以及与最先进方法的比较证实了管道的鲁棒性和准确性，证明了其从运动输入中分配步态障碍分数具有临床意义的理由的能力。

Title: LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space

Authors: Zhangyu Wang, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Zeping Liu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, Gengchen Mai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18142
Pdf URL: https://arxiv.org/pdf/2503.18142
Copy Paste: [[2503.18142]] LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space(https://arxiv.org/abs/2503.18142)
Keywords: generative
Abstract: Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. Existing methods approach it either via grid-based classification or via image retrieval. Their performance significantly suffers when the spatial distribution of test images does not align with such choices. To address these limitations, we propose to leverage diffusion as a mechanism for image geolocalization. To avoid the problematic manifold reprojection step in diffusion, we developed a novel spherical positional encoding-decoding framework, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking. We call this type of position encoding Spherical Harmonics Dirac Delta (SHDD) Representation. We also propose a novel SirenNet-based architecture called CS-UNet to learn the conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. We train a conditional latent diffusion model called LocDiffusion that generates geolocations under the guidance of images -- to the best of our knowledge, the first generative model for image geolocalization by diffusing geolocation information in a hidden location embedding space. We evaluate our method against SOTA image geolocalization baselines. LocDiffusion achieves competitive geolocalization performance and demonstrates significantly stronger generalizability to unseen geolocations.
摘要：图像地理定位是一项基本而又具有挑战性的任务，旨在推断拍摄图像的地球上的地理位置。现有方法通过基于网格的分类或图像检索对其进行处理。当测试图像的空间分布与此类选择不一致时，它们的性能就会显着受到影响。为了解决这些局限性，我们建议利用扩散作为图像地理定位的机制。为了避免在扩散中的有问题的歧管再投影步骤，我们开发了一个新型的球形位置编码框架，该框架编码在球形表面上的点（例如，地球上的地理分解）通过模式通过模式来编码球形谐音系数和解码点（地理位置）的希尔伯特空间。我们称这种类型的位置编码球形谐波Dirac Delta（SHDD）表示。我们还提出了一种新型的基于Sirennet的建筑，称为CS-Unet，以最大程度地减少潜在的KL-Divergence损失，以学习潜在SHDD空间中的有条件向后过程。我们训练一个条件潜在扩散模型，称为LocDiffusion，该模型在图像的指导下生成地理位置 - 据我们所知，这是图像地理位置化的第一个生成模型，通过在隐藏的位置嵌入空间中扩散地理位置信息。我们对SOTA图像地理定位基线的方法评估了我们的方法。 LocDiffusion实现了竞争性地理定位性能，并表现出对看不见的地理位置的普遍性明显更强。

Title: LongDiff: Training-Free Long Video Generation in One Go

Authors: Zhuoling Li, Hossein Rahmani, Qiuhong Ke, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18150
Pdf URL: https://arxiv.org/pdf/2503.18150
Copy Paste: [[2503.18150]] LongDiff: Training-Free Long Video Generation in One Go(https://arxiv.org/abs/2503.18150)
Keywords: generation
Abstract: Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of two carefully designed components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.
摘要：视频扩散模型最近在视频生成中取得了显着的结果。尽管表现令人鼓舞，但这些模型中的大多数主要是为短视频生成而设计和培训的，这导致了长期视频中保持时间一致性和视觉细节的挑战。在本文中，我们提出了Longdiff，这是一种新型的无培训方法，该方法包括精心设计的组件\ - 位置映射（PM）和信息框架选择（IFS）\ - 以应对两种关键挑战，以阻碍短期视频产生的概括：时间位置的歧义和信息稀释。我们的Longdiff释放了现成的视频扩散模型的潜力，可以一次实现高质量的长视频。广泛的实验证明了我们方法的功效。

Title: Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes

Authors: Kelly O. Marshall, Omid Poursaeed, Sergiu Oprea, Amit Kumar, Anushrut Jignasu, Chinmay Hegde, Yilei Li, Rakesh Ranjan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18155
Pdf URL: https://arxiv.org/pdf/2503.18155
Copy Paste: [[2503.18155]] Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes(https://arxiv.org/abs/2503.18155)
Keywords: generation
Abstract: 3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should be able to not only generate plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, only allowing text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method Decorum enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This enables us to harness recent advancements in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that using a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the benchmark 3D-FRONT dataset show that our methods achieve improvements over existing work in text-conditioned scene synthesis and object retrieval.
摘要：3D室内场景生成是数字和现实环境设计的重要问题。为了自动化此过程，场景生成模型不仅应该能够生成合理的场景布局，还应考虑视觉特征和样式偏好。此任务的现有方法对这些属性的控制非常有限，仅允许以简单对象级描述或成对空间关系形式的文本输入。我们提出的方法的礼节使用户可以通过在每个阶段采用基于语言的表示，以自然语言来控制场景的生成过程。这使我们能够利用大语言模型（LLM）的最新进步来对语言到语言映射进行建模。此外，我们表明，使用基于文本的表示，我们可以使用基于多模式LLM的新型对象检索方法为场景选择家具。基准3D-Front数据集的评估表明，我们的方法对文本条件的场景合成和对象检索的现有工作有了改进。

Title: DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation

Authors: Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian
Subjects: cs.CV, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2503.18159
Pdf URL: https://arxiv.org/pdf/2503.18159
Copy Paste: [[2503.18159]] DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation(https://arxiv.org/abs/2503.18159)
Keywords: generation
Abstract: Real-time speech-driven 3D facial animation has attracted interest in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches have begun to consider the nondeterministic nature of speech-driven 3D face animation and employ diffusion models for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles that convey accurate lip language are still lacking, and efficiency and compactness still need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: this https URL.
摘要：实时语音驱动的3D面部动画在学术界和行业中很有吸引力。传统方法主要集中于学习从语音到动画的确定性映射。最近的方法开始考虑语音驱动的3D面动画的无确定性事实，并采用扩散模型来完成任务。现有的基于扩散的方法可以改善面部动画的多样性。但是，除了效率和紧凑性外，仍然缺乏传达准确唇语的个性化语言样式。在这项工作中，我们建议扩散词法器通过个性化的蒸馏来解决上述限制。在个性化方面，我们介绍了一种对比性个性化合物，该人学习身份和情感嵌入，以从音频中捕获说话风格。我们在蒸馏过程中进一步提出个性化增强剂，以增强嵌入对面部动画的影响。为了提高效率，我们使用迭代蒸馏来减少动画生成所需的步骤，并在推理中实现超过8倍的速度。为了达到紧凑，我们将大型教师模型提炼成一个较小的学生模型，将模型的存储量减少了86.4 \％，同时最大程度地减少了绩效损失。蒸馏后，用户可以从音频中得出自己的身份和情感嵌入，以快速创建反映特定语言风格的个性化动画。进行了广泛的实验，以证明我们的方法表现优于最先进的方法。该代码将在以下位置发布：此HTTPS URL。

Title: Self-Attention Diffusion Models for Zero-Shot Biomedical Image Segmentation: Unlocking New Frontiers in Medical Imaging

Authors: Abderrachid Hamrani, Anuradha Godavarty
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18170
Pdf URL: https://arxiv.org/pdf/2503.18170
Copy Paste: [[2503.18170]] Self-Attention Diffusion Models for Zero-Shot Biomedical Image Segmentation: Unlocking New Frontiers in Medical Imaging(https://arxiv.org/abs/2503.18170)
Keywords: generative
Abstract: Producing high-quality segmentation masks for medical images is a fundamental challenge in biomedical image analysis. Recent research has explored large-scale supervised training to enable segmentation across various medical imaging modalities and unsupervised training to facilitate segmentation without dense annotations. However, constructing a model capable of segmenting diverse medical images in a zero-shot manner without any annotations remains a significant hurdle. This paper introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel approach that leverages self-attention diffusion models for zero-shot biomedical image segmentation. ADZUS harnesses the intrinsic capabilities of pre-trained diffusion models, utilizing their generative and discriminative potentials to segment medical images without requiring annotated training data or prior domain-specific knowledge. The paper details the ADZUS architecture, highlighting its integration of self-attention mechanisms that facilitate context-aware and detail-sensitive segmentation. Experimental results across various medical imaging datasets, including skin lesion segmentation, chest X-ray infection segmentation, and white blood cell segmentation, reveal that ADZUS achieves state-of-the-art performance. Notably, ADZUS reached Dice scores ranging from 88.7% to 92.9% and IoU scores from 66.3% to 93.3% across different segmentation tasks, demonstrating significant improvements in handling novel, unseen medical imagery. While ADZUS is highly effective, it demands substantial computational resources and extended processing times. The model's efficacy in zero-shot settings underscores its potential to reduce reliance on costly annotations and seamlessly adapt to new medical imaging tasks, thereby expanding the diagnostic capabilities of AI-driven medical imaging technologies.
摘要：在生物医学图像分析中，生产用于医学图像的高质量分割面罩是一个基本挑战。最近的研究探索了大规模的监督培训，以使各种医学成像方式和无监督培训能够分割，以促进分割而无需致密注释。但是，构建能够以零拍的方式分割多种医学图像的模型而无需任何注释仍然是一个重大障碍。本文介绍了注意力扩散零射击无监督的系统（ADZUS），这是一种新型方法，利用自我发作扩散模型进行零拍生物医学图像分割。 Adzus利用预先训练的扩散模型的固有功能，利用其生成和判别潜力来分割医疗图像，而无需带注释的培训数据或先前的领域特定知识。详细介绍了ADZUS架构，并集成了自我发挥的机制，这些机制有助于强调上下文感知和细节敏感的细分。各种医学成像数据集的实验结果，包括皮肤病变分割，胸部X射线感染分割和白细胞分割，表明Adzus可以实现最新的性能。值得注意的是，在不同分割任务中，Adzus达到了从88.7 \％到92.9 \％的骰子分数，从66.3 \％到93.3 \％的IOU分数从66.3 \％到93.3 \％。值得注意的是，尽管Adzus具有很高的有效性，但它需要大量的计算资源和扩展的处理时间。该模型在零摄影设置中的功效强调了其减少对昂贵注释的依赖并无缝适应新的医学成像任务的潜力，从而扩大了AI驱动的医学成像技术的诊断能力。
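
The Dice and IoU scores quoted in the ADZUS abstract are overlap metrics between a predicted mask and a reference mask. The following is a minimal sketch of how the two are commonly computed for binary masks; the function name `dice_and_iou` and the flat 0/1 list representation are assumptions made for illustration and are not taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 6-pixel masks that overlap in two pixels.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f} IoU={iou:.3f}")
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice), so Dice is never smaller than IoU on the same prediction. The ranges in the abstract are taken across different tasks, which is why they need not line up pairwise.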

Title: A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games

Authors: Shubhankar Agarwal, Hamzah I. Khan, Sandeep P. Chinchali, David Fridovich-Keil
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2503.18224
Pdf URL: https://arxiv.org/pdf/2503.18224
Copy Paste: [[2503.18224]] A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games(https://arxiv.org/abs/2503.18224)
Keywords: generative
Abstract: Saddle point optimization is a critical problem employed in numerous real-world applications, including portfolio optimization, generative adversarial networks, and robotics. It has been extensively studied in cases where the objective function is known and differentiable. Existing work in black-box settings with unknown objectives that can only be sampled either assumes convexity-concavity in the objective to simplify the problem or operates with noisy gradient estimators. In contrast, we introduce a framework inspired by Bayesian optimization which utilizes Gaussian processes to model the unknown (potentially nonconvex-nonconcave) objective and requires only zeroth-order samples. Our approach frames the saddle point optimization problem as a two-level process which can flexibly integrate existing and novel approaches to this problem. The upper level of our framework produces a model of the objective function by sampling in promising locations, and the lower level of our framework uses the existing model to frame and solve a general-sum game to identify locations to sample. This lower level procedure can be designed in complementary ways, and we demonstrate the flexibility of our approach by introducing variants which appropriately trade off between factors like runtime, the cost of function evaluations, and the number of available initial samples. We experimentally demonstrate these algorithms on synthetic and realistic datasets in black-box nonconvex-nonconcave settings, showcasing their ability to efficiently locate local saddle points in these contexts.
摘要：鞍点优化是许多现实世界应用中使用的关键问题，包括投资组合优化，生成对抗网络和机器人技术。在已知和可区分目标函数的情况下，它已经进行了广泛的研究。在黑框设置中具有未知目标的现有工作，只能采样，要么假设凸率concovity在简化问题或使用嘈杂的梯度估计器中进行操作。相比之下，我们引入了一个受贝叶斯优化启发的框架，该框架利用高斯工艺对未知（潜在的非convex-nonconcave）目标进行建模，并且仅需要零级样本。我们的方法将马鞍点优化问题构成一个两级过程，可以灵活地整合现有和新颖的方法来解决此问题。我们框架的上层通过在有希望的位置进行采样来产生目标函数的模型，而我们的框架的下层使用现有模型来构架和求解通用游戏，以识别用于采样的位置。该较低级别的过程可以以互补的方式设计，我们通过引入适当在运行时，功能评估成本和可用初始样本数量的因素之间进行适当权衡的变体来证明我们的方法的灵活性。我们在Black-Box NonConvex-Nonconcave设置中实验证明了这些算法在合成和逼真的数据集上，展示了它们在这些上下文中有效定位本地鞍点的能力。

Title: Decoupling Angles and Strength in Low-rank Adaptation

Authors: Massimo Bini, Leander Girrbach, Zeynep Akata
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18225
Pdf URL: https://arxiv.org/pdf/2503.18225
Copy Paste: [[2503.18225]] Decoupling Angles and Strength in Low-rank Adaptation(https://arxiv.org/abs/2503.18225)
Keywords: generation
Abstract: Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at this https URL.
摘要：由于大规模预处理的模型的广泛可用性，参数有效的芬太尼（PEFT）方法最近获得了显着普及。这些方法允许快速适应以最低的计算成本的下游任务。然而，在超参数选择或扩展训练方案方面，诸如洛拉（Lora）等流行的固定方法表现出有限的鲁棒性，从而阻止了最佳的开箱即用性能。相比之下，诸如以太之类的有界方法具有更大的鲁棒性，但仅限于极低的适应性和固定强度转换，从而降低了它们的适应性表达能力。在这项工作中，我们提出了脱钩的低级适应性（DELORA），这是一种新型的固定方法，可以归一化和尺度可学习的低级矩阵。通过界定转换的距离，Delora有效地将角度学习与适应强度相关，从而增强了稳健性而不会损害性能。通过评估主题驱动的图像产生，自然语言理解和教学调整，我们表明Delora匹配或超过了竞争性PEFT方法的性能，同时表现出更强的鲁棒性。代码可在此HTTPS URL上找到。
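
One way to read "normalizes and scales learnable low-rank matrices" in the DeLoRA abstract is sketched below: each rank-one component of the update is divided by the norms of its two factors, and the sum is multiplied by a single strength scalar. This is an illustrative assumption rather than the authors' exact parameterization; the names `bounded_low_rank_update` and `strength` are hypothetical.

```python
import math
import random


def bounded_low_rank_update(B, A, strength):
    """Return delta_W = (strength / r) * sum_i b_i a_i^T / (|b_i| * |a_i|).

    B holds r vectors of length d_out, A holds r vectors of length d_in.
    Assumes no factor is the zero vector.
    """
    r = len(B)
    d_out, d_in = len(B[0]), len(A[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for b, a in zip(B, A):
        norm = math.sqrt(sum(x * x for x in b)) * math.sqrt(sum(x * x for x in a))
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += strength / r * b[i] * a[j] / norm
    return delta


def frobenius_norm(M):
    return math.sqrt(sum(x * x for row in M for x in row))


random.seed(0)
r, d_out, d_in = 4, 6, 5
B = [[random.gauss(0, 3) for _ in range(d_out)] for _ in range(r)]
A = [[random.gauss(0, 3) for _ in range(d_in)] for _ in range(r)]

delta = bounded_low_rank_update(B, A, strength=2.0)
# Each normalized rank-one term has Frobenius norm 1, so the average of r
# such terms times `strength` cannot exceed `strength`.
print(frobenius_norm(delta) <= 2.0 + 1e-9)

# Rescaling the factors changes nothing: only their directions matter.
B_scaled = [[100.0 * x for x in b] for b in B]
delta_scaled = bounded_low_rank_update(B_scaled, A, strength=2.0)
```

In this sketch the magnitude of the update is governed by `strength` alone, while the factors contribute only directions, which is the sense in which angular learning and adaptation strength are decoupled.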

Title: DiffGED: Computing Graph Edit Distance via Diffusion-based Graph Matching

Authors: Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang, Ying Zhang, Xuemin Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18245
Pdf URL: https://arxiv.org/pdf/2503.18245
Copy Paste: [[2503.18245]] DiffGED: Computing Graph Edit Distance via Diffusion-based Graph Matching(https://arxiv.org/abs/2503.18245)
Keywords: generative
Abstract: The Graph Edit Distance (GED) problem, which aims to compute the minimum number of edit operations required to transform one graph into another, is a fundamental challenge in graph analysis with wide-ranging applications. However, due to its NP-hard nature, traditional A* approaches often suffer from scalability issues, making them computationally intractable for large graphs. Many recent deep learning frameworks address GED by formulating it as a regression task, which, while efficient, fails to recover the edit path, a central interest in GED. Furthermore, recent hybrid approaches that combine deep learning with traditional methods to recover the edit path often yield poor solution quality. These methods also struggle to generate candidate solutions in parallel, resulting in increased running time. In this paper, we present a novel approach, DiffGED, that leverages a generative diffusion model to solve GED and recover the corresponding edit path. Specifically, we first generate multiple diverse node matching matrices in parallel through a diffusion-based graph matching model. Next, node mappings are extracted from each generated matching matrix in parallel, and each extracted node mapping can be simply transformed into an edit path. Benefiting from the generative diversity provided by the diffusion model, DiffGED is less likely to fall into local sub-optimal solutions, thereby achieving superior overall solution quality close to the exact solution. Experimental results on real-world datasets demonstrate that DiffGED can generate multiple diverse edit paths with exceptionally high accuracy comparable to exact solutions while maintaining a running time shorter than most hybrid approaches.
摘要：图形编辑距离（GED）问题旨在计算将一个图形转换为另一个图形所需的最小编辑操作数量，这是具有广泛应用程序的图形分析中的基本挑战。然而，由于其NP坚硬的性质，传统的A*方法通常遭受可伸缩性问题的困扰，从而使它们在大图上棘手。许多最近的深度学习框架通过将其制定为回归任务来解决GED，尽管它有效地却无法恢复编辑路径，这是GED的核心利益。此外，最近的混合方法将深度学习与传统方法相结合以恢复编辑路径通常会产生较差的解决方案质量。这些方法还难以并行生成候选解决方案，从而增加了本文运行的HTTP URL，我们提出了一种新颖的方法，即扩散，该方法利用生成扩散模型来解决GED并恢复相应的编辑路径。具体而言，我们首先通过基于扩散的图形匹配模型并行生成多种不同的节点匹配矩阵。接下来，从每个生成的匹配矩阵并行提取节点映射，每个提取的节点映射可以简单地转换为编辑路径。从扩散模型提供的生成多样性中受益，扩散的可能性较小，不太可能属于局部亚最佳解决方案，从而实现接近精确溶液的优越总体溶液质量。现实世界数据集的实验结果表明，扩散的可以生成多种不同的编辑路径，其精度与精确解决方案的精度非常高，同时保持比大多数混合方法短的运行时间。
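
The DiffGED abstract states that each extracted node mapping "can be simply transformed into an edit path". A minimal sketch of that step follows, under assumptions that are not from the paper: unit cost per operation, undirected graphs with node labels only, and the first graph no larger than the second. The function name `edit_cost_from_mapping` and the toy graphs are hypothetical.

```python
def edit_cost_from_mapping(labels1, edges1, labels2, edges2, mapping):
    """Number of edit operations induced by an injective node mapping G1 -> G2.

    labels1/labels2: lists of node labels.
    edges1/edges2: sets of frozenset({u, v}) undirected edges.
    mapping[u]: the G2 node matched to G1 node u.
    """
    if len(mapping) != len(labels1) or len(set(mapping)) != len(mapping):
        raise ValueError("mapping must assign a distinct G2 node to every G1 node")
    node_substitutions = sum(
        labels1[u] != labels2[mapping[u]] for u in range(len(labels1))
    )
    node_insertions = len(labels2) - len(labels1)
    mapped_edges = {frozenset(mapping[u] for u in edge) for edge in edges1}
    edge_deletions = len(mapped_edges - edges2)
    edge_insertions = len(edges2 - mapped_edges)
    return node_substitutions + node_insertions + edge_deletions + edge_insertions


# Toy graphs: a 3-node path and a 4-node path with permuted node ids.
labels1 = ["C", "C", "O"]
edges1 = {frozenset({0, 1}), frozenset({1, 2})}
labels2 = ["C", "O", "C", "N"]
edges2 = {frozenset({0, 2}), frozenset({2, 1}), frozenset({1, 3})}

# Several candidate mappings stand in for the sampled matchings; keep the cheapest.
candidates = [[0, 1, 2], [0, 2, 1]]
costs = [edit_cost_from_mapping(labels1, edges1, labels2, edges2, m) for m in candidates]
best_cost = min(costs)
print(costs, best_cost)
```

Any valid mapping yields an upper bound on the true GED, so evaluating many diverse candidates and keeping the minimum can only tighten the estimate.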

Title: CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

Authors: Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, Vikash Sehwag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18286
Pdf URL: https://arxiv.org/pdf/2503.18286
Copy Paste: [[2503.18286]] CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI(https://arxiv.org/abs/2503.18286)
Keywords: generative
Abstract: With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at this https URL.
摘要：随着生成AI的快速发展，现在可以在几秒钟内合成高质量的图像。尽管这些技术具有力量，但它们还是引起了人们对滥用的重大关注。当前区分真实图像和AI生成的图像的努力可能缺乏概括，仅对某些类型的生成模型有效，并且容易受到JPEG压缩等后处理技术的影响。为了克服这些局限性，我们提出了一个新颖的框架，即co-spy，该框架首先增强了现有的语义特征（例如，手指的手指数量）和伪影特征（例如，像素值差异），然后自适应地集成它们以实现更一般和可靠的合成图像检测。此外，我们创建了Co-Spy Bench，这是一个综合数据集，其中包括5个真实图像数据集和22个最先进的生成模型，包括最新的型号，例如Flux。我们还从Internet中收集了50k合成图像，以更实用的环境启用评估。我们广泛的评估表明，我们的探测器在相同的培训条件下优于现有方法，实现了约11％至34％的平均准确性提高。该代码可在此HTTPS URL上找到。

Title: Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module

Authors: Yishen Liu, Shengda Liu, Hudan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18297
Pdf URL: https://arxiv.org/pdf/2503.18297
Copy Paste: [[2503.18297]] Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module(https://arxiv.org/abs/2503.18297)
Keywords: generation
Abstract: Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. In this paper, we propose the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to catch and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, and even surpasses pre-trained large language models on some metrics.
摘要：医疗报告的生成需要专业的专业知识，即一般大型模型通常无法准确捕获。此外，医疗数据中的固有重复和相似性使模型很难提取有意义的功能，从而导致过度合适的趋势。因此，在本文中，我们提出了一个多模型模型，共同注意三重-LSTM网络（CA-TRINET），这是一个将变压器体系结构与多LSTM网络相结合的深度学习模型。它的共同发音模块协同将视觉变压器与文本变压器联系起来，以更好地区分具有相似性的医学图像，并由自适应重量操作员增强，以捕获和放大具有较小相似性的图像标签。此外，其Triple-LSTM模块使用目标图像对象优化了生成的句子。对三个公共数据集进行的广泛评估表明，Ca-Trinet在全面的能力方面，甚至在某些指标上的预先训练的大语言模型都优于最先进的模型。

Title: Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models

Authors: Jianlong Jin, Chenglong Zhao, Ruixin Zhang, Sheng Shang, Jianqing Xu, Jingyun Zhang, ShaoMing Wang, Yang Zhao, Shouhong Ding, Wei Jia, Yunsheng Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18312
Pdf URL: https://arxiv.org/pdf/2503.18312
Copy Paste: [[2503.18312]] Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models(https://arxiv.org/abs/2503.18312)
Keywords: generation
Abstract: Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose the palm creases conditioned diffusion model with a novel intra-class variation control method. By applying our proposed $K$-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases.
摘要：缺少大规模公开可用数据集，棕榈印刷的识别受到很大的限制。先前的方法采用了贝齐尔曲线来模拟棕榈折痕，然后将其作为有条件剂量的输入，以产生逼真的棕榈印刷。但是，在不采用真实数据微调的情况下，在这些合成数据集中训练的识别模型的性能将大大降低，这表明生成的和真实的掌刻之间存在很大的差距。这主要是由于利用不准确的棕榈折痕表示以及在平衡阶层内变化与身份一致性方面的挑战。为了解决这个问题，我们引入了一个基于多项式的棕榈折痕表示，该表示提供了一种与实际分布更加紧密结合的新的棕榈折痕产生机制。我们还提出了使用一种新型的阶层内变异控制方法调节的棕榈折痕。通过应用我们提出的$ K $步骤共享噪声分享抽样，我们可以合成具有较大类内变化和高标识一致性的Palmprint数据集。实验结果表明，首次仅在我们的合成数据集上训练的识别模型，而没有任何微调，超过了在真实数据集中训练的模型。此外，随着生成的身份的数量增加，我们的方法可以达到出色的识别性能。

Title: Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization

Authors: Ruijia Zhang, Mingxi Lei, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18317
Pdf URL: https://arxiv.org/pdf/2503.18317
Copy Paste: [[2503.18317]] Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization(https://arxiv.org/abs/2503.18317)
Keywords: generative
Abstract: In this paper, we study the problem of (finite sum) minimax optimization in the Differential Privacy (DP) model. Unlike most previous studies, which consider (strongly) convex-concave settings or loss functions satisfying the Polyak-Lojasiewicz condition, we mainly focus on the nonconvex-strongly-concave setting, which encapsulates many models in deep learning such as deep AUC maximization. Specifically, we first analyze a DP version of Stochastic Gradient Descent Ascent (SGDA) and show that it is possible to get a DP estimator whose $l_2$-norm of the gradient for the empirical risk function is upper bounded by $\tilde{O}(\frac{d^{1/4}}{({n\epsilon})^{1/2}})$, where $d$ is the model dimension and $n$ is the sample size. We then propose a new method with less gradient noise variance and improve the upper bound to $\tilde{O}(\frac{d^{1/3}}{(n\epsilon)^{2/3}})$, which matches the best-known result for DP Empirical Risk Minimization with non-convex loss. We also discuss several lower bounds of private minimax optimization. Finally, experiments on AUC maximization, generative adversarial networks, and temporal difference learning with real-world data support our theoretical analysis.
摘要：在本文中，我们研究了差异隐私（DP）模型中（有限总和）最小值优化的问题。与以前的大多数有关（强）凸孔环境或满足Polyak-lojasiewicz条件的损失功能的研究不同，我们在这里主要关注非convex-rong-rong-concove One，它封装了许多深度学习中的许多模型，例如深度AUC最大化。具体而言，我们首先分析了随机梯度下降上升（SGDA）的DP版本，并表明有可能获得一个DP估计器，其$ L_2 $ - 经验风险功能的梯度norm norm in $ \ tilde {o}（\ frac {d^{1/4}}} {（{n \ epsilon}）^{1/2}}）$，其中$ d $是模型尺寸，$ n $是样本大小。然后，我们提出了一种具有较小梯度噪声差异的新方法，并将上限提高到$ \ tilde {o}（\ frac {\ frac {d^{1/3}}} {（N \ epsilon）^{2/3}}）$，这与非conve损失的DP经验风险最小化最佳结果相匹配。我们还讨论了私有最小值优化的几个下限。最后，对AUC最大化，生成对抗网络以及使用现实世界数据的时间差异学习的实验支持我们的理论分析。
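
To see when the second bound in the abstract above actually improves on the first, one can take the ratio of the two rates. This is a short worked comparison that ignores the constants and logarithmic factors hidden in $\tilde{O}$; it is derived here from the two stated expressions and is not quoted from the paper.

```latex
\frac{d^{1/3}/(n\epsilon)^{2/3}}{d^{1/4}/(n\epsilon)^{1/2}}
  = \frac{d^{1/3-1/4}}{(n\epsilon)^{2/3-1/2}}
  = \frac{d^{1/12}}{(n\epsilon)^{1/6}}
  = \left(\frac{\sqrt{d}}{n\epsilon}\right)^{1/6}
```

The ratio is below 1 exactly when $\sqrt{d} < n\epsilon$, so under this reading the improved rate is the smaller of the two whenever the sample size times the privacy parameter exceeds the square root of the model dimension.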

Title: Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control

Authors: Basim Azam, Naveed Akhtar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18324
Pdf URL: https://arxiv.org/pdf/2503.18324
Copy Paste: [[2503.18324]] Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control(https://arxiv.org/abs/2503.18324)
Keywords: generation, generative
Abstract: Ethical issues around text-to-image (T2I) models demand a comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain limited to handling the facets of responsibility concepts individually, while also lacking in interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results.
摘要：围绕文本图像（T2I）模型的道德问题要求对生成内容进行全面控制。针对负责任的T2I模型解决这些问题的现有技术旨在使生成的内容公平，安全（非暴力/明确）。但是，这些方法仍在分别处理责任概念的方面，同时缺乏解释性。此外，它们通常需要对原始模型进行更改，这会损害模型性能。在这项工作中，我们提出了一种独特的技术，可以通过同时考虑以可扩展方式来实现广泛和安全的内容生成的广泛概念来实现负责任的T2I生成。关键想法是将目标T2I管道提炼出一种外部插件机制，该机制在目标T2i管道上学习了可解释的复合负责人的空间。我们使用知识蒸馏和概念美白来实现这一目标。在推断时，学到的空间用于调节生成含量。典型的T2i管道为我们的方法提供了两个插件点；文本嵌入空间和扩散模型潜在空间。我们开发了这两个点的模块，并通过一系列强大的结果显示了我们方法的有效性。

Title: GranQ: Granular Zero-Shot Quantization with Unified Layer-Channel Awareness

Authors: Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Sanghyun Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18339
Pdf URL: https://arxiv.org/pdf/2503.18339
Copy Paste: [[2503.18339]] GranQ: Granular Zero-Shot Quantization with Unified Layer-Channel Awareness(https://arxiv.org/abs/2503.18339)
Keywords: generation
Abstract: Zero-shot quantization (ZSQ) enables neural network compression without training data, which is crucial in restricted data access environments. However, existing ZSQ methods suffer from significant activation loss in low-bit environments owing to their coarse-grained scaling strategy. To address this issue, we propose GranQ, a novel ZSQ approach that leverages layer-channel awareness to minimize the quantization error. Unlike conventional layer- or channel-wise quantization, GranQ dynamically adjusts quantization granularity by considering both layer- and channel-level activation distributions. This enables fine-grained quantization while minimizing activation distortion. Additionally, we introduce vectorized activation quantization, which enables efficient parallel computation and reduces computational overhead while preserving accuracy. GranQ outperforms state-of-the-art ZSQ methods that employ quantization-aware training. With these findings, we anticipate that GranQ will inspire novel research directions beyond conventional ZSQ approaches focused on data generation and model training.
摘要：零拍摄量化（ZSQ）可以在没有训练数据的情况下实现神经网络压缩，这在受限的数据访问环境中至关重要。但是，由于其粗粒度的缩放策略，现有的ZSQ方法在低位环境中遭受了明显的激活损失。为了解决这个问题，我们提出了Granq，这是一种新型的ZSQ方法，它利用层通道意识将量化误差最小化。与常规层或通道量化不同，Granq通过考虑层和通道级激活分布来动态调整量化粒度。这可以实现细粒度的量化，同时最大程度地减少激活失真。此外，我们引入了矢量化激活量化，从而实现有效的并行计算，并降低了计算开销，同时保持准确性。与采用量化感知培训的最先进的ZSQ方法相比，Granq的性能优越。通过这些发现，我们预计Granq将激发新的研究方向以外的传统ZSQ方法，这些方法侧重于数据生成和模型培训。
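
The GranQ abstract attributes activation loss in low-bit settings to coarse-grained scaling. The sketch below is not GranQ itself; it is a generic illustration, with made-up activation values and hypothetical function names, of why a single layer-wide scale can wipe out a small-magnitude channel while per-channel scales preserve it.

```python
def quantize(values, scale, bits=4):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(original, restored):
    flat_o = [v for ch in original for v in ch]
    flat_r = [v for ch in restored for v in ch]
    return sum((o - r) ** 2 for o, r in zip(flat_o, flat_r)) / len(flat_o)


# Two channels with very different ranges (illustrative values only).
activations = [[0.1, -0.2, 0.05], [10.0, -8.0, 4.0]]
qmax = 2 ** (4 - 1) - 1

# Coarse: one scale for the whole layer, set by the largest activation.
layer_scale = max(abs(v) for ch in activations for v in ch) / qmax
per_layer = [quantize(ch, layer_scale) for ch in activations]

# Fine: one scale per channel (fall back to 1.0 for an all-zero channel).
per_channel = [
    quantize(ch, (max(abs(v) for v in ch) / qmax) or 1.0) for ch in activations
]

layer_err = mean_squared_error(activations, per_layer)
channel_err = mean_squared_error(activations, per_channel)
print(layer_err > channel_err)
```

With the layer-wide scale, every value in the first channel rounds to zero, whereas the per-channel scale keeps them distinguishable. Finer granularity trades extra scale parameters for lower distortion.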

Title: Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics

Authors: Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18349
Pdf URL: https://arxiv.org/pdf/2503.18349
Copy Paste: [[2503.18349]] Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics(https://arxiv.org/abs/2503.18349)
Keywords: generation
Abstract: Human-Object Interaction (HOI) is vital for advancing simulation, animation, and robotics, enabling the generation of long-term, physically plausible motions in 3D environments. However, existing methods often fall short of achieving physics realism and supporting diverse types of interactions. To address these challenges, this paper introduces a unified Human-Object Interaction framework that provides unified control over interactions with static scenes and dynamic objects using language commands. The interactions between human and object parts can always be described as the continuous stable Relative Movement Dynamics (RMD) between human and object parts. By leveraging the world knowledge and scene perception capabilities of Vision-Language Models (VLMs), we translate language commands into RMD diagrams, which are used to guide goal-conditioned reinforcement learning for sequential interaction with objects. Our framework supports long-horizon interactions among dynamic, articulated, and static objects. To support the training and evaluation of our framework, we present a new dataset named Interplay, which includes multi-round task plans generated by VLMs, covering both static and dynamic HOI tasks. Extensive experiments demonstrate that our proposed framework can effectively handle a wide range of HOI tasks, showcasing its ability to maintain long-term, multi-round transitions. For more details, please refer to our project webpage: this https URL.
摘要：人类对象相互作用（HOI）对于推进模拟，动画和机器人技术至关重要，从而使3D环境中的长期，物理上合理的运动能够产生。但是，现有方法通常无法实现物理现实主义并支持各种类型的相互作用。为了应对这些挑战，本文介绍了一个统一的人类对象交互框架，该框架使用语言命令提供了对与静态场景和动态对象相互作用的统一控制。人与物体部分之间的相互作用始终可以描述为人与物体部分之间的连续稳定相对运动动力学（RMD）。通过利用视觉语言模型（VLM）的世界知识和场景感知能力，我们将语言命令转换为RMD图，这些命令用于指导目标条件的加强学习，以实现与对象的顺序交互。我们的框架支持动态，铰接和静态对象之间的长马相互作用。为了支持对我们框架的培训和评估，我们提出了一个名为Interplay的新数据集，其中包括由VLMS生成的多轮任务计划，涵盖了静态和动态HOI任务。广泛的实验表明，我们提出的框架可以有效地处理各种HOI任务，从而展示其保持长期多轮过渡的能力。有关更多详细信息，请参阅我们的项目网页：此HTTPS URL。

Title: Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Authors: Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18352
Pdf URL: https://arxiv.org/pdf/2503.18352
Copy Paste: [[2503.18352]] Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models(https://arxiv.org/abs/2503.18352)
Keywords: generation
Abstract: In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
摘要：在本文中，我们提出了扩散-4K，这是一种使用文本对图像扩散模型的直接超高分辨率图像合成的新型框架。核心进步包括：（1）Aesthetic-4K基准：解决缺乏公开可用的4K图像合成数据集，我们构建了Aesthetic-4K，这是超高分辨率图像生成的综合基准。我们用GPT-4O生成的精心选择的图像和字幕策划了一个高质量的4K数据集。此外，我们引入了GLCM评分和压缩比度量，以评估细节，并结合FID，美学和夹克等整体措施，以全面评估超高分辨率图像。（2）基于小波的微调：我们提出了一种基于小波的微调方法，用于使用光真逼真的4K图像进行直接训练，适用于各种潜在扩散模型，以表明其在合成高度详细的4K图像中的有效性。因此，扩散-4K在高质量的图像合成和文本及时粘附方面取得了令人印象深刻的性能，尤其是在由现代大规模扩散模型（例如SD3-2B和Flux-12b）提供支持时。我们的基准测试的广泛实验结果表明，扩散4K在超高分辨率图像合成中的优越性。

Title: Context-Enhanced Memory-Refined Transformer for Online Action Detection

Authors: Zhanzhong Pang, Fadime Sener, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18359
Pdf URL: https://arxiv.org/pdf/2503.18359
Copy Paste: [[2503.18359]] Context-Enhanced Memory-Refined Transformer for Online Action Detection(https://arxiv.org/abs/2503.18359)
Keywords: generation
Abstract: Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training-inference discrepancy in existing OAD methods that hinders learning effectiveness. The training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder to leverage near-future generation to enhance performance. CMeRT achieves state-of-the-art in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
摘要：在线操作检测（OAD）使用过去的观察结果检测流动视频中的动作。最先进的OAD方法模拟了过去的观察结果及其与预期未来的互动。过去是使用短期和长期记忆来捕获即时和长期依赖性的，而预期则弥补了未来缺失的上下文。我们在现有的OAD方法中确定了培训差异，从而阻碍了学习有效性。训练使用不同长度的短期内存，而推理依赖于全长短期内存。作为一种补救措施，我们提出了一种上下文增强内存的变压器（CMERT）。 CMERT引入了一个上下文增强的编码器，以使用其他近贴上上下文来改善帧表示。它还具有内存精制的解码器，以利用近未实现的生成来提高性能。 CMERT在Thumos'14，Crosstask和Epic-Kitchens-100的在线检测和预期中实现了最新的预期。

Title: DiffusedWrinkles: A Diffusion-Based Model for Data-Driven Garment Animation

Authors: Raquel Vidaurre, Elena Garces, Dan Casas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18370
Pdf URL: https://arxiv.org/pdf/2503.18370
Copy Paste: [[2503.18370]] DiffusedWrinkles: A Diffusion-Based Model for Data-Driven Garment Animation(https://arxiv.org/abs/2503.18370)
Keywords: generation, generative
Abstract: We present a data-driven method for learning to generate animations of 3D garments using a 2D image diffusion model. In contrast to existing methods, typically based on fully connected networks, graph neural networks, or generative adversarial networks, which have difficulty coping with parametric garments with fine wrinkle detail, our approach is able to synthesize high-quality 3D animations for a wide variety of garments and body shapes, while being agnostic to the garment mesh topology. Our key idea is to represent 3D garment deformations as a 2D layout-consistent texture that encodes 3D offsets with respect to a parametric garment template. Using this representation, we encode a large dataset of garments simulated in various motions and shapes and train a novel conditional diffusion model that is able to synthesize high-quality pose-, shape-, and design-dependent 3D garment deformations. Since our model is generative, we can synthesize various plausible deformations for a given target pose, shape, and design. Additionally, we show that we can further condition our model using an existing garment state, which enables the generation of temporally coherent sequences.
摘要：我们提出了一种数据驱动的方法，用于学习使用2D图像扩散模型生成3D服装的动画。与现有方法相反，通常基于完全连接的网络，图形神经网络或生成的对抗网络，这些网络很难应对具有细节细节的参数服装，我们的方法能够合成各种服装和身体形状的高质量3D动画，同时是对服装的各种服装，同时是对服装的层次。我们的关键想法是将3D服装变形表示为2D布局一致的纹理，该纹理编码3D相对于参数服装模板。使用此表示，我们编码了以各种运动和形状模拟的大型服装数据集，并训练了一个新型的条件扩散模型，该模型能够合成高质量的姿势形状和设计依赖于3D服装变形。由于我们的模型是生成的，因此我们可以为给定的目标姿势，形状和设计合成各种合理的变形。此外，我们表明我们可以使用现有的服装状态进一步调节模型，从而使时间相干序列产生。

Title: Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance

Authors: Sicong Feng, Jielong Yang, Li Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18386
Pdf URL: https://arxiv.org/pdf/2503.18386
Copy Paste: [[2503.18386]] Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance(https://arxiv.org/abs/2503.18386)
Keywords: generation
Abstract: Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges such as high training costs, substantial data requirements, and difficulties in maintaining consistency between given text and motion of the foreground object. To address these challenges, we propose mask-guided video generation, which can control video generation through mask motion sequences, while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.
摘要：扩散模型的最新进展为视觉内容创建带来了新的活力。但是，当前的文本到视频生成模型仍然面临重大挑战，例如高训练成本，大量数据要求以及在前景对象的给定文本和运动之间保持一致性的困难。为了应对这些挑战，我们提出了面具引导的视频生成，可以通过掩盖运动序列控制视频，同时需要有限的培训数据。我们的模型通过合并前景口罩以进行精确的文本位置匹配和运动轨迹控制来增强现有体系结构。通过掩码运动序列，我们指导视频生成过程以保持整个序列中的一致前景对象。此外，通过第一框架共享策略和自动回归扩展方法，我们实现了更稳定，更长的视频生成。广泛的定性和定量实验表明，这种方法在各种视频生成任务中擅长，例如视频编辑和生成艺术视频，在一致性和质量方面优于以前的方法。我们生成的结果可以在补充材料中查看。

Title: Knowledge Graph Enhanced Generative Multi-modal Models for Class-Incremental Learning

Authors: Xusheng Cao, Haori Lu, Linlan Huang, Fei Yang, Xialei Liu, Ming-Ming Cheng
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18403
Pdf URL: https://arxiv.org/pdf/2503.18403
Copy Paste: [[2503.18403]] Knowledge Graph Enhanced Generative Multi-modal Models for Class-Incremental Learning(https://arxiv.org/abs/2503.18403)
Keywords: generative
Abstract: Continual learning in computer vision faces the critical challenge of catastrophic forgetting, where models struggle to retain prior knowledge while adapting to new tasks. Although recent studies have attempted to leverage the generalization capabilities of pre-trained models to mitigate overfitting on current tasks, models still tend to forget details of previously learned categories as tasks progress, leading to misclassification. To address these limitations, we introduce a novel Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM) that builds an evolving knowledge graph throughout the learning process. Our approach utilizes relationships within the knowledge graph to augment the class labels and assigns different relations to similar categories to enhance model differentiation. During testing, we propose a Knowledge Graph Augmented Inference method that locates specific categories by analyzing relationships within the generated text, thereby reducing the loss of detailed information about old classes when learning new knowledge and alleviating forgetting. Experiments demonstrate that our method effectively leverages relational information to help the model correct mispredictions, achieving state-of-the-art results in both conventional CIL and few-shot CIL settings, confirming the efficacy of knowledge graphs at preserving knowledge in the continual learning scenarios.
摘要：计算机视觉中的持续学习面临着灾难性遗忘的关键挑战，在此模型在适应新任务的同时努力保持先验知识。尽管最近的研究试图利用预训练模型的概括能力来减轻当前任务的过度适应，但随着任务的进展，模型仍然倾向于忘记先前学习的类别的细节，从而导致错误分类。为了解决这些局限性，我们引入了一个新颖的知识图增强生成的多模式模型（KG-GMM），该模型在整个学习过程中构建了不断发展的知识图。我们的方法利用知识图内的关系来增强类标签，并将不同的关系分配给相似类别以增强模型差异。在测试过程中，我们提出了一个知识图增强推理方法，该方法通过分析生成的文本中的关系来定位特定类别，从而减少了学习新知识和减轻遗忘的详细信息的丢失。实验表明，我们的方法有效地利用了关系信息来帮助模型正确的错误预测，从而实现了最新的方法，从而导致传统的CIL和少量CIL设置，从而证实了知识图在持续学习方案中保留知识的功效。

Title: Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Authors: Sherry X. Chen, Misha Sra, Pradeep Sen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18406
Pdf URL: https://arxiv.org/pdf/2503.18406
Copy Paste: [[2503.18406]] Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning(https://arxiv.org/abs/2503.18406)
Keywords: generative
Abstract: Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and obtain over 120K refined samples, which we then use to fine-tune their model, guided by our novel Instruct-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at this https URL.
摘要：尽管自然语言说明提供了一种直观的方式来指导自动化图像编辑，但深度学习模型通常很难获得高质量的结果，这主要是由于创建大型高质量培训数据集的挑战。以前的工作通常依赖文本效果（T2I）生成模型来生成成对的原始图像和编辑的图像，以模拟指导引导的图像编辑模型的输入/输出。但是，由于T2I模型的局限性，这些图像对通常无法与指定的编辑说明保持一致，这会对在此类数据集上进行训练的模型产生负面影响。为了解决这个问题，我们提出了一种自制的方法，它是一种学习原始图像和编辑图像之间的语义变化，以完善和更好地对齐现有数据集中的说明。此外，我们适应了指令夹以处理嘈杂的潜在图像和扩散时间段，以便它可以用于训练潜在扩散模型（LDMS）[19] [19]，并有效地在扩散管道的任何步骤处的编辑指令与潜在空间的图像变化之间的图像变化。我们使用指令clip来纠正指令PIX2PIX数据集，并获得超过120k精制的样本，然后使用我们的新型基于Covel-CLIP的损耗功能来指导其模型。最终的模型可以产生与给定指令更一致的编辑。我们的代码和数据集可在此HTTPS URL上找到。

Title: U-REPA: Aligning Diffusion U-Nets to ViTs

Authors: Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18414
Pdf URL: https://arxiv.org/pdf/2503.18414
Copy Paste: [[2503.18414]] U-REPA: Aligning Diffusion U-Nets to ViTs(https://arxiv.org/abs/2503.18414)
Keywords: generation
Abstract: Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which shows faster convergence than DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To counter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we observe that, due to skip connections, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA can achieve excellent generation quality and greatly accelerate convergence. With CFG guidance interval, U-REPA reaches $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA. Codes are available at this https URL.
摘要：将扩散变压器（DIT）与VIT视觉编码器保持一致的表示对齐（REPA）已证明在DIT训练中非常有效，表明了出色的收敛性能，但是在规范扩散U-NET体系结构上尚未得到验证，与DIT相比，它显示出更快的融合。但是，将REPA适应U-NET体系结构提出了独特的挑战：（1）不同的块功能需要修订的对齐策略；（2）从U-NET的空间下采样操作出现了空间维度不一致；（3）U-NET和VIT之间的空间间隙阻碍了令牌比对的有效性。为了遇到这些挑战，我们提出了U-Repa，这是一种表示U-NET隐藏状态和VIT特征的表示对齐范式，如下所示：首先，我们通过观察表明，由于跳过连接，U-NET的中间阶段是U-NET的中间阶段是最佳的对齐选项。其次，我们提出了U-NET特征通过MLP后的提升采样。第三，在执行令牌相似性比对时，我们会观察到难度，并进一步引入了多种损失，从而使样品之间的相对相似性正常。实验表明，由此产生的U-REPA可以达到出色的发电质量，并大大提高收敛速度。借助CFG指导间隔，U-REPA在200个时期内可以达到$ FID <1.5 $，或Imagenet 256 $ \ tims $ 256的100万迭代，只需要一半的总时代就可以比REPA更好。代码可在此HTTPS URL上找到。

Title: Panorama Generation From NFoV Image Done Right

Authors: Dian Zheng, Cheng Zhang, Xiao-Ming Wu, Cao Li, Chengfei Lv, Jian-Fang Hu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18420
Pdf URL: https://arxiv.org/pdf/2503.18420
Copy Paste: [[2503.18420]] Panorama Generation From NFoV Image Done Right(https://arxiv.org/abs/2503.18420)
Keywords: generation
Abstract: Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which tend to perceive image quality and are not suitable for evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and discover the "visual cheating" phenomenon in previous works (i.e., tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing a panorama-specific distortion prior and a modified condition registration mechanism, and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. Extensive experiments validate that PanoDecouple surpasses existing methods in both distortion and visual metrics.
摘要：从狭窄的视野（NFOV）图像生成360度全景图是虚拟现实（VR）应用程序的有前途的计算机视觉任务。现有方法主要通过基于InceptionNet或基于夹的指标评估生成的全景图，这些指标倾向于感知图像质量，并且是\ textbf {不适合评估失真}。在这项工作中，我们首先提出了一个特定于失真的剪辑，称为扭曲clip，以准确评估全景变形并发现\ textbf {``视觉作弊''}现象在先前的作品中（\ ie）（\ ie，倾向于通过牺牲失真准确性来改善视觉结果）。这种现象之所以出现，是因为先前的方法采用单个网络来一次学习独特的全景变形和内容完成，这导致模型优先优化后者。为了解决该现象，我们提出了\ textbf {panodecouple}，这是一个被解耦的扩散模型框架，该框架将全景的生成分解为失真指导和内容完成，旨在使全景既具有准确的失真和视觉吸引力。具体而言，我们通过施加特定于全景特定的失真性和修改状态的注册机制来设计扭曲指导的扭曲网络。以及通过强加透视图像信息来完成内容完成的内容网。此外，引入了带有扭曲的失真校正损失函数，以明确限制失真。广泛的实验验证了PanoDecouple超过失真和视觉指标的现有方法。

Title: Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

Authors: Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, Ming Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18429
Pdf URL: https://arxiv.org/pdf/2503.18429
Copy Paste: [[2503.18429]] Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation(https://arxiv.org/abs/2503.18429)
Keywords: generation
Abstract: In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second of video generation), and achieves real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.
摘要：在这项工作中，我们介绍了第一个用于实时，音频驱动的肖像画的自动回归框架，又名Talking Head。除了冗长的动画时代的挑战之外，在现实的说话校长中，一个关键的挑战在于保留各种身体部位的自然运动。为此，我们提出了Teller，这是第一个流动音频驱动的弹性动画框架，并具有自回归运动生成。 Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transfromer, and movement authenticity refinement using a Efficient Temporal Module (ETM).Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with音频嵌入。这使AR Tranformer可以从音频到运动学习实时，基于流的映射。此外，柜员还合并ETM以捕获更精细的运动细节。该模块可确保身体部位和配饰（例如颈部肌肉和耳环）的身体一致性，从而改善了这些运动的现实主义。柜员旨在高效，超过了基于扩散的模型的推理速度（Hallo 20.93s vs. Teller 0.92s，一秒钟的视频生成），并实现了高达25 fps的实时流传输性能。广泛的实验表明，我们的方法的表现优于最新的音频驱动肖像动画模型，尤其是在小动作中，这是通过人类评估质量和现实主义显着差的人类评估验证的。
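The Teller abstract says a Residual VQ model maps continuous motion latents to discrete tokens. As a rough illustration of residual vector quantization in general (not the authors' implementation), the sketch below quantizes a vector in stages, with each stage encoding what the previous stage left over. The function names, the two-stage setup, and the toy codebook values are all assumptions made for the example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage only sees the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (hypothetical values): a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
tokens = rvq_encode([1.2, 1.0], codebooks)
approx = rvq_decode(tokens, codebooks)
```

In this toy setup each latent becomes one token per stage, and a sequence of such token tuples is the kind of discrete input an autoregressive model can be trained on.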

Title: ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Authors: Guosheng Zhao, Xiaofeng Wang, Chaojun Ni, Zheng Zhu, Wenkang Qin, Guan Huang, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18438
Pdf URL: https://arxiv.org/pdf/2503.18438
Copy Paste: [[2503.18438]] ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation(https://arxiv.org/abs/2503.18438)
Keywords: generative
Abstract: Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23.0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.
摘要：将重建模型与生成模型相结合已成为自动驾驶中闭环模拟的有希望的范式。例如，Reconnemener在渲染大规模操作方面取得了巨大的成功。但是，生成的数据和现实世界传感器的观测值之间仍然存在一个显着的差距，尤其是在结构化元素（例如地面）的忠诚度上。为了应对这些挑战，我们提出了Recondreamer ++，这是一个增强的框架，通过减轻域间隙并完善地面的表示，可以显着提高整体渲染质量。具体而言，Reconnemener ++引入了新型轨迹变形网络（NTDNET），该网络利用可学习的空间变形机制来弥合合成的新型视图和原始传感器观测之间的域间隙。此外，对于诸如地面等结构化元素，我们保留3D高斯人的几何知识知识，优化过程着重于精炼外观属性，同时保留了基本的几何结构。在多个数据集（Waymo，Nuscenes，Pandaset和EUV）上进行的实验评估证实了Recondreamer ++的出色性能。具体而言，在Waymo上，Recondreamer ++的性能与原始轨迹的街头高斯人相当，同时在新型轨迹上的表现明显超过了旋转器。特别是，它取得了实质性改善，包括NTA-IOU增加了6.1％，FID提高了23％，地面公制NTL-IOU的增益显着4.5％，突出了其在准确地重建路面等结构化元件（例如道路表面）中的有效性。

Title: Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

Authors: Jinho Jeong, Sangmin Han, Jinwoo Kim, Seon Joo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18446
Pdf URL: https://arxiv.org/pdf/2503.18446
Copy Paste: [[2503.18446]] Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models(https://arxiv.org/abs/2503.18446)
Keywords: super-resolution, generation
Abstract: In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at this https URL.
摘要：在本文中，我们提出了LSRNA，这是一种使用扩散模型的新型框架，用于通过直接在潜在空间中利用超分辨率来生成高分辨率（超过1K）图像。现有的扩散模型努力缩放其超越培训决议，通常会导致结构扭曲或内容重复。基于参考的方法通过提高低分辨率参考来指导高分辨率生成来解决问题。但是，它们面临重大挑战：潜在空间中的提升通常会导致多种偏差，从而降低产出质量。另一方面，在RGB空间中的UPS采样倾向于产生过度平滑的输出。为了克服这些局限性，LSRNA结合了潜在的空间超分辨率（LSR），以进行歧管比对和添加区域噪声（RNA），以增强高频细节。我们的广泛实验表明，在各种分辨率和指标上，整合LSRNA的表现优于最先进的参考方法，同时显示了潜在空间在保留细节和清晰度中的关键作用。该代码可在此HTTPS URL上找到。

Title: InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Authors: Yunhong Lu, Qichao Wang, Hengyuan Cao, Xierui Wang, Xiaoyin Xu, Min Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.18454
Pdf URL: https://arxiv.org/pdf/2503.18454
Copy Paste: [[2503.18454]] InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment(https://arxiv.org/abs/2503.18454)
Keywords: generation, generative
Abstract: Without using an explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. To accomplish this objective, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an inversion technique to estimate appropriate latent variables for preference optimization. This modification process enables the diffusion model to fine-tune only the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference-alignment baselines for T2I diffusion models in human preference evaluation tasks.
摘要：无需使用明确的奖励，直接偏好优化（DPO）将配对的人类偏好数据用于微调生成模型，这种方法在大语言模型（LLMS）中引起了相当大的关注。但是，对具有人类偏好的文本对象（T2I）扩散模型的探索仍然有限。与监督的微调相比，由于长长的马尔可夫链过程和反向过程的棘手性，对齐扩散模型的现有方法均具有低训练效率和不足的生成质量。为了解决这些局限性，我们引入了DDIM-INPO，这是一种有效的扩散模型偏好比对的方法。我们的方法将扩散模型概念化为单步生成模型，使我们可以选择性地调整特定潜在变量的输出。为了实现这一目标，我们首先通过重新聚集技术直接将隐式奖励分配给任何潜在变量。然后，我们构建一种反转技术，以估算适当的潜在变量以优化。此修改过程使扩散模型只能微调与优先数据集有很强相关性的潜在变量的输出。实验结果表明，我们的DDIM-INPO仅通过微调进行400个步骤来实现最先进的性能，超过了人类偏好评估任务中T2I扩散模型的所有偏好对齐基线。
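For readers unfamiliar with the DPO objective that the InPO abstract builds on, here is a minimal sketch of the standard DPO loss for a single preference pair. It shows only the generic language-model form. It does not reproduce the DDIM-InPO reparameterization or inversion steps, which the abstract does not spell out. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihood of the preferred / rejected sample under
    the model being tuned. ref_logp_w / ref_logp_l: the same quantities under
    the frozen reference model. beta scales the implicit reward (assumed value).
    """
    # Implicit reward margin: how much more the tuned model favours the
    # preferred sample, relative to the reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The loss drops below log(2) once the tuned model prefers the winner more
# strongly than the reference does, and rises above it in the opposite case.
better = dpo_loss(-1.0, -3.0, -2.0, -2.0)
worse = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

No explicit reward model appears anywhere in the computation: the log-ratio against the reference model plays that role, which is the property the abstract refers to as "without using an explicit reward".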

Title: Hiding Images in Diffusion Models by Editing Learned Score Functions

Authors: Haoyu Chen, Yunqiao Yang, Nan Zhong, Kede Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18459
Pdf URL: https://arxiv.org/pdf/2503.18459
Copy Paste: [[2503.18459]] Hiding Images in Diffusion Models by Editing Learned Score Functions(https://arxiv.org/abs/2503.18459)
Keywords: generative
Abstract: Hiding data using neural networks (i.e., neural steganography) has achieved remarkable success across both discriminative classifiers and generative adversarial networks. However, the potential of data hiding in diffusion models remains relatively unexplored. Current methods exhibit limitations in achieving high extraction accuracy, model fidelity, and hiding efficiency due primarily to the entanglement of the hiding and extraction processes with multiple denoising diffusion steps. To address these, we describe a simple yet effective approach that embeds images at specific timesteps in the reverse diffusion process by editing the learned score functions. Additionally, we introduce a parameter-efficient fine-tuning method that combines gradient-based parameter selection with low-rank adaptation to enhance model fidelity and hiding efficiency. Comprehensive experiments demonstrate that our method extracts high-quality images at human-indistinguishable levels, replicates the original model behaviors at both sample and population levels, and embeds images orders of magnitude faster than prior methods. Besides, our method naturally supports multi-recipient scenarios through independent extraction channels.
摘要：使用神经网络（即神经隐志）隐藏数据在歧视性分类器和生成对抗网络中都取得了巨大的成功。但是，隐藏在扩散模型中的潜力仍然相对尚未探索。当前方法在实现高提取精度，模型保真度和隐藏效率方面表现出局限性，这主要是由于隐藏过程和提取过程的纠缠而具有多个deo的扩散步骤。为了解决这些问题，我们描述了一种简单而有效的方法，该方法通过编辑学习的分数函数将图像嵌入在反向扩散过程中的特定时间步中。此外，我们引入了一种参数效率高的微调方法，该方法将基于梯度的参数选择与低级适应性结合起来，以增强模型保真度和隐藏效率。全面的实验表明，我们的方法在可与众不同的水平上提取高质量的图像，在样本和人群水平上复制原始模型行为，并比先前的方法更快地嵌入数量级。此外，我们的方法自然会通过独立的提取通道支持多重新的方案。

Title: MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing

Authors: Lingting Zhu, Jingrui Ye, Runze Zhang, Zeyu Hu, Yingda Yin, Lanjiong Li, Jinnan Chen, Shengju Qian, Xin Wang, Qingmin Liao, Lequan Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18461
Pdf URL: https://arxiv.org/pdf/2503.18461
Copy Paste: [[2503.18461]] MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing(https://arxiv.org/abs/2503.18461)
Keywords: generation
Abstract: Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Agentic post-processing. Our approach features two key innovations: 1) We opt to model shaded and albedo appearance channels, where the shaded channels enable the integration of intrinsic decomposition modules for material properties. 2) Leveraging multimodal large language models, we emulate artists' techniques for material assessment and selection. Experiments demonstrate that MuMA achieves superior results in visual quality and material fidelity compared to existing methods.
摘要：3D代的当前方法仍然缺乏基于物理的渲染（PBR）纹理，这主要是由于对多通道材料进行建模的数据和挑战有限。在这项工作中，我们提出了MUMA，这是一种通过多渠道多视图生成和代理后处理的3D PBR纹理的方法。我们的方法具有两个关键的创新：1）我们选择对阴影和反照率的外观通道进行建模，在该通道中，阴影通道可实现材料特性的集成固有分解模块。 2）利用多模式的大语模型，我们效仿了艺术家的技术进行材料评估和选择。实验表明，与现有方法相比，MUMA在视觉质量和材料保真度方面取得了优势。

Title: PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models

Authors: Tadeusz Dziarmaga, Marcin Kądziołka, Artur Kasymov, Marcin Mazur
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18462
Pdf URL: https://arxiv.org/pdf/2503.18462
Copy Paste: [[2503.18462]] PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models(https://arxiv.org/abs/2503.18462)
Keywords: generative
Abstract: Deep generative models (DGMs) have caused a paradigm shift in the field of machine learning, yielding noteworthy advancements in domains such as image synthesis, natural language processing, and other related areas. However, a comprehensive evaluation of these models that accounts for the trichotomy between fidelity, diversity, and novelty in generated samples remains a formidable challenge. A recently introduced solution that has emerged as a promising approach in this regard is the Feature Likelihood Divergence (FLD), a method that offers a theoretically motivated practical tool, yet also exhibits some computational challenges. In this paper, we propose PALATE, a novel enhancement to the evaluation of DGMs that addresses limitations of existing metrics. Our approach is based on a peculiar application of the law of total expectation to random variables representing accessible real data. When combined with the MMD baseline metric and the DINOv2 feature extractor, PALATE offers a holistic evaluation framework that matches or surpasses state-of-the-art solutions while providing superior computational efficiency and scalability to large-scale datasets. Through a series of experiments, we demonstrate the effectiveness of the PALATE enhancement, contributing a computationally efficient, holistic evaluation approach that advances the field of DGM assessment, especially in detecting sample memorization and evaluating generalization capabilities.
摘要：深层生成模型（DGM）导致机器学习领域的范式转移，在图像综合，自然语言处理和其他相关领域等领域中产生了值得注意的进步。但是，对这些模型进行了全面评估，该模型解释了产生的样本中的富裕性，多样性和新颖性之间的三分法，这仍然是一个巨大的挑战。在这方面，最近引入的解决方案是一种有前途的方法，是功能可能性差异（FLD），该方法提供了一种具有理论动机的实用工具，但也表现出了一些计算挑战。在本文中，我们提出了味蕾，这是对DGM的评估的一种新颖的增强，以解决现有指标的局限性。我们的方法是基于对代表可访问真实数据的随机变量的总期望定律的特殊应用。当与MMD基线指标和Dinov2提取器结合使用时，Papate提供了一个整体评估框架，该框架与最先进的解决方案相匹配或超过了最先进的解决方案，同时为大型数据集提供了出色的计算效率和可扩展性。通过一系列实验，我们证明了口感增强的有效性，促进了一种计算高效的整体评估方法，该方法可以提高DGMS评估领域，尤其是在检测样品记忆和评估概括能力时。
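The PALATE abstract names MMD as its baseline metric. The sketch below shows a plain biased (V-statistic) estimate of squared MMD between two sets of feature vectors. It covers only that baseline, not the PALATE enhancement itself, whose construction the abstract does not detail. The RBF kernel, the bandwidth `sigma`, and the toy 2-D "features" are assumptions; the paper may use a different kernel, and real features would come from an extractor such as DINOv2.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


# Toy 2-D "features" (hypothetical): one generated set close to the real
# data, one far away from it.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
close = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]
far = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
score_close = mmd2(real, close)
score_far = mmd2(real, far)
```

A lower value means the two feature distributions look more alike under the chosen kernel, so `score_close` comes out smaller than `score_far`.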

Title: MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Authors: Zhenyu Pan, Han Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18470
Pdf URL: https://arxiv.org/pdf/2503.18470
Copy Paste: [[2503.18470]] MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse(https://arxiv.org/abs/2503.18470)
Keywords: generation
Abstract: We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of various scale models. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at this https URL.
摘要：我们介绍了Metaspatial，这是第一个基于强化的框架（RL）的框架，旨在增强视觉模型（VLM）中的3D空间推理，从而无需进行硬编码的优化，从而实现了实时3D场景的生成。 Metaspatial解决了两个核心挑战：（i）VLMS中缺乏内部化的3D空间推理，这限制了它们产生逼真的布局的能力，以及（ii）传统的监督微调（SFT）用于布局生成任务的效率低下，因为不可用。我们的关键创新是一种基于多转移的RL优化机制，该机制集成了物理意识的约束和渲染的图像评估，确保生成的3D布局是连贯的，物理上合理的，并且在美学上是一致的。从方法论上讲，Metaspatial引入了一个自适应的迭代推理过程，在该过程中，VLM通过分析渲染的输出来逐渐改善场景相干性，从而在多个转弯中精炼空间排列。经验评估表明，Metaspatial显着提高了各种规模模型的空间一致性和格式稳定性。训练后，对象放置更为现实，对齐和功能连贯，从而验证了Metavers，AR/VR，Digital Twins和Game Development应用程序中RL对3D空间推理的有效性。我们的代码，数据和培训管道在此HTTPS URL上公开可用。

Title: Global-Local Tree Search for Language Guided 3D Scene Generation

Authors: Wei Deng, Mengshi Qi, Huadong Ma
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18476
Pdf URL: https://arxiv.org/pdf/2503.18476
Copy Paste: [[2503.18476]] Global-Local Tree Search for Language Guided 3D Scene Generation(https://arxiv.org/abs/2503.18476)
Keywords: generation
Abstract: Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e., room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of problem space. To leverage the VLM to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the names of emojis. The quantitative and qualitative experimental results illustrate that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at this https URL.
摘要：大型视觉模型（VLM），例如GPT-4，在各个领域都取得了巨大的成功。但是，很少有关于使用VLM的3D室内场景产生的研究。本文将此任务视为计划问题，但要受到空间和布局常识约束。为了通过VLM解决问题，我们提出了一种新的全局本地树搜索算法。在全球范围内，该方法将每个对象顺序放置，并在每个放置过程中探索多个位置，其中问题空间表示为树。为了减少树的深度，我们从层次上分解场景结构，即房间级，区域级，地板对象级别和受支持的对象级别。该算法独立生成不同区域的地板对象，并将受支持的对象放在不同的地板对象上。在本地，我们还将子任务（每个对象的放置位置）分解为多个步骤。该算法搜索问题空间的树。为了利用VLM模型来产生物体的位置，我们将自上而下的视图空间离散为密集的网格，并用各种表情符号填充每个细胞，以使细胞与细胞不同。我们使用表情符号网格提示VLM，VLM通过用表情符号的名称描述位置来为对象提供合理的位置。定量和定性的实验结果说明了我们的方法比最先进的方法产生的3D场景更合理。我们的源代码可在此HTTPS URL上找到。
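
The grid-labelling step described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `build_label_grid` and `cell_center`, the label pool, and the room dimensions are all hypothetical, and plain words stand in for the emojis used in the paper.

```python
from typing import List, Tuple

# Hypothetical pool of distinct labels. The paper fills cells with emojis;
# any set of unique, nameable symbols serves the same purpose in this sketch.
LABELS = ["apple", "banana", "cherry", "dog", "eagle", "fish",
          "grape", "horse", "igloo", "jar", "kite", "lemon"]


def build_label_grid(rows: int, cols: int) -> List[List[str]]:
    """Assign one unique label to every cell of a rows x cols grid."""
    if rows * cols > len(LABELS):
        raise ValueError("not enough distinct labels for this grid")
    return [[LABELS[r * cols + c] for c in range(cols)] for r in range(rows)]


def cell_center(grid: List[List[str]], label: str,
                room_w: float, room_d: float) -> Tuple[float, float]:
    """Map a label named by the model back to the center of its cell."""
    rows, cols = len(grid), len(grid[0])
    for r, row in enumerate(grid):
        for c, name in enumerate(row):
            if name == label:
                # Cell centers sit half a cell in from each cell's origin.
                return ((c + 0.5) * room_w / cols, (r + 0.5) * room_d / rows)
    raise KeyError(label)
```

In use, the grid would be rendered into the prompt, the model would answer with a label name, and `cell_center` would turn that name back into floor coordinates. A unique label per cell is what makes the answer unambiguous.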

Title: Can Text-to-Video Generation help Video-Language Alignment?

Authors: Luca Zanella, Massimiliano Mancini, Willi Menapace, Sergey Tulyakov, Yiming Wang, Elisa Ricci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18507
Pdf URL: https://arxiv.org/pdf/2503.18507
Copy Paste: [[2503.18507]] Can Text-to-Video Generation help Video-Language Alignment?(https://arxiv.org/abs/2503.18507)
Keywords: generation
Abstract: Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for both. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is to that of its real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.
摘要：最近的视频对齐模型对视频集进行了培训，每个视频都带有相关的积极标题和大型语言模型产生的负面标题。此过程的一个问题是，负面字幕可能会引入语言偏见，即，概念仅被视为负面，而从未与视频相关。虽然解决方案是为负面字幕收集视频，但现有数据库缺乏涵盖所有可能的负面因素所需的细粒度变化。在这项工作中，我们研究合成视频是否可以帮助克服这个问题。我们对多个发电机的初步分析表明，在有望完成某些任务的同时，合成视频会损害模型对其他任务的性能。我们假设此问题与生成的视频中的噪声（语义和视觉）有关，并开发了一种解释这些视频的方法，即Synvita。 Synvita基于其目标标题的方式，动态加权每个合成视频的贡献。真正的对手。此外，语义一致性损失使该模型集中在标题之间的细粒度差异上，而不是视频外观的差异。实验表明，平均而言，Synvita对视频测试集和SSV2-stomal，SSV2-事件和ATP-HARD基准测试的现有方法改进，这是学习视频模型时使用合成视频的第一步。

Title: Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model

Authors: Leheng Zhang, Weiyi You, Kexuan Shi, Shuhang Gu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18512
Pdf URL: https://arxiv.org/pdf/2503.18512
Copy Paste: [[2503.18512]] Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model(https://arxiv.org/abs/2503.18512)
Keywords: super-resolution
Abstract: Diffusion-based image super-resolution methods have demonstrated significant advantages over GAN-based approaches, particularly in terms of perceptual quality. Building upon a lengthy Markov chain, diffusion-based methods possess remarkable modeling capacity, enabling them to achieve outstanding performance in real-world scenarios. Unlike previous methods that focus on modifying the noise schedule or sampling process to enhance performance, our approach emphasizes the improved utilization of LR information. We find that different regions of the LR image can be viewed as corresponding to different timesteps in a diffusion process, where flat areas are closer to the target HR distribution but edge and texture regions are farther away. In these flat areas, applying slight noise is more advantageous for the reconstruction. We associate this characteristic with uncertainty and propose to apply uncertainty estimates to guide region-specific noise level control, a technique we refer to as Uncertainty-guided Noise Weighting. Pixels with lower uncertainty (i.e., flat regions) receive reduced noise to preserve more LR information, therefore improving performance. Furthermore, we modify the network architecture of previous methods to develop our Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Extensive experimental results demonstrate that, despite reduced model size and training overhead, the proposed UPSR method outperforms current state-of-the-art methods across various datasets, both quantitatively and qualitatively.
摘要：基于扩散的图像超分辨率方法比基于GAN的方法具有显着优势，尤其是在感知质量方面。基于冗长的马尔可夫链，基于扩散的方法具有显着的建模能力，从而使它们能够在现实情况下取得出色的性能。与以前着重于修改噪声时间表或采样过程以提高性能的方法不同，我们的方法强调了LR信息的改善利用率。我们发现，在扩散过程中，LR图像的不同区域可以看作是对应于不同时间步的，在该过程中，平坦的区域更接近目标HR分布，但是边缘和纹理区域更远。在这些平坦的区域，施加轻微的噪音对于重建更有利。我们将这种特征与不确定性相关联，并建议将不确定性估计用于指导特定于区域的噪声水平控制，这是我们称为不确定性引导的噪声加权的技术。不确定性较低的像素（即平面区域）会降低噪声以保留更多的LR信息，从而提高性能。此外，我们修改了先前方法的网络体系结构，以开发不确定性引导的扰动超分辨率（UPSR）模型。广泛的实验结果表明，尽管模型大小和训练开销降低，但所提出的UWSR方法在定量和定性上都优于各种数据集的当前最新方法。
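
The idea of region-specific noise levels in the abstract above can be illustrated with a small sketch on a 1D row of pixels. Everything specific here is an assumption made for illustration: the abstract does not say how uncertainty is estimated, so `local_variation` uses neighbor differences as a stand-in proxy, and the linear mapping in `noise_scales` and the values `sigma_min` and `sigma_max` are hypothetical.

```python
import random
from typing import List


def local_variation(row: List[float]) -> List[float]:
    """Hypothetical uncertainty proxy: mean absolute difference to neighbors."""
    n = len(row)
    out = []
    for i in range(n):
        left = row[max(i - 1, 0)]        # clamp at the left border
        right = row[min(i + 1, n - 1)]   # clamp at the right border
        out.append((abs(row[i] - left) + abs(row[i] - right)) / 2.0)
    return out


def noise_scales(uncertainty: List[float],
                 sigma_min: float = 0.2, sigma_max: float = 1.0) -> List[float]:
    """Linearly map uncertainty to a per-pixel noise scale in [sigma_min, sigma_max]."""
    peak = max(uncertainty)
    if peak == 0.0:
        # Completely flat input: every pixel gets the smallest noise level.
        return [sigma_min] * len(uncertainty)
    return [sigma_min + (sigma_max - sigma_min) * u / peak for u in uncertainty]


def perturb(row: List[float], scales: List[float], seed: int = 0) -> List[float]:
    """Add Gaussian noise whose standard deviation varies per pixel."""
    rng = random.Random(seed)
    return [x + s * rng.gauss(0.0, 1.0) for x, s in zip(row, scales)]
```

With this mapping, flat stretches of the row keep most of their low-resolution content, while pixels near edges receive noise closer to the full level.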

Title: RLCAD: Reinforcement Learning Training Gym for Revolution Involved CAD Command Sequence Generation

Authors: Xiaolong Yin, Xingyu Lu, Jiahang Shen, Jingzhe Ni, Hailong Li, Ruofeng Tong, Min Tang, Peng Du
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18549
Pdf URL: https://arxiv.org/pdf/2503.18549
Copy Paste: [[2503.18549]] RLCAD: Reinforcement Learning Training Gym for Revolution Involved CAD Command Sequence Generation(https://arxiv.org/abs/2503.18549)
Keywords: generation
Abstract: A CAD command sequence is a typical parametric design paradigm in 3D CAD systems where a model is constructed by overlaying 2D sketches with operations such as extrusion, revolution, and Boolean operations. Although there is growing academic interest in the automatic generation of command sequences, existing methods and datasets only support operations such as 2D sketching, extrusion, and Boolean operations. This limitation makes it challenging to represent more complex geometries. In this paper, we present a reinforcement learning (RL) training environment (gym) built on a CAD geometric engine. Given an input boundary representation (B-Rep) geometry, the policy network in the RL algorithm generates an action. This action, along with previously generated actions, is processed within the gym to produce the corresponding CAD geometry, which is then fed back into the policy network. The rewards, determined by the difference between the generated and target geometries within the gym, are used to update the RL network. Our method supports operations beyond sketches, Boolean, and extrusion, including revolution operations. With this training gym, we achieve state-of-the-art (SOTA) quality in generating command sequences from B-Rep geometries. In addition, our method improves the efficiency of command sequence generation by a factor of 39 compared with the previous training gym.
摘要：CAD命令序列是3D CAD系统中典型的参数设计范式，其中通过用挤出，革命和布尔操作等操作叠加2D草图来构建模型。尽管对自动生成命令序列的学术兴趣越来越大，但现有的方法和数据集仅支持2D草图，挤出和布尔操作等操作。这种限制使得代表更复杂的几何形状具有挑战性。在本文中，我们提出了建立在CAD几何引擎上的增强学习（RL）培训环境（健身房）。给定输入边界表示（B-REP）几何形状，RL算法中的策略网络会生成一个动作。此操作以及先前生成的动作在健身房内处理以产生相应的CAD几何形状，然后将其馈回策略网络。由健身房内生成的几何形状和目标几何形状之间的差异决定的奖励用于更新RL网络。我们的方法支持超越草图，布尔值和挤出的操作，包括革命操作。在这个培训体育馆中，我们在B-REP几何形状生成命令序列时达到了最先进的（SOTA）质量。此外，与以前的培训体育馆相比，我们的方法可以显着提高命令序列的产生效率39倍。

Title: EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation

Authors: Qiang Qu, Ming Li, Xiaoming Chen, Tongliang Liu
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.18552
Pdf URL: https://arxiv.org/pdf/2503.18552
Copy Paste: [[2503.18552]] EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation(https://arxiv.org/abs/2503.18552)
Keywords: generation
Abstract: Conditional human animation transforms a static reference image into a dynamic sequence by applying motion cues such as poses. These motion cues are typically derived from video data but are susceptible to limitations including low temporal resolution, motion blur, overexposure, and inaccuracies under low-light conditions. In contrast, event cameras provide data streams with exceptionally high temporal resolution, a wide dynamic range, and inherent resistance to motion blur and exposure issues. In this work, we propose EvAnimate, a framework that leverages event streams as motion cues to animate static human images. Our approach employs a specialized event representation that transforms asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, ensuring compatibility with diffusion models. Subsequently, a dual-branch architecture generates high-quality videos by harnessing the inherent motion dynamics of the event streams, thereby enhancing both video quality and temporal consistency. Specialized data augmentation strategies further enhance cross-person generalization. Finally, we establish a new benchmark, including simulated event data for training and validation, and a real-world event dataset capturing human actions under normal and extreme scenarios. Experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
摘要：有条件的人类动画通过应用诸如姿势等运动提示将静态参考图像转化为动态序列。这些运动提示通常源自视频数据，但易受限制，包括低颞下分辨率，运动模糊，过度暴露和不准确性在弱光条件下。相比之下，事件摄像机提供的数据流具有异常高的时间分辨率，广泛的动态范围以及对运动模糊和暴露问题的固有阻力。在这项工作中，我们提出了Evanimate，该框架利用事件作为运动提示来动画静态人类图像。我们的方法采用专门的事件表示形式，该表示将异步事件流转换为具有可控切片速率和适当切片密度的3通道切片，从而确保与扩散模型的兼容性。随后，双分支架构通过利用事件流的固有运动动态来生成高质量的视频，从而增强了视频质量和时间的一致性。专业数据增强策略进一步增强了跨人的概括。最后，我们建立了一个新的基准测试，包括用于培训和验证的模拟事件数据，以及一个现实世界中的事件数据集捕获正常和极端情况下的人类行动。实验结果表明，在传统视频衍生的线索不足的情况下，Evanimate实现了高度的时间忠诚和稳健的表现。

Title: AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Authors: Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18559
Pdf URL: https://arxiv.org/pdf/2503.18559
Copy Paste: [[2503.18559]] AMD-Hummingbird: Towards an Efficient Text-to-Video Model(https://arxiv.org/abs/2503.18559)
Keywords: generation, quality assessment
Abstract: Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
摘要：文本到视频（T2V）的一代因其从文本描述中综合现实视频的能力而引起了极大的关注。但是，现有模型努力平衡计算效率和高视觉质量，尤其是在资源有限的设备上，例如IGPU和手机。大多数先前的工作都将视觉保真度优先考虑，同时忽略了适合现实部署的较小，更有效的模型的需求。为了应对这一挑战，我们提出了一个轻巧的T2V框架，称为Hummingbird，该框架可以修剪现有模型并通过视觉反馈学习增强视觉质量。我们的方法将U-NET的大小从14亿次减少到7亿参数，在保留高质量的视频生成的同时，大大提高了效率。此外，我们介绍了一条新型的数据处理管道，该管道利用大型语言模型（LLM）和视频质量评估（VQA）模型来提高文本提示和视频数据的质量。为了支持用户驱动的培训和样式定制，我们公开发布完整的培训代码，包括数据处理和模型培训。广泛的实验表明，与诸如VideoCrafter2之类的最新模型相比，我们的方法达到了31倍的速度，同时也达到了VBENCH上最高的总分。此外，我们的方法还支持具有多达26帧的视频的生成，并解决了长期视频生成中现有基于U-NET的方法的局限性。值得注意的是，整个培训过程仅需要四个GPU，但使用现有领先方法提供了竞争性能。蜂鸟为T2V生成提供了一种实用，有效的解决方案，将高性能，可伸缩性和灵活性结合在一起。

Title: Anchor-based oversampling for imbalanced tabular data via contrastive and adversarial learning

Authors: Hadi Mohammadi, Ehsan Nazerfard, Mostafa Haghir Chehreghani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18569
Pdf URL: https://arxiv.org/pdf/2503.18569
Copy Paste: [[2503.18569]] Anchor-based oversampling for imbalanced tabular data via contrastive and adversarial learning(https://arxiv.org/abs/2503.18569)
Keywords: generative
Abstract: Imbalanced data represent a distribution in which one class (the majority) occurs more frequently than the other (the minority). This phenomenon occurs across various domains, such as security, medical care and human activity. In imbalanced learning, classification algorithms are typically inclined to classify the majority class accurately, resulting in artificially high accuracy rates. As a result, many minority samples are mistakenly labelled as majority-class instances, resulting in a bias that benefits the majority class. This study presents a framework based on boundary anchor samples to tackle the imbalanced learning challenge. First, we select and use anchor samples to train a multilayer perceptron (MLP) classifier, which acts as a prior knowledge model and aids the adversarial and contrastive learning procedures. Then, we design a novel deep generative model called Anchor Stabilized Conditional Generative Adversarial Network, or Anch-SCGAN for short. Anch-SCGAN is supported with two generators for the minority and majority classes and a discriminator incorporating additional class-specific information from the pre-trained feature extractor MLP. In addition, we facilitate the generator's training procedure in two ways. First, we define a new generator loss function based on reprocessed anchor samples and contrastive learning. Second, we apply a scoring strategy to stabilize the adversarial training part in generators. We train Anch-SCGAN and further finetune it with anchor samples to improve the precision of the generated samples. Our experiments on 16 real-world imbalanced datasets illustrate that Anch-SCGAN outperforms well-known methods in imbalanced learning.
摘要：不平衡的数据代表一个分布，其频率比另一个类别（多数）（多数）（少数）（多数）。这种现象发生在各个领域，例如安全，医疗和人类活动。在学习不平衡的学习中，分类算法通常倾向于准确地对多数类进行分类，从而导致人为的精度率。结果，许多少数样本被错误地将其标记为多数级实例，从而导致偏见受益于多数级别。这项研究提出了一个基于边界锚样本的框架，以应对不平衡学习挑战。首先，我们选择并使用锚样本来培训多层感知器（MLP）分类器，该分类器充当先验的知识模型，并有助于对抗性和对比性学习程序。然后，我们设计了一种新型的深层生成模型，称为锚定稳定有条件生成的对抗网络或锚锚。 Anch-Scgan由两个用于少数族裔和多数类的发电机支持，以及一个结合了预先训练的特征提取器MLP的其他类别信息的歧视者。此外，我们通过两种方式促进了发电机的培训程序。首先，我们根据重新处理的锚定样本和对比度学习定义了一个新的发电机损耗函数。其次，我们采用评分策略来稳定发电机中的对抗训练部分。我们训练锚固式施加，并用锚固样品进一步捕获它，以提高生成的样品的精度。我们在16个现实世界中的失衡数据集上进行的实验表明，锚固的实验表现优于不平衡学习中著名的方法。
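
The claim above that imbalance yields artificially high accuracy can be checked with a small worked example. The numbers are hypothetical and chosen only for illustration: 95 majority samples, 5 minority samples, and a trivial classifier that always predicts the majority class.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true: List[int], y_pred: List[int], positive: int) -> float:
    """Fraction of samples of class `positive` that were predicted as such."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


# Hypothetical 95/5 split: label 0 is the majority class, label 1 the minority.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that ignores its input and always answers "majority".
y_pred = [0] * 100

print(accuracy(y_true, y_pred))   # 0.95
print(recall(y_true, y_pred, 1))  # 0.0
```

The classifier scores 95% accuracy while detecting none of the minority samples, which is why per-class metrics are usually reported alongside accuracy in this setting.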

Title: Adapting Video Diffusion Models for Time-Lapse Microscopy

Authors: Alexander Holmberg, Nils Mechtel, Wei Ouyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18583
Pdf URL: https://arxiv.org/pdf/2503.18583
Copy Paste: [[2503.18583]] Adapting Video Diffusion Models for Time-Lapse Microscopy(https://arxiv.org/abs/2503.18583)
Keywords: generation, generative
Abstract: We present a domain adaptation of video diffusion models to generate highly realistic time-lapse microscopy videos of cell division in HeLa cells. Although state-of-the-art generative video models have advanced significantly for natural videos, they remain underexplored in microscopy domains. To address this gap, we fine-tune a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings of phenotype scores, and (3) image-conditioned generation, where an initial microscopy frame is extended into a complete video sequence. Evaluation using biologically meaningful morphological, proliferation, and migration metrics demonstrates that fine-tuning substantially improves realism and accurately captures critical cellular behaviors such as mitosis and migration. Notably, the fine-tuned model also generalizes beyond the training horizon, generating coherent cell dynamics even in extended sequences. However, precisely controlling specific phenotypic characteristics remains challenging, highlighting opportunities for future work to enhance conditioning methods. Our results demonstrate the potential for domain-specific fine-tuning of generative video models to produce biologically plausible synthetic microscopy data, supporting applications such as in-silico hypothesis testing and data augmentation.
摘要：我们提出了视频扩散模型的域适应性，以生成HeLa细胞中细胞分裂的高度逼真的延时显微镜视频。尽管最先进的生成视频模型已为自然视频提高了明显的发展，但它们在显微镜域中仍未被散发出来。 To address this gap, we fine-tune a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings of phenotype scores, and (3) image-conditioned generation, where an initial microscopy frame is extended into a完整的视频序列。使用生物学上有意义的形态，增殖和迁移指标进行评估表明，微调可改善现实主义并准确捕获关键的细胞行为，例如有丝分裂和迁移。值得注意的是，微调模型还概括了训练范围之外，即使在扩展序列中也会产生相干的细胞动力学。但是，精确控制特定的表型特征仍然具有挑战性，突出了将来工作以增强调理方法的机会。我们的结果表明，生成视频模型的域特异性微调可能产生生物学上合理的合成显微镜数据，从而支持应用程序，例如内部硅假说测试和数据增强。

Title: Adventurer: Exploration with BiGAN for Deep Reinforcement Learning

Authors: Yongshuai Liu, Xin Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18612
Pdf URL: https://arxiv.org/pdf/2503.18612
Copy Paste: [[2503.18612]] Adventurer: Exploration with BiGAN for Deep Reinforcement Learning(https://arxiv.org/abs/2503.18612)
Keywords: generative
Abstract: Recent developments in deep reinforcement learning have been very successful in learning complex, previously intractable problems. Sample efficiency and local optimality, however, remain significant challenges. To address these challenges, novelty-driven exploration strategies have emerged and shown promising potential. Unfortunately, no single algorithm outperforms all others in all tasks, and most of them struggle with tasks with high-dimensional and complex observations. In this work, we propose Adventurer, a novelty-driven exploration algorithm that is based on Bidirectional Generative Adversarial Networks (BiGAN), where BiGAN is trained to estimate state novelty. Intuitively, a generator that has been trained on the distribution of visited states should only be able to generate a state coming from the distribution of visited states. As a result, using the generator to reconstruct novel input states from their latent representations leads to larger reconstruction errors. We show that BiGAN performs well in estimating state novelty for complex observations. This novelty estimation method can be combined with intrinsic-reward-based exploration. Our empirical results show that Adventurer produces competitive results on a range of popular benchmark tasks, including continuous robotic manipulation tasks (e.g., MuJoCo robotics) and high-dimensional image-based tasks (e.g., Atari games).
摘要：深度强化学习的最新发展在学习复杂，以前棘手的问题方面非常成功。但是，样本效率和当地最优性仍然是重大挑战。为了应对这些挑战，新颖驱动的探索策略已经出现并显示出了有希望的潜力。不幸的是，在所有任务中，没有任何单一算法的表现都胜过所有其他算法，并且大多数算法在具有高维和复杂观察的任务上挣扎。在这项工作中，我们提出了冒险家，这是一种基于双向生成对抗网络（BIGAN）的新颖驱动探索算法，在该算法中，Bigan受过训练以估计状态新颖性。直觉上，已经接受过访问州分布的培训的发电机只能产生来自访问州分布的国家。结果，新颖的状态使用发电机从某些潜在表示会导致更大的重建错误重建输入状态。我们表明，Bigan在估计复杂观测的状态新颖性方面表现良好。这种新颖的估计方法可以与基于奖励的探索结合使用。我们的经验结果表明，冒险家在一系列流行的基准任务上产生竞争成果，包括连续的机器人操纵任务（例如Mujoco Robotics）和基于图像的高维任务（例如Atari Games）。
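
The reconstruction-error intuition in the abstract above can be sketched as follows. This is not the Adventurer algorithm itself: `make_toy_autoencoder` is a hypothetical stand-in for a trained BiGAN encoder and generator (it simply snaps a state to the nearest visited state), and the bonus weight `beta` is an assumed parameter.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]


def make_toy_autoencoder(
    visited: List[State],
) -> Tuple[Callable[[State], int], Callable[[int], State]]:
    """Stand-in for a trained encoder/generator pair (hypothetical).

    The 'latent code' is the index of the nearest visited state, and the
    'generator' can only emit states it has seen, mimicking a generator
    trained on the distribution of visited states.
    """
    def encode(x: State) -> int:
        return min(range(len(visited)), key=lambda i: math.dist(x, visited[i]))

    def generate(z: int) -> State:
        return visited[z]

    return encode, generate


def novelty(x: State, encode, generate) -> float:
    """Novelty score: distance between a state and its reconstruction."""
    return math.dist(x, generate(encode(x)))


def shaped_reward(extrinsic: float, x: State, encode, generate,
                  beta: float = 0.1) -> float:
    """Environment reward plus a novelty bonus weighted by beta."""
    return extrinsic + beta * novelty(x, encode, generate)
```

A state that has already been visited reconstructs perfectly and earns no bonus, while a state far from anything visited reconstructs poorly and earns a larger one, which is the signal an intrinsic-reward method would use to steer exploration.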

Title: Generative Dataset Distillation using Min-Max Diffusion Model

Authors: Junqiao Fan, Yunjiao Zhou, Min Chang Jordan Ren, Jianfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18626
Pdf URL: https://arxiv.org/pdf/2503.18626
Copy Paste: [[2503.18626]] Generative Dataset Distillation using Min-Max Diffusion Model(https://arxiv.org/abs/2503.18626)
Keywords: generation, generative
Abstract: In this paper, we address the problem of generative dataset distillation that utilizes generative models to synthesize images. The generator may produce any number of images under a preserved evaluation time. In this work, we leverage the popular diffusion model as the generator to compute a surrogate dataset, boosted by a min-max loss to control the dataset's diversity and representativeness during training. However, the diffusion model is time-consuming when generating images, as it requires an iterative generation process. We observe a critical trade-off between the number of image samples and the image quality controlled by the diffusion steps and propose Diffusion Step Reduction to achieve optimal performance. This paper details our comprehensive method and its performance. Our model achieved 2nd place in the generative track of The First Dataset Distillation Challenge of ECCV2024 (this https URL), demonstrating its superior performance.
摘要：在本文中，我们解决了使用生成模型合成图像的生成数据集蒸馏的问题。发电机可以在保留的评估时间下产生任意数量的图像。在这项工作中，我们利用流行的扩散模型作为生成器来计算替代数据集，并以最小最大损失的形式提高了替代数据集，以控制培训期间数据集的多样性和代表性。但是，扩散模型在生成图像时耗时，因为它需要迭代生成过程。我们观察到图像样本的数量与通过扩散步骤控制的图像质量之间的关键权衡，并提出了降低扩散步骤以实现最佳性能。本文详细介绍了我们的综合方法及其性能。我们的模型达到了$ 2^{nd} $放置在\ href {this HTTPS url} {ECCV2024}的第一个数据集蒸馏挑战中的生成曲目中，证明了其出色的性能。

Title: Dig2DIG: Dig into Diffusion Information Gains for Image Fusion

Authors: Bing Cao, Baoshuo Cai, Changqing Zhang, Qinghua Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18627
Pdf URL: https://arxiv.org/pdf/2503.18627
Copy Paste: [[2503.18627]] Dig2DIG: Dig into Diffusion Information Gains for Image Fusion(https://arxiv.org/abs/2503.18627)
Keywords: generative
Abstract: Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, while lacking theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions with denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.
摘要：图像融合整合了来自多源图像的互补信息，以产生更多信息结果。最近，在图像融合中探索了表现出前所未有的生成潜力的扩散模型。但是，这些方法通常将预定义的多模式指导纳入扩散中，未能捕获每种模式的动态变化，同时缺乏理论保证。为了解决这个问题，我们揭示了图像denoising的显着时空失衡。具体而言，扩散模型在不同的图像区域中产生动态信息，并具有降解步骤。基于此观察结果，我们深入研究扩散信息获得（DIG2DIG），理论上得出了基于扩散的动态图像融合框架，该框架可证明可以减少概括误差的上限。因此，我们介绍了扩散信息获得（DIG），以量化各种模式的信息贡献，从而在融合过程中提供动态指导。关于多种融合方案的广泛实验证实，就融合质量和推理效率而言，我们的方法优于现有基于扩散的方法。

Title: Leveraging Land Cover Priors for Isoprene Emission Super-Resolution

Authors: Christopher Ummerle, Antonio Giganti, Sara Mandelli, Paolo Bestagini, Stefano Tubaro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18658
Pdf URL: https://arxiv.org/pdf/2503.18658
Copy Paste: [[2503.18658]] Leveraging Land Cover Priors for Isoprene Emission Super-Resolution(https://arxiv.org/abs/2503.18658)
Keywords: super-resolution
Abstract: Remote sensing plays a crucial role in monitoring Earth's ecosystems, yet satellite-derived data often suffer from limited spatial resolution, restricting their applicability in atmospheric modeling and climate research. In this work, we propose a deep learning-based Super-Resolution (SR) framework that leverages land cover information to enhance the spatial accuracy of Biogenic Volatile Organic Compounds (BVOCs) emissions, with a particular focus on isoprene. Our approach integrates land cover priors as emission drivers, capturing spatial patterns more effectively than traditional methods. We evaluate the model's performance across various climate conditions and analyze statistical correlations between isoprene emissions and key environmental information such as cropland and tree cover data. Additionally, we assess the generalization capabilities of our SR model by applying it to unseen climate zones and geographical regions. Experimental results demonstrate that incorporating land cover data significantly improves emission SR accuracy, particularly in heterogeneous landscapes. This study contributes to atmospheric chemistry and climate modeling by providing a cost-effective, data-driven approach to refining BVOC emission maps. The proposed method enhances the usability of satellite-based emissions data, supporting applications in air quality forecasting, climate impact assessments, and environmental studies.
摘要：遥感在监测地球生态系统中起着至关重要的作用，但是卫星衍生的数据通常遭受有限的空间分辨率，从而限制了它们在大气建模和气候研究中的适用性。在这项工作中，我们提出了一个深度学习的超分辨率（SR）框架，该框架利用土地覆盖信息来增强生物挥发性有机化合物（BVOC）排放的空间准确性，并特别关注异戊二烯。我们的方法将土地覆盖的先验作为排放驱动器，比传统方法更有效地捕获空间模式。我们在各种气候条件下评估了模型的性能，并分析异戊二烯排放和关键环境信息（例如农田和树木覆盖数据）之间的统计相关性。此外，我们通过将其应用于看不见的气候区域和地理区域来评估SR模型的概括能力。实验结果表明，纳入土地覆盖数据可显着提高发射SR准确性，尤其是在异质景观中。这项研究通过为精炼BVOC排放图提供了具有成本效益的数据驱动方法来为大气化学和气候建模做出贡献。提出的方法增强了基于卫星的排放数据的可用性，支持空气质量预测，气候影响评估和环境研究的应用。

Title: Human Motion Unlearning

Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18674
Pdf URL: https://arxiv.org/pdf/2503.18674
Copy Paste: [[2503.18674]] Human Motion Unlearning(https://arxiv.org/abs/2503.18674)
Keywords: generative
Abstract: We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable for the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: this https URL.
摘要：我们介绍了人类运动的任务，以防止有毒动画的综合，同时保留一般的文本到动作生成性能。不学习有毒动作是具有挑战性的，因为可以通过明确的文本提示和安全动作的隐式有毒组合产生这些动作（例如，``踢'是``'''''loading and loading and tove腿'）。我们通过从HumanML3D和Motion-X的最新文本到动作数据集中过滤有毒动作来提出第一个运动基准测试。我们通过调整最先进的图像未学习技术来处理时空信号来提出基准。最后，我们提出了一个基于潜在代码替代的新型运动模型，我们将其配置为LCR。 LCR是无训练的，适用于最先进的文本到运动扩散模型的离散潜在空间。 LCR在定性和定量上始终如一，并且始终如一地优于基本线。项目页面：\ href {此https url} {this https url}。

Title: NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

Authors: Tianyi Wang, Harry Cheng, Xiao Zhang, Yinglong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18678
Pdf URL: https://arxiv.org/pdf/2503.18678
Copy Paste: [[2503.18678]] NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping(https://arxiv.org/abs/2503.18678)
Keywords: generative
Abstract: Because passive detection of high-quality Deepfake images suffers from performance bottlenecks caused by the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.
摘要：由于生成模型的发展，在被动检测高质量的深层图像时，积极的扰动在被动检测高质量的深层图像中，主动扰动通过将信号插入良性图像中禁用深层操纵的方法。但是，现有的主动扰动方法在几个方面仍然不令人满意：1）直接元素添加引起的视觉降解； 2）有限的效果防止面部交换操作； 3）不可避免地依赖白色和灰色盒子设置在训练过程中涉及生成模型。在这项研究中，我们分析了DeepFake面部交换的本质，并认为必须保护源身份而不是目标图像的必要性，并且我们提出了Nullswap，Nullswap是一种新型的主动防御方法，该方法掩盖了源图像身份，并在纯黑色盒子场景下面对交换。我们设计一个身份提取模块以从源图像获得面部身份特征，然后设计一个扰动块以相应地生成身份引导的扰动。同时，特征块提取物浅层图像特征，然后将其与套件中的扰动融合以进行图像重建。此外，为了确保面部交换算法中不同身份提取器的适应性，我们提出动态损失重量以适应平衡身份损失。实验证明了我们的方法愚弄各种身份识别模型的出色能力，在防止面部交换模型生成具有正确源身份的图像的情况下优于最先进的主动扰动。

Title: Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

Authors: Inseung Hwang, Kiseok Choi, Hyunho Ha, Min H. Kim
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.18705
Pdf URL: https://arxiv.org/pdf/2503.18705
Copy Paste: [[2503.18705]] Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis(https://arxiv.org/abs/2503.18705)
Keywords: super-resolution
Abstract: Snapshot polarization imaging calculates polarization states from linearly polarized subimages. To achieve this, a polarization camera employs a double Bayer-patterned sensor to capture both color and polarization. It demonstrates low light efficiency and low spatial resolution, resulting in increased noise and compromised polarization measurements. Although burst super-resolution effectively reduces noise and enhances spatial resolution, applying it to polarization imaging poses challenges due to the lack of tailored datasets and reliable ground truth noise statistics. To address these issues, we introduce PolarNS and PolarBurstSR, two innovative datasets developed specifically for polarization imaging. PolarNS provides characterization of polarization noise statistics, facilitating thorough analysis, while PolarBurstSR functions as a benchmark for burst super-resolution in polarization images. These datasets, collected under various real-world conditions, enable comprehensive evaluation. Additionally, we present a model for analyzing polarization noise to quantify noise propagation, tested on a large dataset captured in a darkroom environment. As part of our application, we compare the latest burst super-resolution models, highlighting the advantages of training tailored to polarization compared to RGB-based methods. This work establishes a benchmark for polarization burst super-resolution and offers critical insights into noise propagation, thereby enhancing polarization image reconstruction.
摘要：快照极化成像从线性极化子图像计算极化状态。为此，极化摄像头采用双重拜耳的传感器来捕获颜色和极化。它表明了低光效率和低空间分辨率，从而增加了噪声和损害极化测量值。尽管爆发的超分辨率有效地降低了噪声并增强了空间分辨率，但由于缺乏量身定制的数据集和可靠的地面真相噪声统计，将其应用于极化成像会带来挑战。为了解决这些问题，我们介绍了极地和PolarburstSR，这是两个专门用于极化成像的创新数据集。极地提供了极化噪声统计的表征，促进了彻底的分析，而极性爆炸是极化图像中爆发超分辨率的基准。这些数据集在各种现实世界中收集，可以进行全面的评估。此外，我们提出了一个用于分析极化噪声以量化噪声传播的模型，该模型在暗室环境中捕获的大数据集上进行了测试。作为应用程序的一部分，我们比较了最新的爆发超分辨率模型，并强调了与基于RGB的方法相比，针对极化量身定制的训练的优势。这项工作为极化爆发超分辨率建立了基准，并为噪声传播提供了关键的见解，从而增强了极化图像的重建。
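
To illustrate how polarization states can be computed from linearly polarized subimages, the following minimal sketch applies the standard linear Stokes relations to intensities measured behind polarizers at 0°, 45°, 90°, and 135°. The function name and the per-pixel scalar interface are illustrative assumptions and are not taken from the paper.

```python
import math

def linear_stokes(i0, i45, i90, i135):
    """Per-pixel linear Stokes parameters from four polarizer-angle intensities.

    Hypothetical helper for illustration; inputs are scalar intensities.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # 0 deg versus 90 deg component
    s2 = i45 - i135                      # 45 deg versus 135 deg component
    # Degree of linear polarization (guard against a dark pixel).
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    # Angle of linear polarization in radians.
    aolp = 0.5 * math.atan2(s2, s1)
    return s0, s1, s2, dolp, aolp
```

Because `s1` and `s2` are differences of separately measured intensities, sensor noise in each subimage carries over into the derived degree and angle of polarization, which is the kind of noise propagation the abstract refers to.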

Title: GS-Marker: Generalizable and Robust Watermarking for 3D Gaussian Splatting

Authors: Lijiang Li, Jinglu Wang, Xiang Ming, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.18718
Pdf URL: https://arxiv.org/pdf/2503.18718
Copy Paste: [[2503.18718]] GS-Marker: Generalizable and Robust Watermarking for 3D Gaussian Splatting(https://arxiv.org/abs/2503.18718)
Keywords: generative
Abstract: In the Generative AI era, safeguarding 3D models has become increasingly urgent. While invisible watermarking is well-established for 2D images with encoder-decoder frameworks, generalizable and robust solutions for 3D remain elusive. The main difficulty arises from the renderer between the 3D encoder and 2D decoder, which disrupts direct gradient flow and complicates training. Existing 3D methods typically rely on per-scene iterative optimization, resulting in time inefficiency and limited generalization. In this work, we propose a single-pass watermarking approach for 3D Gaussian Splatting (3DGS), a well-known yet underexplored representation for watermarking. We identify two major challenges: (1) ensuring effective training generalized across diverse 3D models, and (2) reliably extracting watermarks from free-view renderings, even under distortions. Our framework, named GS-Marker, incorporates a 3D encoder to embed messages, distortion layers to enhance resilience against various distortions, and a 2D decoder to extract watermarks from renderings. A key innovation is the Adaptive Marker Control mechanism that adaptively perturbs the initially optimized 3DGS, escaping local minima and improving both training stability and convergence. Extensive experiments show that GS-Marker outperforms per-scene training approaches in terms of decoding accuracy and model fidelity, while also significantly reducing computation time.
摘要：在生成的AI时代，保护3D模型变得越来越紧迫。尽管具有编码器框架的2D图像已确立了无形的水印，但对于3D来说，可推广和可靠的解决方案仍然难以捉摸。主要困难来自3D编码器和2D解码器之间的渲染器，这破坏了直接梯度流并使训练变得复杂。现有的3D方法通常依赖于每场迭代优化，从而导致时间效率低下和有限的概括。在这项工作中，我们为3D高斯脱落（3DGS）提出了一种单次水印方法，这是一种众所周知但尚未散发出的水印代表。我们确定了两个主要挑战：（1）确保在不同3D模型中广泛进行的有效培训，以及（2）即使在扭曲下，也可靠地从自由视图渲染中提取水印。我们的框架名为GS-Marker，将3D编码器与嵌入式消息，失真层相结合，以增强针对各种扭曲的弹性，以及一个2D解码器，以从渲染中提取水印。一个关键的创新是自适应标记控制机制，该机制可适应最初优化的3DG，逃脱局部最小值并提高训练稳定性和收敛性。广泛的实验表明，GS-Marker在解码准确性和模型保真度方面的表现优于每场训练方法，同时也大大减少了计算时间。

Title: Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings

Authors: Cong Liu, Liang Hou, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18719
Pdf URL: https://arxiv.org/pdf/2503.18719
Copy Paste: [[2503.18719]] Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings(https://arxiv.org/abs/2503.18719)
Keywords: generation
Abstract: Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those used during training. While existing methods have employed techniques such as interpolation, extrapolation, or their combinations, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning positional order of image patches instead of the specific distances between them, enabling seamless high- and low-resolution image generation without requiring high- and low-resolution image training. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all position encodings encountered during the inference phase have been trained, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of position order. To address the issue of image cropping caused by the augmentation, we introduce corresponding micro-conditioning to enable the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of $256 \times 256$ and inferred at $384 \times 384$ and $512 \times 512$, as well as when scaling from $512 \times 512$ to $768 \times 768$ and $1024 \times 1024$. It also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration and multi-resolution inheritance.
摘要：图像生成任务中的分辨率泛化可以以较低的训练分辨率开销生成更高分辨率的图像。但是，分辨率泛化（尤其是在广泛使用的扩散Transformer中）的重大挑战在于在测试过程中遇到的位置编码与训练过程中使用的位置编码之间的不匹配。尽管现有方法采用了插值，外推或组合等技术，但没有一个完全解决了这个问题。在本文中，我们提出了一种新颖的二维随机位置编码（RPE-2D）框架，该框架着重于学习图像块的位置顺序，而不是它们之间的特定距离，从而无需高、低分辨率图像训练即可实现无缝的高、低分辨率图像生成。具体而言，RPE-2D独立选择沿水平和垂直轴的更广泛范围的位置，以确保在推理阶段对所有位置编码进行训练，从而改善分辨率的泛化。此外，我们提出了一种随机数据增强技术，以增强位置顺序的建模。为了解决由增强引起的图像裁剪的问题，我们介绍了相应的微条件，以使模型能够感知特定的裁剪模式。在ImageNet数据集上，我们提出的RPE-2D实现了最先进的分辨率泛化性能：无论是在 $256 \times 256$ 分辨率下训练并在 $384 \times 384$ 和 $512 \times 512$ 下推理，还是从 $512 \times 512$ 扩展到 $768 \times 768$ 和 $1024 \times 1024$，均优于现有的竞争方法。它还在低分辨率图像生成，多阶段训练加速和多分辨率继承方面表现出了出色的能力。
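
The idea of learning positional order rather than exact distances can be sketched as follows. This is a minimal illustration, assuming that positions along each axis are drawn without replacement from a range larger than the training grid and then sorted; the function name, the sorted-sampling rule, and the `max_pos` parameter are assumptions, and the paper's actual sampling procedure may differ.

```python
import random

def sample_positions_2d(height, width, max_pos, rng=None):
    """Assign each patch in a height x width grid a randomized (row, col) position.

    Positions along each axis are sampled independently from range(max_pos)
    and sorted, so relative order is preserved while spacing varies.
    """
    rng = rng or random.Random()
    rows = sorted(rng.sample(range(max_pos), height))
    cols = sorted(rng.sample(range(max_pos), width))
    return [[(r, c) for c in cols] for r in rows]

# Example: a 4 x 4 training grid drawn from a 16 x 16 position range.
grid = sample_positions_2d(4, 4, 16, random.Random(0))
```

Under this sketch, repeated training draws eventually touch every index in the larger range, so a larger inference grid would reuse indices that were already seen during training.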

Title: Simulation-Driven Balancing of Competitive Game Levels with Reinforcement Learning

Authors: Florian Rupp, Manuel Eberhardinger, Kai Eckert
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.18748
Pdf URL: https://arxiv.org/pdf/2503.18748
Copy Paste: [[2503.18748]] Simulation-Driven Balancing of Competitive Game Levels with Reinforcement Learning(https://arxiv.org/abs/2503.18748)
Keywords: generation
Abstract: The balancing process for game levels in competitive two-player contexts involves a lot of manual work and testing, particularly for non-symmetrical game levels. In this work, we frame game balancing as a procedural content generation task and propose an architecture for automatically balancing tile-based levels within the PCGRL framework (procedural content generation via reinforcement learning). Our architecture is divided into three parts: (1) a level generator, (2) a balancing agent, and (3) a reward modeling simulation. Through repeated simulations, the balancing agent receives rewards for adjusting the level towards a given balancing objective, such as equal win rates for all players. To this end, we propose new swap-based representations to improve the robustness of playability, thereby enabling agents to balance game levels more effectively and quickly than with traditional PCGRL. By analyzing the agent's swapping behavior, we can infer which tile types have the most impact on the balance. We validate our approach in the Neural MMO (NMMO) environment in a competitive two-player scenario. In this extended conference paper, we present improved results, explore the applicability of the method to various forms of balancing beyond equal balancing, compare the performance to another search-based approach, and discuss the application of existing fairness metrics to game balancing.
摘要：在竞争激烈的两人环境中，游戏水平的平衡过程涉及大量的手动工作和测试，尤其是对于非对称的游戏水平。在这项工作中，我们将游戏平衡作为程序性内容生成任务进行构架，并为在PCGRL框架（通过强化学习过程生成过程生成）中自动平衡基于瓷砖的级别的体系结构。我们的架构分为三个部分：（1）级别发生器，（2）平衡剂，以及（3）奖励建模模拟。通过反复的模拟，平衡代理将获得奖励，以调整给定平衡目标的水平，例如所有玩家的胜利率。为此，我们提出了新的基于互换的表示，以提高可玩性的鲁棒性，从而使代理人能够更有效，与传统的PCGRL相比，更有效地平衡了游戏水平。通过分析代理的交换行为，我们可以推断哪些瓷砖类型对平衡产生最大的影响。我们在竞争激烈的两人场景中验证了在神经MMO（NMMO）环境中的方法。在这篇扩展的会议论文中，我们提出了改进的结果，探讨了该方法对平衡平衡的各种形式的适用性，将绩效与另一种基于搜索的方法进行比较，并讨论现有的公平指标在游戏平衡中的应用。
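
The two ingredients named in the abstract, a simulation-derived balance reward and a swap-based action, can be sketched as below. This is only an illustration under stated assumptions: the reward is taken to be the negative distance between the simulated win rate and a target rate, and a swap exchanges two tiles in a 2D grid. Both function names and the reward shape are hypothetical and not taken from the paper.

```python
def balance_reward(wins_p1, n_games, target=0.5):
    """Negative distance between player 1's simulated win rate and the target.

    A value of 0 means the level meets the balancing objective exactly.
    """
    win_rate = wins_p1 / n_games
    return -abs(win_rate - target)

def swap_tiles(level, pos_a, pos_b):
    """Return a copy of the level (list of rows) with two tiles exchanged."""
    new_level = [row[:] for row in level]
    (ra, ca), (rb, cb) = pos_a, pos_b
    new_level[ra][ca], new_level[rb][cb] = new_level[rb][cb], new_level[ra][ca]
    return new_level
```

A swap leaves the number of tiles of each type unchanged, which is one plausible reason a swap-based representation is less likely to break playability than replacing tiles outright. Setting `target` to a value other than 0.5 would correspond to balancing objectives beyond equal win rates.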

Title: 3DSwapping: Texture Swapping For 3D Object From Single Reference Image

Authors: Xiao Cao, Beibei Lin, Bo Wang, Zhiyong Huang, Robby T. Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18853
Pdf URL: https://arxiv.org/pdf/2503.18853
Copy Paste: [[2503.18853]] 3DSwapping: Texture Swapping For 3D Object From Single Reference Image(https://arxiv.org/abs/2503.18853)
Keywords: generation
Abstract: 3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.
摘要：3D纹理交换允许自定义3D对象纹理，从而在3D编辑中实现高效且通用的视觉转换。尽管不存在专用方法，但适应的2D编辑和文本驱动的3D编辑方法可以实现此目的。但是，2D编辑需要逐帧的操作，导致视图上的不一致，而文本驱动的3D编辑斗争以从参考图像中保留纹理特征。为了应对这些挑战，我们介绍了3DSWAPPEN，这是一种集成的3D纹理交换方法：1）渐进生成，2）查看一致性梯度指导和3）迅速调整的梯度指导。为了确保查看一致性，我们的渐进生成过程首先要编辑单个参考图像并逐渐传播到相邻视图。我们的视图一致性梯度指导通过根据一致和不一致的输出之间的特征差异来调节生成模型，从而进一步增强了一致性。为了保留纹理特征，我们介绍了基于迅速调整的梯度指导，该指导学会了一个准确捕获参考图像和3D对象之间差异的令牌。然后，该代币引导编辑过程，确保跨视图的更一致的纹理保存。总体而言，3DSWAPPENT集成了这些新颖的策略，以实现更高的纹理转移，同时在多种观点跨越结构上的结合。广泛的定性和定量评估证实，我们的三个新颖组成部分使3D对象具有令人信服且有效的2D纹理交换。代码将在接受后提供。

Title: Reasoning to Learn from Latent Thoughts

Authors: Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.18866
Pdf URL: https://arxiv.org/pdf/2503.18866
Copy Paste: [[2503.18866]] Reasoning to Learn from Latent Thoughts(https://arxiv.org/abs/2503.18866)
Keywords: generation
Abstract: Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process and assumes that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7\% $\rightarrow$ 25.4\% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
摘要：计算语言模型（LM）预处理的缩放量已经超过了人写的文本的增长，因此担心数据将成为LM缩放的瓶颈。为了继续在此数据约束的制度中进行预测，我们建议明确建模和推断文本生成过程构成的潜在思想可以显着提高预处理的数据效率。从直觉上讲，我们的方法将Web文本视为冗长人类思维过程的压缩最终结果，并且潜在思想包含重要的上下文知识和推理步骤，这些步骤对数据有效学习至关重要。我们从经验上通过数据约束的数学预处理来证明我们的方法的有效性。我们首先表明，推断潜在思想的合成数据方法可显着提高数据效率，优于在相同数量的原始数据上的训练（在MATH上 5.7\% $\rightarrow$ 25.4\%）。此外，我们在没有强大的老师的情况下展示了潜在的思想推断，在该思想中，LM引导通过使用EM算法来迭代地提高了受过训练的LM的能力以及经过思考的预测数据的质量。我们表明，1B LM可以在至少三个迭代中引导其性能，并且可以显着优于对原始数据训练的基线，并且在执行E-Step时的其他推理计算会增加。从推理缩放和EM迭代中获得的收益提出了扩展数据约束预处理的新机会。

Title: A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery

Authors: Runze Cheng, Yao Sun, Lan Zhang, Lei Feng, Lei Zhang, Muhammad Ali Imran
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.18874
Pdf URL: https://arxiv.org/pdf/2503.18874
Copy Paste: [[2503.18874]] A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery(https://arxiv.org/abs/2503.18874)
Keywords: generation, generative
Abstract: With the significant advances in generative AI (GAI) and the proliferation of mobile devices, providing high-quality AI-generated content (AIGC) services via wireless networks is becoming a key future direction. However, the primary challenges of AIGC service delivery in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources. In this paper, we employ semantic communication (SemCom) in diffusion-based GAI models to propose a Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks. Specifically, to relieve the communication resource bottleneck, SemCom is utilized to prioritize semantic information of the generated content. Then, to improve computational resource utilization on both the edge and local sides and to reduce AIGC semantic distortion in transmission, modified diffusion-based models are applied to adjust the computing workload and semantic density in cooperative content generation. Simulations verify the superiority of our proposed ROUTE in terms of latency and content quality compared to conventional AIGC approaches.
摘要：随着生成AI（GAI）的重大进展和移动设备的扩散，通过无线网络提供高质量的AI生成内容（AIGC）服务正在成为未来的方向。但是，无线网络中AIGC服务交付的主要挑战在于不稳定的频道，有限的带宽资源以及分布不均的计算资源。在本文中，我们在基于扩散的GAI模型中采用语义通信（SEMCOM），以在动态无线网络中提供AIGC交付的资源感知工作负载可调收发器（路由）。具体来说，为了减轻通信资源瓶颈，SEMCOM用于优先考虑生成内容的语义信息。然后，为了改善边缘和局部的计算资源利用，并减少传输中的AIGC语义失真，应用了基于扩散的模型来调整计算工作负载和合作内容生成中的语义密度。与常规AIGC方法相比，模拟在延迟和内容质量方面验证了我们提议的途径的优势。

Title: CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Authors: Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18886
Pdf URL: https://arxiv.org/pdf/2503.18886
Copy Paste: [[2503.18886]] CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models(https://arxiv.org/abs/2503.18886)
Keywords: generation
Abstract: Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at this http URL)
摘要：无分类器引导（CFG）是扩散/流模型中广泛采用的技术，可提高图像保真度和可控性。在这项工作中，我们首先分析CFG对可以得出地面真相流的高斯混合物的流量匹配模型的影响。我们观察到，在训练的早期阶段，当流量估计不准确时，CFG将样品指向不正确的轨迹。在此观察结果的基础上，我们提出了CFG-Zero *，这是一种改进的CFG，具有两个贡献：（a）优化的量表，在其中优化标量以纠正估计速度中的不准确性，因此名称为 *；（b）零输入，其中涉及将ODE求解器的前几个步骤归零。对文本形象（Lumina-Next，稳定扩散3和通量）和文本对视频（WAN-2.1）的生成的实验表明，CFG-Zero*始终超过CFG，突出了其在指导流量匹配模型中的有效性。（代码可在此HTTP URL上找到）
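
The two contributions described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the optimized scale is taken to be the least-squares scalar that projects the conditional velocity onto the unconditional one, the guided velocity is formed around the rescaled unconditional velocity, and zero-init returns a zero velocity for the first `zero_init_steps` solver steps. The function names, the plain-list vector representation, and these exact formulas are assumptions for illustration; the paper should be consulted for the precise definitions.

```python
def optimized_scale(v_cond, v_uncond, eps=1e-8):
    """Least-squares scalar s minimizing ||v_cond - s * v_uncond||^2."""
    dot = sum(c * u for c, u in zip(v_cond, v_uncond))
    norm_sq = sum(u * u for u in v_uncond)
    return dot / (norm_sq + eps)

def guided_velocity(v_cond, v_uncond, w, step, zero_init_steps=1):
    """Guided velocity with a rescaled unconditional branch and zero-init."""
    if step < zero_init_steps:
        # Zero-init: skip the earliest steps, where the estimate is least reliable.
        return [0.0] * len(v_cond)
    s = optimized_scale(v_cond, v_uncond)
    # Standard CFG combination, with v_uncond replaced by s * v_uncond.
    return [s * u + w * (c - s * u) for c, u in zip(v_cond, v_uncond)]
```

With `s` fixed to 1 and `zero_init_steps` set to 0, this reduces to the usual classifier-free guidance combination `u + w * (c - u)`.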

Title: Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

Authors: Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18923
Pdf URL: https://arxiv.org/pdf/2503.18923
Copy Paste: [[2503.18923]] Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models(https://arxiv.org/abs/2503.18923)
Keywords: generation
Abstract: Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work is distinguished from existing video benchmarks by the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the explicit narrative; 2) Fact-seeking question: targeting objective, undisputed events or relationships, avoiding subjective interpretation; 3) Definitive & short-form answer: Answers are crafted as unambiguous and definitively correct in a short format, enabling automated evaluation through LLM-as-a-judge frameworks with minimal scoring variance; 4) External-source verified: All annotations undergo rigorous validation against authoritative external references to ensure reliability; 5) Temporal reasoning required: The annotated question types encompass both static single-frame understanding and dynamic temporal reasoning, explicitly evaluating LVLMs' factuality under long-context dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, particularly for open-source models. The best-performing model, Gemini-1.5-Pro, achieves an F-score of merely 54.4%; 2) Test-time compute paradigms show insignificant performance gains, revealing fundamental constraints for enhancing factuality through post-hoc computation; 3) Retrieval-Augmented Generation demonstrates consistent improvements at the cost of additional inference time overhead, presenting a critical efficiency-performance trade-off.
摘要：大型视频语言模型（LVLM）的最新进展突出了它们具有多模式理解的潜力，但在视频环境中评估其事实基础仍然是一个关键的未解决的挑战。为了解决这一差距，我们介绍了Video SimpleQA，这是第一个针对LVLM的事实评估的量身定制的综合基准。我们的工作通过以下关键特征将现有视频基准区分开：1）所需的知识：要求将外部知识整合到明确的叙述之外； 2）寻求事实的问题：针对目标，无可争议的事件或关系，避免主观解释； 3）确定的和短形式的答案：答案以简短的格式制作为明确的，明确的正确正确，从而通过LLM-AS-A-A-Gudge框架具有最小的评分差异来实现自动化评估； 4）经过外部源：所有注释均经过对权威外部参考的严格验证以确保可靠性； 5）需要时间推理：带注释的问题类型涵盖静态单帧理解和动态时间推理，明确评估LVLMS在长期依赖项下的事实。我们广泛评估了41个最先进的LVLM，并总结了关键发现，如下所示：1）当前的LVLM在事实上遵守中表现出明显的缺陷，尤其是对于开源模型。表现最佳的型号Gemini-1.5-Pro仅达到54.4％的F得分； 2）测试时间计算范例显示出微不足道的性能增长，揭示了通过事后计算增强事实的基本限制； 3）检索效果的一代以额外的推理时间开销的成本表现出一致的改进，这是一个关键的效率 - 性能取舍。

Title: Training-free Diffusion Acceleration with Bottleneck Sampling

Authors: Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18940
Pdf URL: https://arxiv.org/pdf/2503.18940
Copy Paste: [[2503.18940]] Training-free Diffusion Acceleration with Bottleneck Sampling(https://arxiv.org/abs/2503.18940)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics. Code is available at: this https URL
摘要：扩散模型在视觉内容生成中表现出了显着的功能，但由于推断期间的高计算成本，部署仍然具有挑战性。这种计算负担主要源于自注意力相对于图像或视频分辨率的二次复杂性。尽管现有的加速方法通常会损害输出质量或需要昂贵的重新训练，但我们观察到大多数扩散模型在较低的分辨率下进行了预训练，这提供了利用这些低分辨率先验的机会，以进行更有效的推断，而不会降低性能。在这项工作中，我们介绍了瓶颈采样（Bottleneck Sampling），这是一个无训练的框架，利用低分辨率的先验来减少计算开销，同时保留输出保真度。瓶颈采样遵循高-低-高的去噪工作流程：它在初始和最终阶段执行高分辨率去噪，同时在中间步骤以较低分辨率进行操作。为了减轻混叠和模糊的伪影，我们进一步完善了分辨率的过渡点，并在每个阶段自适应地调整去噪时间步。我们在图像和视频生成任务上评估了瓶颈采样，其中广泛的实验表明，它将图像生成的推理加速最多 3$\times$，将视频生成的推理加速最多 2.5$\times$，同时在多个评估指标上保持与标准全分辨率采样过程相当的输出质量。代码可用：此HTTPS URL
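
The high-low-high workflow can be pictured as a per-step resolution schedule. The sketch below is a simplified illustration: the function name and the `n_head` and `n_tail` parameters (how many steps run at high resolution at the start and end) are hypothetical, and the refinements mentioned in the abstract, namely tuned transition points and adaptive timestep shifting, are deliberately left out.

```python
def resolution_schedule(num_steps, high_res, low_res, n_head, n_tail):
    """Return the resolution used at each denoising step (high-low-high)."""
    schedule = []
    for step in range(num_steps):
        in_head = step < n_head                 # initial high-resolution stage
        in_tail = step >= num_steps - n_tail    # final high-resolution stage
        schedule.append(high_res if in_head or in_tail else low_res)
    return schedule

# Example: 6 steps, with 2 high-resolution steps at each end.
# resolution_schedule(6, 1024, 512, 2, 2) -> [1024, 1024, 512, 512, 1024, 1024]
```

Since self-attention cost grows quadratically with the number of tokens, any step moved to the lower resolution is substantially cheaper, which is where the speed-up in such a schedule would come from.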

Title: Video-T1: Test-Time Scaling for Video Generation

Authors: Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.18942
Pdf URL: https://arxiv.org/pdf/2503.18942
Copy Paste: [[2503.18942]] Video-T1: Test-Time Scaling for Video Generation(https://arxiv.org/abs/2503.18942)
Keywords: generation
Abstract: With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have extended scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the search process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising of all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: this https URL
摘要：随着增加培训数据，模型大小和计算成本的规模能力，视频生成在数字创建方面取得了令人印象深刻的结果，使用户能够在各个领域表达创造力。最近，大型语言模型（LLMS）的研究人员将缩放扩展到测试时间，通过使用更多的推理时间计算可以显着提高LLM性能。我们没有通过昂贵的培训成本来扩展视频基础模型，而是探索视频生成中的测试时间扩展（TTS）的力量，旨在回答以下问题：如果允许视频生成模型使用非平凡的推理时间计算，则可以在具有挑战性文本提示的情况下提高生成质量。在这项工作中，我们将视频生成的测试时间缩放为搜索问题，以采样从高斯噪声空间到目标视频分布的更好轨迹。具体来说，我们使用测试时间验证器构建搜索空间，以提供反馈和启发式算法来指导搜索过程。给定文本提示，我们首先通过在推理时增加噪声候选者来探索直观的线性搜索策略。由于全步授予所有帧同时需要大量的测试时间计算成本，因此我们进一步设计了一种更有效的TTS方法，用于视频生成，称为框架树（TOF），以自动性的方式自适应地扩展和修剪视频分支。关于文本条件的视频生成基准测试的广泛实验表明，增加测试时间的计算始终导致视频质量的显着改善。项目页面：此HTTPS URL
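
The linear search strategy mentioned above amounts to best-of-N selection over initial noise candidates. The sketch below is a minimal illustration in which `generate` and `verifier` are hypothetical callables supplied by the caller (a video sampler seeded by an integer and a scoring function); neither signature comes from the paper, and the more efficient Tree-of-Frames method is not shown.

```python
import random

def linear_search(generate, verifier, prompt, num_candidates, rng=None):
    """Sample several noise seeds, generate a video for each, keep the best one.

    generate(prompt, seed) -> video   (hypothetical sampler interface)
    verifier(prompt, video) -> float  (higher means better; hypothetical)
    """
    rng = rng or random.Random()
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        seed = rng.getrandbits(32)          # one initial-noise candidate
        video = generate(prompt, seed)      # full denoising run for this seed
        score = verifier(prompt, video)     # test-time feedback
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

Cost grows linearly with `num_candidates`, because every candidate requires a complete denoising run, which is the overhead that motivates a more selective search.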

Title: Aether: Geometric-Aware Unified World Modeling

Authors: Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Tong He
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.18945
Pdf URL: https://arxiv.org/pdf/2503.18945
Copy Paste: [[2503.18945]] Aether: Geometric-Aware Unified World Modeling(https://arxiv.org/abs/2503.18945)
Keywords: generation, generative
Abstract: The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
摘要：几何重建和生成建模的整合仍然是开发能够像人类样空间推理的AI系统的关键挑战。本文提出了一个统一的框架，它是通过共同优化三个核心功能来实现世界模型中几何学推理的统一框架：（1）4D动态重建，（2）动作条件的视频预测和（3）目标条件有的视觉计划。通过任务间隔的特征学习，以太可以在重建，预测和计划目标中实现协同知识共享。在视频生成模型的基础上，我们的框架表现出了前所未有的合成对真实的概括，尽管在培训过程中从未观察到现实世界中的数据。此外，由于其内在的几何建模，我们的方法在操作和重建任务中都实现了零拍的概括。值得注意的是，即使没有现实世界数据，其重建性能也远远超过了域特异性模型的性能。此外，以太利用几何形状的动作空间将预测无缝地转化为动作，从而实现了有效的自主轨迹计划。我们希望我们的工作激发社区探索在物理上可约合的世界建模及其应用方面的新领域。

Title: Equivariant Image Modeling

Authors: Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.18948
Pdf URL: https://arxiv.org/pdf/2503.18948
Copy Paste: [[2503.18948]] Equivariant Image Modeling(https://arxiv.org/abs/2503.18948)
Keywords: generation, generative
Abstract: Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at this https URL.
摘要：当前的生成模型（例如自回归和扩散方法）将高维数据分布学习分解为一系列简单的子任务。但是，在这些子任务的联合优化期间出现了固有的冲突，现有解决方案无法解决此类冲突而不会牺牲效率或可扩展性。我们提出了一个新型的模棱两可的图像建模框架，该框架固有地通过利用自然视觉信号的翻译不变性来固有地对准子任务。我们的方法介绍了（1）列的综合化，从而增强了沿水平轴的翻译对称性，以及（2）窗口的因果关注，从而实现跨位置的一致上下文关系。通过256x256分辨率评估了类调节的成像网，我们的方法可实现与最先进的AR模型相当的性能，同时使用较少的计算资源。系统分析表明，增强的均值减少了任务之间的冲突，显着改善了零击的概括并实现了超长的图像综合。这项工作为生成建模中的任务一致分解建立了第一个框架，从而提供了对有效参数共享和无冲突优化的见解。代码和模型可在此HTTPS URL上公开可用。
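
Windowed causal attention can be illustrated with a boolean mask in which each query attends only to itself and a fixed number of preceding positions. This is a minimal sketch: the function name is hypothetical, and measuring the window in individual tokens (rather than, for example, in image columns) is an assumption made for simplicity.

```python
def windowed_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Position i sees positions i - window + 1 through i (clipped at 0),
    so attention is both causal and limited to a local window.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Example with seq_len=4, window=2:
# row 0: [True,  False, False, False]
# row 1: [True,  True,  False, False]
# row 2: [False, True,  True,  False]
# row 3: [False, False, True,  True ]
```

Away from the start of the sequence, every position sees the same pattern of relative offsets, which is one way to read the abstract's phrase about consistent contextual relationships across positions.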