2025-03-17

Title: Text-to-3D Generation using Jensen-Shannon Score Distillation

Authors: Khoi Do, Binh-Son Hua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10660
Pdf URL: https://arxiv.org/pdf/2503.10660
Copy Paste: [[2503.10660]] Text-to-3D Generation using Jensen-Shannon Score Distillation(https://arxiv.org/abs/2503.10660)
Keywords: generation, generative
Abstract: Score distillation sampling is an effective technique to generate 3D models from text prompts, utilizing pre-trained large-scale text-to-image diffusion models as guidance. However, the produced 3D assets tend to be over-saturated and over-smoothed, with limited diversity. These issues result from a reverse Kullback-Leibler (KL) divergence objective, which makes the optimization unstable and results in mode-seeking behavior. In this paper, we derive a bounded score distillation objective based on Jensen-Shannon divergence (JSD), which stabilizes the optimization process and produces high-quality 3D generation. JSD can match the generated and target distributions well, thereby mitigating mode seeking. We provide a practical implementation of JSD by utilizing the theory of generative adversarial networks to define an approximate objective function for the generator, assuming the discriminator is well trained. By assuming the discriminator follows a log-odds classifier, we propose a minority sampling algorithm to estimate the gradients of our proposed objective, providing a practical implementation for JSD. We conduct both theoretical and empirical studies to validate our method. Experimental results on T3Bench demonstrate that our method can produce high-quality and diversified 3D assets.
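To illustrate why the abstract calls the JSD objective "bounded", the following minimal sketch compares reverse KL and JSD on toy discrete distributions. It is not the paper's score distillation objective; the function names and the toy distributions are assumptions made for illustration only.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Returns infinity when q has zero mass somewhere p has positive mass.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy example (hypothetical): the generated distribution puts all of its
# mass where the target has none.
target = [0.5, 0.5, 0.0]
generated = [0.0, 0.0, 1.0]

reverse_kl = kl(generated, target)  # unbounded for disjoint supports
js = jsd(generated, target)         # never exceeds ln 2
```

With natural logarithms, JSD is at most ln 2 for any pair of distributions, whereas the reverse KL term diverges when the supports do not overlap.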

Title: VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

Authors: Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10678
Pdf URL: https://arxiv.org/pdf/2503.10678
Copy Paste: [[2503.10678]] VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion(https://arxiv.org/abs/2503.10678)
Keywords: generation
Abstract: We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at this https URL.

Title: Context-guided Responsible Data Augmentation with Diffusion Models

Authors: Khawar Islam, Naveed Akhtar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10687
Pdf URL: https://arxiv.org/pdf/2503.10687
Copy Paste: [[2503.10687]] Context-guided Responsible Data Augmentation with Diffusion Models(https://arxiv.org/abs/2503.10687)
Keywords: generation, generative
Abstract: Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring the reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K, Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain for top-1 accuracy over the state-of-the-art methods, while showing comparable computational overhead. Our code is publicly available at this https URL.
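The abstract does not spell out the filtration rule, so the following is only a sketch of one plausible reading: keep a generated sample when the cosine similarity between its embedding and the embedding of the original sample clears a hard threshold. The toy 2D vectors stand in for CLIP embeddings, and the function names and the threshold of 0.8 are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hard_cosine_filter(reference, candidates, threshold=0.8):
    """Keep candidates whose similarity to the reference is >= threshold.

    The threshold value is an assumption, not a value from the paper.
    """
    return [c for c in candidates if cosine_similarity(reference, c) >= threshold]

# Toy embeddings (hypothetical) in place of real CLIP features.
reference = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
kept = hard_cosine_filter(reference, candidates)
```

In this toy setup only the first candidate survives, since the other two fall below the threshold.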

Title: Neighboring Autoregressive Modeling for Efficient Visual Generation

Authors: Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10696
Pdf URL: https://arxiv.org/pdf/2503.10696
Copy Paste: [[2503.10696]] Neighboring Autoregressive Modeling for Efficient Visual Generation(https://arxiv.org/abs/2503.10696)
Keywords: generation
Abstract: Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet $256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluating on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4 of the training data. Code is available at this https URL.
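The near-to-far decoding order described in the abstract can be sketched for a 2D token grid as follows. This is a minimal illustration rather than the authors' implementation: the function name, the choice of the top-left token as the start, and the grouping of all tokens at the same Manhattan distance into one parallel step are assumptions.

```python
from collections import defaultdict

def neighbor_decoding_schedule(height, width, start=(0, 0)):
    """Group grid positions by Manhattan distance from the start token.

    Each inner list holds the positions assumed to be decoded together
    in one forward step, ordered from nearest to farthest.
    """
    groups = defaultdict(list)
    for row in range(height):
        for col in range(width):
            distance = abs(row - start[0]) + abs(col - start[1])
            groups[distance].append((row, col))
    return [groups[d] for d in sorted(groups)]

schedule = neighbor_decoding_schedule(4, 4)
num_steps = len(schedule)                          # H + W - 1 steps
num_tokens = sum(len(step) for step in schedule)   # H * W tokens
```

Under these assumptions a 4 x 4 grid needs 7 steps instead of the 16 steps of raster-order decoding, which is the kind of step reduction the abstract attributes to parallel neighbor prediction.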

Title: Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion

Authors: Kaifeng Zou, Xiaoyi Feng, Peng Wang, Tao Huang, Zizhou Huang, Zhang Haihang, Yuntao Zou, Dagang Li
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10697
Pdf URL: https://arxiv.org/pdf/2503.10697
Copy Paste: [[2503.10697]] Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion(https://arxiv.org/abs/2503.10697)
Keywords: generation, generative
Abstract: Generative models are widely used in visual content creation. However, current text-to-image models often face challenges in practical applications (such as textile pattern design and meme generation) due to the presence of unwanted elements that are difficult to separate with existing methods. Meanwhile, subject-reference generation has emerged as a key research trend, highlighting the need for techniques that can produce clean, high-quality subject images while effectively removing extraneous components. To address this challenge, we introduce a framework for reliable subject-centric image generation. In this work, we propose an entropy-based feature-weighted fusion method to merge the informative cross-attention features obtained from each sampling step of the pretrained text-to-image model FLUX, enabling precise mask prediction and subject-centric generation. Additionally, we have developed an agent framework based on Large Language Models (LLMs) that translates users' casual inputs into more descriptive prompts, leading to highly detailed image generation. Simultaneously, the agents extract primary elements of prompts to guide the entropy-based feature fusion, ensuring focused primary element generation without extraneous components. Experimental results and user studies demonstrate that our method generates high-quality subject-centric images and outperforms existing methods and other possible pipelines, highlighting the effectiveness of our approach.

Title: TA-V2A: Textually Assisted Video-to-Audio Generation

Authors: Yuhuan You, Xihong Wu, Tianshu Qu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10700
Pdf URL: https://arxiv.org/pdf/2503.10700
Copy Paste: [[2503.10700]] TA-V2A: Textually Assisted Video-to-Audio Generation(https://arxiv.org/abs/2503.10700)
Keywords: generation
Abstract: As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.

Title: Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models

Authors: Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10718
Pdf URL: https://arxiv.org/pdf/2503.10718
Copy Paste: [[2503.10718]] Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models(https://arxiv.org/abs/2503.10718)
Keywords: generative
Abstract: With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles the detection of AI-generated images and the identification of their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we leverage EfficientNet-B0 as the backbone and feed it RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and an SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found at this https URL.

Title: Long-Video Audio Synthesis with Multi-Agent Collaboration

Authors: Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10719
Pdf URL: https://arxiv.org/pdf/2503.10719
Copy Paste: [[2503.10719]] Long-Video Audio Synthesis with Multi-Agent Collaboration(https://arxiv.org/abs/2503.10719)
Keywords: generation
Abstract: Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel in short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps including scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods.

Title: Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration

Authors: Emily C. Ehrhardt, Hanno Gottschalk, Tobias J. Riedlinger
Subjects: cs.LG, math.CA, math.NA, math.PR
Abstract URL: https://arxiv.org/abs/2503.10729
Pdf URL: https://arxiv.org/pdf/2503.10729
Copy Paste: [[2503.10729]] Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration(https://arxiv.org/abs/2503.10729)
Keywords: generative
Abstract: NeuralODE is one example of generative machine learning based on the push forward of a simple source measure with a bijective mapping, which in the case of NeuralODE is given by the flow of an ordinary differential equation. Using Liouville's formula, the log-density of the push forward measure is easy to compute, and thus NeuralODE can be trained based on the maximum likelihood method such that the Kullback-Leibler divergence between the push forward through the flow map and the target measure generating the data becomes small. In this work, we give a detailed account of the consistency of maximum likelihood based empirical risk minimization for a generic class of target measures. In contrast to prior work, we not only consider the statistical learning theory, but also give a detailed numerical analysis of the NeuralODE algorithm based on 2nd order Runge-Kutta (RK) time integration. Using the universal approximation theory for deep ReQU networks, the stability and convergence rates for the RK scheme, as well as metric entropy and concentration inequalities, we are able to prove that NeuralODE is a probably approximately correct (PAC) learning algorithm.
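As a small illustration of 2nd order Runge-Kutta time integration, the sketch below integrates a scalar ODE with the explicit midpoint rule and checks the expected second-order error decay. The abstract does not say which 2nd order RK variant is analyzed, so the midpoint rule, the test equation dx/dt = -x, and all function names are assumptions.

```python
import math

def rk2_midpoint(f, x0, t0, t1, num_steps):
    """Integrate dx/dt = f(t, x) from t0 to t1 with the explicit midpoint rule."""
    h = (t1 - t0) / num_steps
    t, x = t0, x0
    for _ in range(num_steps):
        k1 = f(t, x)                          # slope at the start of the step
        k2 = f(t + h / 2, x + h / 2 * k1)     # slope at the midpoint
        x = x + h * k2
        t = t + h
    return x

def decay(t, x):
    """Right-hand side of the test equation dx/dt = -x."""
    return -x

exact = math.exp(-1.0)  # exact solution x(1) for x(0) = 1
error_coarse = abs(rk2_midpoint(decay, 1.0, 0.0, 1.0, 10) - exact)
error_fine = abs(rk2_midpoint(decay, 1.0, 0.0, 1.0, 20) - exact)
ratio = error_coarse / error_fine  # close to 4 for a second-order scheme
```

Halving the step size reduces the global error by roughly a factor of four, which is the behavior a convergence-rate analysis of a second-order scheme formalizes.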

Title: Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images

Authors: Md Mamunur Rahaman, Ewan K. A. Millar, Erik Meijering
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10731
Pdf URL: https://arxiv.org/pdf/2503.10731
Copy Paste: [[2503.10731]] Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images(https://arxiv.org/abs/2503.10731)
Keywords: generation
Abstract: Zero-shot learning (ZSL) holds tremendous potential for histopathology image analysis by enabling models to generalize to unseen classes without extensive labeled data. Recent advancements in vision-language models (VLMs) have expanded the capabilities of ZSL, allowing models to perform tasks without task-specific fine-tuning. However, applying VLMs to histopathology presents considerable challenges due to the complexity of histopathological imagery and the nuanced nature of diagnostic tasks. In this paper, we propose a novel framework called Multi-Resolution Prompt-guided Hybrid Embedding (MR-PHE) to address these challenges in zero-shot histopathology image classification. MR-PHE leverages multiresolution patch extraction to mimic the diagnostic workflow of pathologists, capturing both fine-grained cellular details and broader tissue structures critical for accurate diagnosis. We introduce a hybrid embedding strategy that integrates global image embeddings with weighted patch embeddings, effectively combining local and global contextual information. Additionally, we develop a comprehensive prompt generation and selection framework, enriching class descriptions with domain-specific synonyms and clinically relevant features to enhance semantic understanding. A similarity-based patch weighting mechanism assigns attention-like weights to patches based on their relevance to class embeddings, emphasizing diagnostically important regions during classification. Our approach utilizes a pretrained VLM, CONCH, for ZSL without requiring domain-specific fine-tuning, offering scalability and reducing dependence on large annotated datasets. Experimental results demonstrate that MR-PHE not only significantly improves zero-shot classification performance on histopathology datasets but also often surpasses fully supervised models.

Title: Visual Polarization Measurement Using Counterfactual Image Generation

Authors: Mohammad Mosaffa, Omid Rafieian, Hema Yoganarasimhan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10738
Pdf URL: https://arxiv.org/pdf/2503.10738
Copy Paste: [[2503.10738]] Visual Polarization Measurement Using Counterfactual Image Generation(https://arxiv.org/abs/2503.10738)
Keywords: generation, generative
Abstract: Political polarization is a significant issue in American politics, influencing public discourse, policy, and consumer behavior. While studies on polarization in news media have extensively focused on verbal content, non-verbal elements, particularly visual content, have received less attention due to the complexity and high dimensionality of image data. Traditional descriptive approaches often rely on feature extraction from images, leading to biased polarization estimates due to information loss. In this paper, we introduce the Polarization Measurement using Counterfactual Image Generation (PMCIG) method, which combines economic theory with generative models and multi-modal deep learning to fully utilize the richness of image data and provide a theoretically grounded measure of polarization in visual content. Applying this framework to a decade-long dataset featuring 30 prominent politicians across 20 major news outlets, we identify significant polarization in visual content, with notable variations across outlets and politicians. At the news outlet level, we observe significant heterogeneity in visual slant. Outlets such as Daily Mail, Fox News, and Newsmax tend to favor Republican politicians in their visual content, while The Washington Post, USA Today, and The New York Times exhibit a slant in favor of Democratic politicians. At the politician level, our results reveal substantial variation in polarized coverage, with Donald Trump and Barack Obama among the most polarizing figures, while Joe Manchin and Susan Collins are among the least. Finally, we conduct a series of validation tests demonstrating the consistency of our proposed measures with external measures of media slant that rely on non-image-based sources.

Title: FlowTok: Flowing Seamlessly Across Text and Image Tokens

Authors: Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10772
Pdf URL: https://arxiv.org/pdf/2503.10772
Copy Paste: [[2503.10772]] FlowTok: Flowing Seamlessly Across Text and Image Tokens(https://arxiv.org/abs/2503.10772)
Keywords: generation
Abstract: Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm: directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds, all while delivering performance comparable to state-of-the-art models. Code will be available at this https URL.

Title: Large-scale Pre-training for Grounded Video Caption Generation

Authors: Evangelos Kazakos, Cordelia Schmid, Josef Sivic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10781
Pdf URL: https://arxiv.org/pdf/2503.10781
Copy Paste: [[2503.10781]] Large-scale Pre-training for Grounded Video Caption Generation(https://arxiv.org/abs/2503.10781)
Keywords: generation
Abstract: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates captions grounded with bounding boxes across individual frames into temporally dense and consistent bounding box annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce a new dataset, called iGround, of 3500 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset compared to a number of baselines, as well as on the VidSTG and ActivityNet-Entities datasets. We perform extensive ablations that demonstrate the importance of pre-training using our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model.

Title: Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification

Authors: Nathaniel Lesperance, Sujeevan Ratnasingham, Graham W. Taylor
Subjects: cs.CV, cs.AI, cs.IR, cs.LG, q-bio.PE
Abstract URL: https://arxiv.org/abs/2503.10886
Pdf URL: https://arxiv.org/pdf/2503.10886
Copy Paste: [[2503.10886]] Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification(https://arxiv.org/abs/2503.10886)
Keywords: generation
Abstract: In the context of pressing climate change challenges and the significant biodiversity loss among arthropods, automated taxonomic classification from organismal images is a subject of intense research. However, traditional AI pipelines based on deep neural visual architectures such as CNNs or ViTs face limitations such as degraded performance on the long-tail of classes and the inability to reason about their predictions. We integrate image captioning and retrieval-augmented generation (RAG) with large language models (LLMs) to enhance biodiversity monitoring, showing particular promise for characterizing rare and unknown arthropod species. While a naive Vision-Language Model (VLM) excels in classifying images of common species, the RAG model enables classification of rarer taxa by matching explicit textual descriptions of taxonomic features to contextual biodiversity text data from external sources. The RAG model shows promise in reducing overconfidence and enhancing accuracy relative to naive LLMs, suggesting its viability in capturing the nuances of taxonomic hierarchy, particularly at the challenging family and genus levels. Our findings highlight the potential for modern vision-language AI pipelines to support biodiversity conservation initiatives, emphasizing the role of comprehensive data curation and collaboration with citizen science platforms to improve species identification, unknown species characterization and ultimately inform conservation strategies.

Title: Memory-Efficient 3D High-Resolution Medical Image Synthesis Using CRF-Guided GANs

Authors: Mahshid Shiri, Alessandro Bruno, Daniele Loiacono
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10899
Pdf URL: https://arxiv.org/pdf/2503.10899
Copy Paste: [[2503.10899]] Memory-Efficient 3D High-Resolution Medical Image Synthesis Using CRF-Guided GANs(https://arxiv.org/abs/2503.10899)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have many potential medical imaging applications. Due to the limited memory of Graphics Processing Units (GPUs), most current 3D GAN models are trained on low-resolution medical images; these models cannot scale to high resolution or are susceptible to patchy artifacts. In this work, we propose a novel end-to-end GAN architecture that uses a Conditional Random Field (CRF) to model dependencies so that it can generate consistent 3D medical images without excessive memory use. To achieve this, the generator is divided into two parts during training: the first part produces an intermediate representation, and a CRF is applied to this intermediate representation to capture correlations. The second part of the generator produces a random sub-volume of the image using a subset of the intermediate representation. This structure has two advantages: first, the correlations are modeled using the features that the generator is trying to optimize; second, the generator can generate full high-resolution images during inference. Experiments on lung CTs and brain MRIs show that our architecture outperforms the state of the art while having lower memory usage and less complexity.
摘要：生成对抗网络（GAN）具有许多潜在的医学成像应用。由于图形处理单元（GPU）的记忆有限，因此对当前的3D GAN模型进行了训练，因此在低分辨率的医学图像上训练，这些模型无法扩展到高分辨率或容易受到斑驳的人工制品的影响。在这项工作中，我们提出了一种端到端的新型GAN体系结构，该结构使用条件随机字段（CRF）来建模依赖项，以便它可以在不利用内存的情况下生成一致的3D医疗图像。为了实现此目的，在训练过程中将发电机分为两个部分，第一部分会产生中间表示，并且CRF应用于此中间表示形式以捕获相关性。发电机的第二部分使用中间表示的子集生成图像的随机子体积。该结构具有两个优点：首先，相关性是通过使用生成器试图优化的功能来建模的。其次，发电机可以在推理过程中生成完整的高分辨率图像。关于肺CTS和大脑MRIS的实验表明，我们的体系结构在记忆使用较低且复杂性较低的同时优于最先进的实验。

Title: OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models

Authors: Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10959
Pdf URL: https://arxiv.org/pdf/2503.10959
Copy Paste: [[2503.10959]] OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models(https://arxiv.org/abs/2503.10959)
Keywords: generative
Abstract: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capturing of long-range interactions and lead to semantically weak synthetic data, and (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen, which generates semantically rich and meaningful synthetic data by applying contrastive learning on patch-level VMM features generated through neighborhood interactions in the latent state space, and (2) OuroMamba-Quant, which employs mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a thresholding-based outlier channel selection strategy for activations that is updated at every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve a practical latency speedup of up to 2.36x. Code will be released soon.
摘要：我们提出OROMAMBA，这是基于视觉MAMBA模型（VMM）的第一个无数据培训后量化（DFQ）方法。我们确定了为VMM启用DFQ的两个关键挑战，（1）VMM的复发状态转换限制了捕获长期相互作用的捕获，并导致语义上弱的合成数据，（2）VMM激活表现出动态异常的跨时间变化，从而使现有的静态PTQ技术构成现有的静态PTQ技术。为了应对这些挑战，Oromamba提出了一个两阶段的框架：（1）Oromamba-Gen生成语义上丰富且有意义的合成数据。它应用于通过潜在状态空间中的邻域相互作用产生的斑块级别的VMM特征，（2）Youromamba-pricant在推理过程中使用轻质动态异常检测使用混合精液量化。具体而言，我们提出了一个基于阈值的离群频道选择策略，以进行每个时间阶段的激活。跨视觉和生成任务进行的广泛实验表明，我们的无数据Oeromamba超过了现有的数据驱动的PTQ技术，从而在不同的量化设置中实现了最新的性能。此外，我们实施有效的GPU内核，以实现高达2.36倍的实际延迟速度。代码将很快发布。
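Illustration (not from the paper): the OuroMamba abstract describes a thresholding-based outlier channel selection that is refreshed at every time-step, combined with mixed-precision quantization. The sketch below shows one plausible reading of that idea in plain Python. The function names, the max-absolute-value criterion, and the 4-bit/8-bit split are all assumptions made for illustration; the abstract does not specify them.

```python
def select_outlier_channels(activations, threshold):
    """Return the indices of channels whose largest magnitude exceeds the threshold.

    `activations` is a list of non-empty channels for ONE time-step,
    each channel being a list of floats. (Hypothetical criterion.)
    """
    return {
        c for c, values in enumerate(activations)
        if max(abs(v) for v in values) > threshold
    }


def fake_quantize(values, bits):
    """Symmetric uniform quantize-then-dequantize of one channel."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def quantize_step(activations, threshold, low_bits=4, high_bits=8):
    """Re-detect outliers for this time-step, then quantize with mixed precision.

    Outlier channels get `high_bits`, all other channels get `low_bits`.
    """
    outliers = select_outlier_channels(activations, threshold)
    quantized = [
        fake_quantize(ch, high_bits if c in outliers else low_bits)
        for c, ch in enumerate(activations)
    ]
    return quantized, outliers


# The outlier set changes between time-steps, so it is recomputed each step.
step_1 = [[0.1, -0.2], [5.0, -3.0], [0.3, 0.05]]
step_2 = [[4.0, 0.1], [0.2, 0.1], [0.3, 0.0]]
q1, outliers_1 = quantize_step(step_1, threshold=1.0)
q2, outliers_2 = quantize_step(step_2, threshold=1.0)
```

Recomputing the set per step is what distinguishes this from a static calibration, where a fixed set of high-precision channels would be chosen once.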

Title: Comparative Analysis of Advanced AI-based Object Detection Models for Pavement Marking Quality Assessment during Daytime

Authors: Gian Antariksa, Rohir Chakraborty, Shriyank Somvanshi, Subasish Das, Mohammad Jalayer, Deep Rameshkumar Patel, David Mills
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11008
Pdf URL: https://arxiv.org/pdf/2503.11008
Copy Paste: [[2503.11008]] Comparative Analysis of Advanced AI-based Object Detection Models for Pavement Marking Quality Assessment during Daytime(https://arxiv.org/abs/2503.11008)
Keywords: quality assessment
Abstract: Visual object detection utilizing deep learning plays a vital role in computer vision and has extensive applications in transportation engineering. This paper focuses on detecting pavement marking quality during daytime using the You Only Look Once (YOLO) model, leveraging its advanced architectural features to enhance road safety through precise, real-time assessments. Utilizing image data from New Jersey, this study employed three YOLOv8 variants: YOLOv8m, YOLOv8n, and YOLOv8x. The models were evaluated based on their prediction accuracy for classifying pavement markings into good, moderate, and poor visibility categories. The results demonstrated that YOLOv8n provides the best balance between accuracy and computational efficiency, achieving the highest mean Average Precision (mAP) for objects with good visibility and demonstrating robust performance across various Intersection over Union (IoU) thresholds. This research enhances transportation safety by offering an automated and accurate method for evaluating the quality of pavement markings.
摘要：使用深度学习的视觉对象检测在计算机视觉中起着至关重要的作用，并且在运输工程中具有广泛的应用。本文着重于在白天使用You Gook（Yolo）模型来检测路面标记质量，从而利用其先进的建筑特征通过精确和实时评估来增强道路安全性。利用来自新泽西州的图像数据，本研究采用了三种Yolov8变体：Yolov8M，Yolov8n和Yolov8X。根据其预测准确性评估模型，以将路面标记分类为良好，中度和差的可见性类别。结果表明，Yolov8n提供了准确性和计算效率之间的最佳平衡，对于具有良好可见性的对象并证明了在Union（IOU）阈值的各个交叉点上的最高平均平均精度（MAP）。这项研究通过提供一种评估路面标记质量的自动化和准确的方法来增强运输安全性。
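Illustration (not from the paper): the abstract above reports performance across Intersection over Union (IoU) thresholds. As a reminder of what that metric measures, here is a minimal sketch of IoU for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name are assumptions chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) with x_min <= x_max, y_min <= y_max.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold, which is why mAP is reported per threshold.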

Title: Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data

Authors: Lilin Zhang, Chengpei Wu, Ning Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11032
Pdf URL: https://arxiv.org/pdf/2503.11032
Copy Paste: [[2503.11032]] Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data(https://arxiv.org/abs/2503.11032)
Keywords: generation
Abstract: Existing adversarial training (AT) methods often suffer from incomplete perturbation, meaning that not all non-robust features are perturbed when generating adversarial examples (AEs). This results in residual correlations between non-robust features and labels, leading to suboptimal learning of robust features. However, achieving complete perturbation, i.e., perturbing as many non-robust features as possible, is challenging due to the difficulty in distinguishing robust and non-robust features and the sparsity of labeled data. To address these challenges, we propose a novel approach called Weakly Supervised Contrastive Adversarial Training (WSCAT). WSCAT ensures complete perturbation for improved learning of robust features by disrupting correlations between non-robust features and labels through complete AE generation over partially labeled data, grounded in information theory. Extensive theoretical analysis and comprehensive experiments on widely adopted benchmarks validate the superiority of WSCAT.
摘要：现有的对抗训练（AT）方法通常会遭受不完全扰动的困扰，这意味着在生成对抗性示例（AES）时，并非所有非运动特征都会受到干扰。这会导致非舒适特征和标签之间的残留相关性，从而导致对健壮特征的次优学习。但是，由于难以区分鲁棒和非舒适的特征以及标记数据的稀疏性，因此实现了完全扰动，即扰动尽可能多的非稳定功能。为了应对这些挑战，我们提出了一种新颖的方法，称为弱监督的对比对抗训练（WSCAT）。 WSCAT通过在信息理论基于信息理论基础的部分标记的数据上，通过完全标记的数据来破坏非持续特征和标签之间的相关性来确保完全扰动，以改善对鲁棒特征的学习。广泛的理论分析和广泛采用基准的全面实验验证了WSCAT的优势。

Title: ACMo: Attribute Controllable Motion Generation

Authors: Mingjie Wei, Xuemei Xie, Guangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11038
Pdf URL: https://arxiv.org/pdf/2503.11038
Copy Paste: [[2503.11038]] ACMo: Attribute Controllable Motion Generation(https://arxiv.org/abs/2503.11038)
Keywords: generation
Abstract: Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and suffer from limited generalizability to unseen motions. This work introduces an Attribute Controllable Motion generation architecture that addresses these challenges by decoupling the conditions and controlling them separately. First, we explore the Attribute Diffusion Model to improve text-to-motion performance by decoupling text and motion learning, as the controllable model relies heavily on the pre-trained model. Then, we introduce the Motion Adapter to quickly fine-tune on previously unseen motion patterns. Its motion prompt inputs achieve multimodal text-to-motion generation that captures user-specified styles. Finally, we propose an LLM Planner to bridge the gap between unseen attributes and dataset-specific texts via local knowledge for user-friendly interaction. Our approach introduces motion prompts for stylized generation, enabling fine-grained and user-friendly attribute control while providing performance comparable to state-of-the-art methods. Project page: this https URL
摘要：诸如样式，细粒文本和轨迹之类的属性是描述运动的特定条件。但是，现有方法通常缺乏对运动属性的准确控制，并且无法普遍性地看不见动议。这项工作介绍了可控的运动生成体系结构，以通过将任何条件分开解决这些挑战，并分别控制它们。首先，我们探索了通过将可控模型严重依赖于预先训练的模型，通过将文本和运动学习探索了通过解次文本和运动学习来促进文本到动作性能的属性扩散模型。然后，我们介绍运动adpater以快速芬特（Finetune）以前看不见的运动模式。它的运动提示输入实现了捕获用户指定样式的多模式文本到动作生成。最后，我们提出了一个LLM计划者，以通过本地知识来弥合看不见的属性和数据集特定文本之间的差距，以实现用户友好的交互。我们的方法介绍了运动提示的功能，以进行风格化生成，从而实现精细粒度和用户友好的属性控制，同时提供与最新方法相当的性能。项目页面：此HTTPS URL

Title: InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences

Authors: Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy T. Feng, Caifeng Zou, Yu Sun, Nikola Kovachki, Zachary E. Ross, Katherine L. Bouman, Yisong Yue
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11043
Pdf URL: https://arxiv.org/pdf/2503.11043
Copy Paste: [[2503.11043]] InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences(https://arxiv.org/abs/2503.11043)
Keywords: restoration
Abstract: Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce InverseBench, a framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics. With InverseBench, we benchmark 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. To facilitate further research and development, we open-source the codebase, along with datasets and pre-trained models, at this https URL.
摘要：插入式扩散先验（PNPDP）已成为解决反问题的有前途的研究方向。但是，当前的研究主要集中于自然图像恢复，使这些算法的性能在科学的反问题中在很大程度上没有探索。为了解决这一差距，我们介绍\ textsc {inverseBench}，该框架评估了跨五个不同的科学反问题的扩散模型。这些问题带来了与现有基准不同的独特结构挑战，这些挑战是由光学层析成像，医学成像，黑洞成像，地震学和流体动力学等关键科学应用引起的。使用\ textsc {inverseBench}，我们基准了14个反问题算法，这些算法使用插件扩散率对抗强，域特异性的基线，从而为现有算法的优势和缺点提供了宝贵的新见解。为了促进进一步的研究和开发，我们在此HTTPS URL上开放代码库以及数据集和预培训模型。

Title: PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing

Authors: Hasan Iqbal, Nazmul Karim, Umar Khalid, Azib Farooq, Zichun Zhong, Jing Hua, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11044
Pdf URL: https://arxiv.org/pdf/2503.11044
Copy Paste: [[2503.11044]] PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing(https://arxiv.org/abs/2503.11044)
Keywords: generative
Abstract: Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scenes, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by intuitively controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g., style transfer, multi-attribute editing, object removal, and local editing), we show the effectiveness of our proposed method. Experimental results demonstrate that our proposed method outperforms state-of-the-art 4D editing methods on diverse benchmarks.
摘要：指导引导的生成模型，尤其是那些使用文本形象（T2I）和文本对视频（T2V）扩散框架的模型，近年来已经提出了内容编辑领域。为了将这些功能扩展到4D场景，我们引入了一个用于4D编辑（PSF-4D）的渐进抽样框架，该框架通过直观地控制向前扩散期间的噪声初始化，从而确保时间和多视图一致性。对于时间连贯性，我们设计了一个相关的高斯噪声结构，该结构会随着时间的推移链接，从而使每个帧都可以有意义地依赖于先前的帧。此外，为了确保跨视图的空间一致性，我们实施了一个跨视图噪声模型，该模型使用共享和独立的噪声组件来平衡不同视图之间的共同点和不同的细节。为了进一步增强空间连贯性，PSF-4D结合了视图一致的迭代精致，将视图感知信息嵌入到DeNoising过程中，以确保跨帧和视图的对齐编辑。我们的方法使高质量的4D编辑无需依赖外部模型，从而解决了以前方法中的关键挑战。通过对多个基准和多个编辑方面的广泛评估（例如样式转移，多属性编辑，对象删除，本地编辑等），我们显示了我们提出的方法的有效性。实验结果表明，我们提出的方法在不同基准中的最先进的4D编辑方法优于最先进的方法。
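Illustration (not from the paper): the PSF-4D abstract mentions a correlated Gaussian noise structure that links frames over time, and a cross-view noise model built from shared and independent components. The sketch below shows one common way to build such noise while keeping each sample unit-variance: an AR(1) recursion across frames and a square-root mixing of shared and per-view noise. The function names and the parameters `rho` and `alpha` are assumptions for illustration; the paper's actual construction may differ.

```python
import math
import random


def temporally_correlated_noise(num_frames, dim, rho, rng):
    """AR(1) noise: frame t mixes frame t-1's noise with fresh Gaussian noise.

    With 0 <= rho <= 1, every frame stays marginally N(0, 1) per coordinate,
    and the correlation between consecutive frames is rho.
    """
    frames = [[rng.gauss(0.0, 1.0) for _ in range(dim)]]
    fresh_weight = math.sqrt(1.0 - rho * rho)
    for _ in range(1, num_frames):
        fresh = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        prev = frames[-1]
        frames.append([rho * p + fresh_weight * z for p, z in zip(prev, fresh)])
    return frames


def cross_view_noise(num_views, dim, alpha, rng):
    """Each view's noise = sqrt(alpha) * shared + sqrt(1 - alpha) * independent.

    alpha in [0, 1] sets how much of the noise variance is common to all views.
    """
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    views = []
    for _ in range(num_views):
        indep = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        views.append([
            math.sqrt(alpha) * s + math.sqrt(1.0 - alpha) * e
            for s, e in zip(shared, indep)
        ])
    return views


rng = random.Random(0)
frames = temporally_correlated_noise(num_frames=4, dim=8, rho=0.9, rng=rng)
views = cross_view_noise(num_views=3, dim=8, alpha=0.5, rng=rng)
```

Setting `rho` or `alpha` to 1 makes all frames or views share identical noise, while 0 makes them fully independent; intermediate values trade consistency against per-frame or per-view detail.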

Title: Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis

Authors: Ning-Yuan Georgia Liu, Flower Yang, Mohammad S. Jalali
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11046
Pdf URL: https://arxiv.org/pdf/2503.11046
Copy Paste: [[2503.11046]] Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis(https://arxiv.org/abs/2503.11046)
Keywords: generative
Abstract: Causal graphs are commonly used to understand and model complex systems. Researchers often construct these graphs from different perspectives, leading to significant variations for the same problem. Comparing causal graphs is, therefore, essential for evaluating assumptions, integrating insights, and resolving disagreements. The rise of AI tools has further amplified this need, as they are increasingly used to generate hypothesized causal graphs by synthesizing information from various sources such as prior research and community inputs, providing the potential for automating and scaling causal modeling for complex systems. Like humans, these tools also produce inconsistent results across platforms, versions, and iterations. Despite its importance, research on causal graph comparison remains scarce. Existing methods often focus solely on structural similarities, assuming identical variable names, and fail to capture nuanced semantic relationships, which are essential for causal graph comparison. We address these gaps by investigating methods for comparing causal graphs from both semantic and structural perspectives. First, we reviewed over 40 existing metrics and, based on predefined criteria, selected nine for evaluation from two threads of machine learning: four semantic similarity metrics and five learning graph kernels. We discuss the usability of these metrics in simple examples to illustrate their strengths and limitations. We then generated a synthetic dataset of 2,000 causal graphs using generative AI based on a reference diagram. Our findings reveal that each metric captures a different aspect of similarity, highlighting the need to use multiple metrics.
摘要：因果图通常用于理解和建模复杂系统。研究人员经常从不同的角度构建这些图形，从而导致相同问题的显着差异。因此，比较因果图对于评估假设，整合见解和解决分歧是必不可少的。人工智能工具的兴起进一步扩大了这一需求，因为它们越来越多地通过合成来自先前研究和社区投入等各种来源的信息来生成假设的因果图，从而为复杂系统自动化和扩展因果建模提供了潜力。与人类类似，这些工具还可以在平台，版本和迭代中产生不一致的结果。尽管它很重要，但有关因果图比较的研究仍然很少。现有方法通常仅集中在结构相似性上，假设具有相同的变量名称，并且无法捕获细微的语义关系，这对于因果图比较至关重要。我们通过研究从语义和结构角度比较因果图的方法来解决这些差距。首先，我们审查了40多个现有指标，并根据预定义的标准从两个机器学习的两个线程中选择了9个评估：四个语义相似性指标和五个学习图内的内核。我们在简单的示例中讨论了这些指标的可用性，以说明它们的优势和局限性。然后，我们使用基于参考图的生成AI生成了2,000个因果图的合成数据集。我们的发现表明，每个指标都捕获了相似性的不同方面，强调了使用多个指标的需求。

Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Authors: Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11056
Pdf URL: https://arxiv.org/pdf/2503.11056
Copy Paste: [[2503.11056]] Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization(https://arxiv.org/abs/2503.11056)
Keywords: generation, generative
Abstract: Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state of the art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distillation from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at this http URL.
摘要：由于流行的视觉生成框架（如VQGAN和潜在扩散模型）的出现，最先进的图像生成系统通常是两阶段的系统，在学习生成模型之前，它首先将视觉数据引起或压缩到较低维度的潜在空间。令牌训练通常遵循标准食谱，其中图像被压缩并重建，约束MSE，感知和对抗性损失。已经在先前的工作中提出了扩散自动编码器，以此作为学习端到端感知导向的图像压缩的一种方式，但尚未在Imagenet-1K重建的竞争任务上显示出最新的性能。我们提出了FlowMo，这是一种基于变压器的扩散自动编码器，可在多种压缩速率下以多种压缩速率实现新的最新图像令牌化，而无需使用卷发，对抗性损失，空间平衡的二维潜在代码，或者脱离其他象征器。我们的主要见解是，应将FlowMO训练分为模式匹配的训练阶段和寻求模式的训练后阶段。此外，我们进行了广泛的分析，并探索了FlowMo令牌的生成模型的训练。我们的代码和模型将在此HTTP URL上可用。

Title: Generative Modelling for Mathematical Discovery

Authors: Jordan S. Ellenberg, Cristofero S. Fraser-Taliente, Thomas R. Harvey, Karan Srivastava, Andrew V. Sutherland
Subjects: cs.LG, math.CO
Abstract URL: https://arxiv.org/abs/2503.11061
Pdf URL: https://arxiv.org/pdf/2503.11061
Copy Paste: [[2503.11061]] Generative Modelling for Mathematical Discovery(https://arxiv.org/abs/2503.11061)
Keywords: generative
Abstract: We present a new implementation of the LLM-driven genetic algorithm funsearch, whose aim is to generate examples of interest to mathematicians and which has already had some success in problems in extremal combinatorics. Our implementation is designed to be useful in practice for working mathematicians; it does not require expertise in machine learning or access to high-performance computing resources. Applying funsearch to a new problem involves modifying a small segment of Python code and selecting a large language model (LLM) from one of many third-party providers. We benchmarked our implementation on three different problems, obtaining metrics that may inform applications of funsearch to new problems. Our results demonstrate that funsearch successfully learns in a variety of combinatorial and number-theoretic settings, and in some contexts learns principles that generalize beyond the problem originally trained on.
摘要：我们提出了以LLM驱动的遗传算法{\ it FunSearch}的新实施，其目的是为数学家引起兴趣的例子，并且已经在极端组合学方面已经取得了一些成功。我们的实施旨在对工作数学家的实践有用。它不需要在机器学习或访问高性能计算资源方面的专业知识。将{\ it FunSearch}应用于新问题，涉及修改一小部分Python代码，并从许多第三方提供商之一中选择大型语言模型（LLM）。我们在三个不同的问题上进行了基准测试，获得了可能为新问题提供{\ it funsearch}应用程序的指标。我们的结果表明，{\ it FunSearch}在各种组合和数字理论设置中成功学习，并且在某些情况下，学习了超出最初受过训练的问题的原则。

Title: Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models

Authors: Zhenguang Liu, Chao Shuai, Shaojing Fan, Ziping Dong, Jinwu Hu, Zhongjie Ba, Kui Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11071
Pdf URL: https://arxiv.org/pdf/2503.11071
Copy Paste: [[2503.11071]] Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models(https://arxiv.org/abs/2503.11071)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in novel view synthesis, but their reliance on large, diverse, and often untraceable Web datasets has raised pressing concerns about image copyright protection. Current methods fall short in reliably identifying unauthorized image use, as they struggle to generalize across varied generation tasks and fail when the training dataset includes images from multiple sources with few identifiable (watermarked or poisoned) samples. In this paper, we present novel evidence that diffusion-generated images faithfully preserve the statistical properties of their training data, particularly reflected in their spectral features. Leveraging this insight, we introduce CoprGuard, a robust frequency domain watermarking framework to safeguard against unauthorized image usage in diffusion model training and fine-tuning. CoprGuard demonstrates remarkable effectiveness against a wide range of models, from naive diffusion models to sophisticated text-to-image models, and is robust even when watermarked images comprise a mere 1% of the training dataset. This robust and versatile approach empowers content owners to protect their intellectual property in the era of AI-driven image generation.
摘要：扩散模型在新型视图综合中取得了巨大的成功，但是它们对大型，多样化且通常无法追踪的Web数据集的依赖引起了人们对图像版权保护的紧急关注。当前的方法在可靠地识别未经授权的图像使用方面缺乏缺陷，因为它们很难跨越各种一代任务，并且当训练数据集包含来自多个源的图像时失败，这些图像很少可识别（水印或中毒）。在本文中，我们提供了新的证据，表明扩散生成的图像忠实地保留其训练数据的统计特性，尤其是在其光谱特征中反映的。利用这种见解，我们介绍了\ emph {coprguard}，这是一个强大的频域水印框架，以保护扩散模型训练和微调中未经授权的图像使用情况。 CoprGuard对广泛的模型表现出了出色的有效性，从幼稚的扩散模型到复杂的文本对图像模型，即使水印图像仅包含训练数据集的1 \％，也是强大的。这种强大而多功能的方法使内容所有者在AI驱动的图像产生时代保护其知识产权。

Title: Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

Authors: Hongyang Wei, Shuaizheng Liu, Chun Yuan, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11073
Pdf URL: https://arxiv.org/pdf/2503.11073
Copy Paste: [[2503.11073]] Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models(https://arxiv.org/abs/2503.11073)
Keywords: super-resolution, generative
Abstract: By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present an entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at this https URL.
摘要：通过利用预先训练的文本对图像扩散模型的生成先验，在现实世界图像超分辨率（Real-ISR）中取得了重大进展。但是，这些方法倾向于在复杂和/或严重退化的场景中产生不准确和不自然的重建，这主要是由于它们的感知有限和了解输入低质量图像的能力。为了解决这些局限性，我们首次提出了我们的知识，以适应预先训练的自动回归多模式模型，例如Lumina-MGPT，即纯粹的纯粹的ISR模型，即感知并理解输入低品质的图像，然后恢复其高质量的对手。具体而言，我们在Lumina-MGPT上实现了指令调整，以感知图像降解级别以及先前生成的图像令牌与下一代币之间的关系，通过生成图像语义描述来了解图像内容，从而通过生成图像来恢复图像，从而通过收集到的信息来生成高质量的图像令牌。此外，我们揭示了图像令牌熵反映图像结构并提出了基于熵的TOP-K采样策略，以优化推理过程中图像的局部结构。实验结果表明，纯净的图像内容在生成逼真的细节的同时，尤其是在具有多个对象的复杂场景中，展示了自动回归的多模式生成模型的潜力，以实现鲁棒的现实ISR。该模型和代码将在此HTTPS URL上可用。
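Illustration (not from the paper): the PURE abstract mentions an entropy-based Top-k sampling strategy but does not say how entropy is mapped to k. The sketch below shows one hypothetical mapping: compute the Shannon entropy of the next-token distribution, normalize it by its maximum, and interpolate linearly between a small and a large k. The function names, the linear rule, and the `k_min`/`k_max` defaults are assumptions made purely for illustration.

```python
import math
import random


def shannon_entropy(probs):
    """Entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_to_k(probs, k_min=1, k_max=8):
    """Pick k from the distribution's entropy (hypothetical linear rule).

    Low entropy (a confident model) gives a small k; high entropy gives a
    larger k. k never exceeds the vocabulary size.
    """
    k_max = min(k_max, len(probs))
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    ratio = min(1.0, shannon_entropy(probs) / max_entropy)
    return k_min + round((k_max - k_min) * ratio)


def sample_top_k(probs, k, rng):
    """Sample a token index from the k most probable tokens, renormalized."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
k_peaked = entropy_to_k(peaked)
k_flat = entropy_to_k(flat)
token = sample_top_k(flat, k_flat, rng)
```

Under this rule a peaked distribution is sampled almost greedily, while a flat one keeps more candidates in play.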

Title: Understanding Flatness in Generative Models: Its Role and Benefits

Authors: Taehwan Lee, Kyeongkook Seo, Jaejun Yoo, Sung Whan Yoon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11078
Pdf URL: https://arxiv.org/pdf/2503.11078
Copy Paste: [[2503.11078]] Understanding Flatness in Generative Models: Its Role and Benefits(https://arxiv.org/abs/2503.11078)
Keywords: generative
Abstract: Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias, where errors in noise estimation accumulate over iterations, and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
摘要：已知在监督学习中增强概括和鲁棒性的扁平最小值在生成模型中仍未探索。在这项工作中，我们系统地研究了损失表面扁平度在理论和经验上的生成模型中的作用，并特别关注扩散模型。我们建立了理论上的主张，即平坦的最小值提高了目标先验分布的扰动性的鲁棒性，从而导致诸如降低暴露偏见的益处 - 噪声估计的误差在迭代中积累了误差 - 显着提高了对模型量化的弹性，即使在强量量化限制下，也可以保留生成性能。我们进一步观察到，明确控制平坦度的清晰度最小化（SAM）有效地增强了扩散模型中的平坦度，而其他知名的方法（例如随机重量平均（SWA）和指数移动平均值（EMA）（EMA））通过Enemblobs的有效性较小。通过在CIFAR-10，LSUN TOWE和FFHQ上进行的广泛实验，我们证明了扩散模型中的平坦最小值确实可以改善生成性能，而且可以提高稳健性。
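Illustration (not from the paper): the abstract above singles out Sharpness-Aware Minimization (SAM) as the method that most effectively flattens diffusion-model minima. The sketch below shows the standard two-pass SAM update on a toy problem with an analytic gradient: first move to a nearby worst-case point along the normalized gradient, then apply the gradient measured there to the original weights. The toy loss and the names `sam_step` and `grad_fn` are assumptions for this example, not the paper's training code.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step on a list of weights.

    1. Compute the gradient g at w.
    2. Perturb to w_adv = w + rho * g / ||g|| (approximate worst case nearby).
    3. Update w using the gradient evaluated at w_adv.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w):
    """Gradient of the toy loss f(w) = sum(w_i ** 2)."""
    return [2.0 * x for x in w]


# From w = [1, 0] with rho = 0.1: w_adv = [1.1, 0], gradient there = [2.2, 0],
# so the updated weight is 1 - 0.1 * 2.2 = 0.78.
w_next = sam_step([1.0, 0.0], quadratic_grad, lr=0.1, rho=0.1)
```

Each SAM step costs two gradient evaluations, which is the price paid for controlling flatness explicitly rather than through weight averaging as in SWA or EMA.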

Title: Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation

Authors: He Zhang, Xinyi Fu, John M. Carroll
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.11096
Pdf URL: https://arxiv.org/pdf/2503.11096
Copy Paste: [[2503.11096]] Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation(https://arxiv.org/abs/2503.11096)
Keywords: generation
Abstract: Traditional image annotation tasks rely heavily on human effort for object selection and label assignment, making the process time-consuming and prone to decreased efficiency as annotators experience fatigue after extensive work. This paper introduces a novel framework that leverages the visual understanding capabilities of large multimodal models (LMMs), particularly GPT, to assist annotation workflows. In our proposed approach, human annotators focus on selecting objects via bounding boxes, while the LMM autonomously generates relevant labels. This human-AI collaborative framework enhances annotation efficiency by reducing the cognitive and time burden on human annotators. By analyzing the system's performance across various types of annotation tasks, we demonstrate its ability to generalize to tasks such as object recognition, scene description, and fine-grained categorization. Our proposed framework highlights the potential of this approach to redefine annotation workflows, offering a scalable and efficient solution for large-scale data labeling in computer vision. Finally, we discuss how integrating LMMs into the annotation pipeline can advance bidirectional human-AI alignment, as well as the challenges of alleviating the "endless annotation" burden in the face of information overload by shifting some of the work to AI.
摘要：传统的图像注释任务在很大程度上依赖于人类进行对象选择和标签分配的努力，从而使过程耗时并容易降低效率，因为注释者在大量工作后经历了疲劳。本文介绍了一个新颖的框架，该框架利用大型多模型模型（LMMS），尤其是GPT的视觉理解能力来协助注释工作流。在我们提出的方法中，人类注释者专注于通过边界框选择对象，而LMM自动生成相关标签。这个人类协作框架通过减少人类注释者的认知和时间负担来提高注释效率。通过在各种注释任务中分析系统的性能，我们证明了其将其推广到诸如对象识别，场景描述和细粒度分类等任务的能力。我们提出的框架强调了这种方法重新定义注释工作流的潜力，为计算机视觉中的大规模数据标记提供了可扩展有效的解决方案。最后，我们讨论如何将LMM集成到注释管道中，可以提高双向人类AI的一致性，以及通过将一些工作转移到AI中来减轻“无尽注释”负担的挑战。

Title: DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Authors: Hongbin Lin, Zilu Guo, Yifan Zhang, Shuaicheng Niu, Yafeng Li, Ruimao Zhang, Shuguang Cui, Zhen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11122
Pdf URL: https://arxiv.org/pdf/2503.11122
Copy Paste: [[2503.11122]] DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation(https://arxiv.org/abs/2503.11122)
Keywords: generation
Abstract: In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as the Out-of-Distribution (OOD) problem. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which requires generating diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations. It consists of two stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at this https URL.
摘要：在自动驾驶中，以视觉为中心的3D检测旨在从图像中识别3D对象。但是，高数据收集成本和不同的现实情况限制了培训数据的规模。一旦训练数据和测试数据之间的分布变化发生，现有方法通常会遭受性能降解（被称为分布（OOD）问题）。为了解决这个问题，可控制的文本对图像（T2I）扩散为训练数据增强提供了潜在的解决方案，这是用精确的3D对象几何形状生成多种OOD场景所必需的。然而，现有的可控T2I方法受培训数据规模有限或为保留所有带注释的3D对象而努力的限制。在本文中，我们提出了DraveGen，该方法旨在通过无训练的可控文本对图像扩散生成来提高3D探测器的鲁棒性。没有额外的扩散模型训练，DriveGen始终保留具有不同OOD世代的精确3D几何形状的对象，由2个阶段组成：1）自我构想提取：我们从经验上发现，自我注意事项特征是语义意识到的，但需要精确的3D对象区域选择。因此，我们通过布局提取精确的对象特征，以捕获称为自我型的3D对象几何形状。 2）原型引导的扩散：为了在各种OOD场景中保存对象，我们在DeNoising期间执行语义吸引的特征对齐和浅色特征对齐。广泛的实验证明了驱动器在改善3D检测方面的有效性。该代码可在此HTTPS URL上找到。

Title: MUSS: Multilevel Subset Selection for Relevance and Diversity

Authors: Vu Nguyen, Andrey Kan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11126
Pdf URL: https://arxiv.org/pdf/2503.11126
Copy Paste: [[2503.11126]] MUSS: Multilevel Subset Selection for Relevance and Diversity(https://arxiv.org/abs/2503.11126)
Keywords: generation
Abstract: The problem of relevant and diverse subset selection has a wide range of applications, including recommender systems and retrieval-augmented generation (RAG). For example, in recommender systems, one is interested in selecting relevant items while providing a diversified recommendation. The constrained subset selection problem is NP-hard, and popular approaches such as Maximum Marginal Relevance (MMR) are based on greedy selection. Many real-world applications involve large data, but the original MMR work did not consider distributed selection. This limitation was later addressed by a method called DGDS, which allows for a distributed setting using random data partitioning. Here, we exploit structure in the data to further improve both scalability and performance on the target application. We propose MUSS, a novel method that uses a multilevel approach to relevant and diverse selection. We provide a rigorous theoretical analysis and show that our method achieves a constant-factor approximation of the optimal objective. In a recommender system application, our method can achieve the same level of performance as baselines while being 4.5 to 20 times faster. Our method is also capable of outperforming baselines by up to 6 percentage points of RAG-based question answering accuracy.

Title: Direction-Aware Diagonal Autoregressive Image Generation

Authors: Yijia Xu, Jianzhong Ju, Jian Luan, Jinshi Cui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11129
Pdf URL: https://arxiv.org/pdf/2503.11129
Copy Paste: [[2503.11129]] Direction-Aware Diagonal Autoregressive Image Generation(https://arxiv.org/abs/2503.11129)
Keywords: generation
Abstract: The raster-ordered image token sequence exhibits a significant Euclidean distance between index-adjacent tokens at line breaks, making it unsuitable for autoregressive generation. To address this issue, this paper proposes the Direction-Aware Diagonal Autoregressive Image Generation (DAR) method, which generates image tokens following a diagonal scanning order. The proposed diagonal scanning order ensures that tokens with adjacent indices remain in close proximity while enabling causal attention to gather information from a broader range of directions. Additionally, two direction-aware modules, 4D-RoPE and direction embeddings, are introduced, enhancing the model's capability to handle frequent changes in generation direction. To leverage the representational capacity of the image tokenizer, we use its codebook as the image token embeddings. We propose models of varying scales, ranging from 485M to 2.0B. On the 256$\times$256 ImageNet benchmark, our DAR-XL (2.0B) outperforms all previous autoregressive image generators, achieving a state-of-the-art FID score of 1.37.
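The line-break problem described in the abstract can be made concrete with a small sketch that compares the largest spatial jump between index-adjacent tokens under raster order and under a diagonal order. The abstract does not spell out the exact diagonal scan, so the zigzag over anti-diagonals used here (reversing direction on alternate diagonals) is an assumption, as are all function names.

```python
import math


def raster_order(h, w):
    """Row-by-row scan of an h x w token grid."""
    return [(r, c) for r in range(h) for c in range(w)]


def diagonal_order(h, w):
    """Zigzag scan over anti-diagonals (r + c constant), an assumed variant."""
    order = []
    for d in range(h + w - 1):
        cells = [(r, d - r) for r in range(max(0, d - w + 1), min(h, d + 1))]
        if d % 2 == 0:
            cells.reverse()  # alternate the traversal direction
        order.extend(cells)
    return order


def max_adjacent_distance(order):
    """Largest Euclidean distance between consecutive positions in a scan."""
    return max(math.dist(a, b) for a, b in zip(order, order[1:]))


raster_jump = max_adjacent_distance(raster_order(4, 4))      # sqrt(10), at a line break
diagonal_jump = max_adjacent_distance(diagonal_order(4, 4))  # sqrt(2), within a diagonal
```

On a 4x4 grid the raster scan jumps from the end of one row to the start of the next, while the zigzag diagonal scan never moves farther than one diagonal neighbor. The price is that the generation direction flips at every diagonal, which is the situation the direction-aware modules in the abstract are meant to handle.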

Title: SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets

Authors: Hao Liu, Pengyu Guo, Siyuan Yang, Zeqing Jiang, Qinglei Hu, Dongyu Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11133
Pdf URL: https://arxiv.org/pdf/2503.11133
Copy Paste: [[2503.11133]] SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets(https://arxiv.org/abs/2503.11133)
Keywords: generation
Abstract: With the continuous advancement of human exploration into deep space, intelligent perception and high-precision segmentation technology for on-orbit multi-spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges to traditional segmentation methods. This paper proposes SpaceSeg, an innovative vision foundation model-based segmentation framework with four core technical innovations: First, the Multi-Scale Hierarchical Attention Refinement Decoder (MSHARD) achieves high-precision feature decoding through cross-resolution feature fusion via hierarchical attention. Second, the Multi-spacecraft Connected Component Analysis (MS-CCA) effectively resolves topological structure confusion in dense targets. Third, the Spatial Domain Adaptation Transform framework (SDAT) eliminates cross-domain disparities and resists spatial sensor perturbations through composite enhancement strategies. Finally, a custom Multi-Spacecraft Segmentation Task Loss Function is created to significantly improve segmentation robustness in deep space scenarios. To support algorithm validation, we construct the first multi-scale on-orbit multi-spacecraft semantic segmentation dataset SpaceES, which covers four types of spatial backgrounds and 17 typical spacecraft targets. In testing, SpaceSeg achieves state-of-the-art performance with 89.87% mIoU and 99.98% mAcc, surpassing existing best methods by 5.71 percentage points. The dataset and code are open-sourced at this https URL to provide critical technical support for next-generation space situational awareness systems.

Title: GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Authors: Zichen Tang, Yuan Yao, Miaomiao Cui, Liefeng Bo, Hongyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11143
Pdf URL: https://arxiv.org/pdf/2503.11143
Copy Paste: [[2503.11143]] GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior(https://arxiv.org/abs/2503.11143)
Keywords: generation
Abstract: Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it iteratively produces detail-enhanced results of the multi-view images from stage 1, ensuring 3D texture consistency across views via mutual attention and distance-guided attention fusion. A polished version of the 3D human can then be achieved by directly performing reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: this https URL.

Title: Multi-Stage Generative Upscaler: Reconstructing Football Broadcast Images via Diffusion Models

Authors: Luca Martini, Daniele Zolezzi, Saverio Iacono, Gianni Viardo Vercelli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11181
Pdf URL: https://arxiv.org/pdf/2503.11181
Copy Paste: [[2503.11181]] Multi-Stage Generative Upscaler: Reconstructing Football Broadcast Images via Diffusion Models(https://arxiv.org/abs/2503.11181)
Keywords: generative
Abstract: The reconstruction of low-resolution football broadcast images presents a significant challenge in sports broadcasting, where detailed visuals are essential for analysis and audience engagement. This study introduces a multi-stage generative upscaling framework leveraging Diffusion Models to enhance degraded images, transforming inputs as small as $64 \times 64$ pixels into high-fidelity $1024 \times 1024$ outputs. By integrating an image-to-image pipeline, ControlNet conditioning, and LoRA fine-tuning, our approach surpasses traditional upscaling methods in restoring intricate textures and domain-specific elements such as player details and jersey logos. The custom LoRA is trained on a custom football dataset, ensuring adaptability to sports broadcast needs. Experimental results demonstrate substantial improvements over conventional models, with ControlNet refining fine details and LoRA enhancing task-specific elements. These findings highlight the potential of diffusion-based image reconstruction in sports media, paving the way for future applications in automated video enhancement and real-time sports analytics.

Title: Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption

Authors: Du Chen, Tianhe Wu, Kede Ma, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11221
Pdf URL: https://arxiv.org/pdf/2503.11221
Copy Paste: [[2503.11221]] Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption(https://arxiv.org/abs/2503.11221)
Keywords: super-resolution, generative, quality assessment
Abstract: Full-reference image quality assessment (FR-IQA) generally assumes that reference images are of perfect quality. However, this assumption is flawed due to the sensor and optical limitations of modern imaging systems. Moreover, recent generative enhancement methods are capable of producing images of higher quality than their original. All of these challenge the effectiveness and applicability of current FR-IQA models. To relax the assumption of perfect reference image quality, we build a large-scale IQA database, namely DiffIQA, containing approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive Fidelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of a test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses standard FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on SRIQA-Bench re-affirm the advantages of A-FINE. The code and dataset are available at this https URL.

Title: Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Authors: Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11240
Pdf URL: https://arxiv.org/pdf/2503.11240
Copy Paste: [[2503.11240]] Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards(https://arxiv.org/abs/2503.11240)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: Backward progressive training and Branch-based sampling. First, backward progressive training focuses initially on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. Second, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
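The backward progressive schedule described in the abstract can be pictured as a sequence of training intervals that all end at the final denoising step and start progressively earlier. The sketch below is only one possible reading: the linear growth of the interval, the stage count, and the function name are assumptions, since the abstract does not state how quickly the interval is extended.

```python
def backward_progressive_intervals(num_timesteps, num_stages):
    """Return one (start, end) interval of denoising-step indices per stage.

    Steps are indexed in generation order: 0 is the first (noisiest) step and
    num_timesteps - 1 is the last. The end stays fixed at the final step while
    the start moves earlier, here by an assumed linear amount per stage.
    """
    intervals = []
    for stage in range(1, num_stages + 1):
        # Ceiling division so the last stage always covers every step.
        length = (num_timesteps * stage + num_stages - 1) // num_stages
        intervals.append((num_timesteps - length, num_timesteps))
    return intervals


# Hypothetical setting: 20 denoising steps trained in 4 stages.
schedule = backward_progressive_intervals(20, 4)
# [(15, 20), (10, 20), (5, 20), (0, 20)]
```

Early stages only update the policy on steps close to the final image, where the end-of-trajectory reward is most informative; later stages reuse that progress while opening up earlier steps.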

Title: Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

Authors: Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo, Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhisheng Guan, Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Yi Xiu, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11251
Pdf URL: https://arxiv.org/pdf/2503.11251
Copy Paste: [[2503.11251]] Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model(https://arxiv.org/abs/2503.11251)
Keywords: generation
Abstract: We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at this https URL.

Title: Federated Koopman-Reservoir Learning for Large-Scale Multivariate Time-Series Anomaly Detection

Authors: Long Tan Le, Tung-Anh Nguyen, Han Shu, Suranga Seneviratne, Choong Seon Hong, Nguyen H. Tran
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.11255
Pdf URL: https://arxiv.org/pdf/2503.11255
Copy Paste: [[2503.11255]] Federated Koopman-Reservoir Learning for Large-Scale Multivariate Time-Series Anomaly Detection(https://arxiv.org/abs/2503.11255)
Keywords: generation
Abstract: The proliferation of edge devices has dramatically increased the generation of multivariate time-series (MVTS) data, essential for applications from healthcare to smart cities. Such data streams, however, are vulnerable to anomalies that signal crucial problems like system failures or security incidents. Traditional MVTS anomaly detection methods, encompassing statistical and centralized machine learning approaches, struggle with the heterogeneity, variability, and privacy concerns of large-scale, distributed environments. In response, we introduce FedKO, a novel unsupervised Federated Learning framework that leverages the linear predictive capabilities of Koopman operator theory along with the dynamic adaptability of Reservoir Computing. This enables effective spatiotemporal processing and privacy preservation for MVTS data. FedKO is formulated as a bi-level optimization problem, utilizing a specific federated algorithm to explore a shared Reservoir-Koopman model across diverse datasets. Such a model is then deployable on edge devices for efficient detection of anomalies in local MVTS streams. Experimental results across various datasets showcase FedKO's superior performance against state-of-the-art methods in MVTS anomaly detection. Moreover, FedKO reduces communication size and memory usage by up to 8x and 2x, respectively, making it highly suitable for large-scale systems.

Title: Noise Synthesis for Low-Light Image Denoising with Diffusion Models

Authors: Liying Lu, Raphaël Achddou, Sabine Süsstrunk
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11262
Pdf URL: https://arxiv.org/pdf/2503.11262
Copy Paste: [[2503.11262]] Noise Synthesis for Low-Light Image Denoising with Diffusion Models(https://arxiv.org/abs/2503.11262)
Keywords: generation
Abstract: Low-light photography produces images with low signal-to-noise ratios due to limited photons. In such conditions, common approximations like the Gaussian noise model fall short, and many denoising techniques fail to remove noise effectively. Although deep-learning methods perform well, they require large datasets of paired images that are impractical to acquire. As a remedy, synthesizing realistic low-light noise has gained significant attention. In this paper, we investigate the ability of diffusion models to capture the complex distribution of low-light noise. We show that a naive application of conventional diffusion models is inadequate for this task and propose three key adaptations that enable high-precision noise generation without calibration or post-processing: a two-branch architecture to better model signal-dependent and signal-independent noise, the incorporation of positional information to capture fixed-pattern noise, and a tailored diffusion noise schedule. Consequently, our model enables the generation of large datasets for training low-light denoising networks, leading to state-of-the-art performance. Through comprehensive analysis, including statistical evaluation and noise decomposition, we provide deeper insights into the characteristics of the generated data.
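The split between signal-dependent and signal-independent noise mentioned in the abstract can be illustrated with the classical Poisson-Gaussian sensor model: photon shot noise scales with the signal, read noise does not. This is a textbook approximation for illustration only, not the diffusion-based synthesis the paper proposes, and the parameter names (`expected_photons`, `gain`, `read_std`) are assumptions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's method; adequate for the small photon counts of low light."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def noisy_pixel(expected_photons, gain, read_std, rng):
    """Simulate one pixel reading under a Poisson-Gaussian model."""
    # Signal-dependent part: photon arrivals are Poisson distributed.
    signal_with_shot_noise = gain * sample_poisson(expected_photons, rng)
    # Signal-independent part: additive Gaussian read noise.
    read_noise = rng.gauss(0.0, read_std)
    return signal_with_shot_noise + read_noise


rng = random.Random(0)
samples = [noisy_pixel(5.0, 2.0, 1.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Under this model the mean is `gain * expected_photons` and the variance is `gain**2 * expected_photons + read_std**2`, so the variance grows with the signal. Real sensors add effects such as fixed-pattern noise that this simple model does not capture, which is part of the motivation for learned noise generators.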

Title: CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy

Authors: Jonas Utz, Stefan Vocht, Anne Tjorven Buessen, Dennis Possart, Fabian Wagner, Mareike Thies, Mingxuan Gu, Stefan Uderhardt, Katharina Breininger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11266
Pdf URL: https://arxiv.org/pdf/2503.11266
Copy Paste: [[2503.11266]] CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy(https://arxiv.org/abs/2503.11266)
Keywords: generation, generative
Abstract: In recent years, numerous neural network architectures specifically designed for the instance segmentation of nuclei in microscopic images have been released. These models embed nuclei-specific priors to outperform generic architectures like U-Nets; however, they require large annotated datasets, which are often not available. Generative models (GANs, diffusion models) have been used to compensate for this by synthesizing training data. These two-stage approaches are computationally expensive, as first a generative model and then a segmentation model have to be trained. We propose CyclePose, a hybrid framework integrating synthetic data generation and segmentation training. CyclePose builds on a CycleGAN architecture, which allows unpaired translation between microscopy images and segmentation masks. We embed a segmentation model into CycleGAN and leverage a cycle consistency loss for self-supervision. Without annotated data, CyclePose outperforms other weakly or unsupervised methods on two public datasets. Code is available at this https URL.
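The cycle consistency idea that the abstract relies on for self-supervision can be sketched in a few lines: translate an image to a mask and back, translate a mask to an image and back, and penalize the round-trip error. The L1 form below follows the standard CycleGAN formulation; the exact loss used by CyclePose is not given in the abstract, and the two toy mapping functions stand in for neural networks purely for illustration.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(image, mask, to_mask, to_image):
    """Round-trip reconstruction error in both translation directions."""
    image_cycle = to_image(to_mask(image))  # image -> mask -> image
    mask_cycle = to_mask(to_image(mask))    # mask -> image -> mask
    return l1(image_cycle, image) + l1(mask_cycle, mask)


# Hypothetical stand-ins for the two generators, acting on flat pixel lists.
def double(values):
    return [2 * v for v in values]


def halve(values):
    return [v / 2 for v in values]


def halve_with_bias(values):
    return [v / 2 + 0.5 for v in values]


image = [0.5, 0.25, 1.0]
mask = [1.0, 0.0, 2.0]
perfect = cycle_consistency_loss(image, mask, double, halve)              # 0.0
imperfect = cycle_consistency_loss(image, mask, double, halve_with_bias)  # 1.5
```

The loss is zero only when the two mappings invert each other, which is what lets unpaired images and masks supervise one another without annotations.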

Title: OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values

Authors: Christelle Schneuwly Diaz, Duy-Thanh Vu, Julien Bodelet, Duy-Cat Can, Guillaume Blanc, Haiting Jiang, Lin Yao, Guiseppe Pantaleo, ADNI, Oliver Y. Chén
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.11282
Pdf URL: https://arxiv.org/pdf/2503.11282
Copy Paste: [[2503.11282]] OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values(https://arxiv.org/abs/2503.11282)
Keywords: generative
Abstract: Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease prediction; but they often contain missing values. Recent "deeper" machine learning approaches show promise in improving prediction accuracy, yet the biological relevance of these models needs to be further charted. Integrating missing data analysis, predictive modeling, multimodal data analysis, and explainable AI, we propose OPTIMUS, a predictive, modular, and explainable machine learning framework, to unveil the many-to-many predictive pathways between multimodal input data and multivariate disease outcomes amidst missing values. OPTIMUS first applies modality-specific imputation to uncover data from each modality while optimizing overall prediction accuracy. It then maps multimodal biomarkers to multivariate outcomes using machine-learning and extracts biomarkers respectively predictive of each outcome. Finally, OPTIMUS incorporates XAI to explain the identified multimodal biomarkers. Using data from 346 cognitively normal subjects, 608 persons with mild cognitive impairment, and 251 AD patients, OPTIMUS identifies neural and transcriptomic signatures that jointly but differentially predict multivariate outcomes related to executive function, language, memory, and visuospatial function. Our work demonstrates the potential of building a predictive and biologically explainable machine-learning framework to uncover multimodal biomarkers that capture disease profiles across varying cognitive landscapes. The results improve our understanding of the complex many-to-many pathways in AD.

Title: Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning

Authors: Lingyu Zhu, Xiangrui Zeng, Bolin Chen, Peilin Chen, Yung-Hui Li, Shiqi Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11321
Pdf URL: https://arxiv.org/pdf/2503.11321
Copy Paste: [[2503.11321]] Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning(https://arxiv.org/abs/2503.11321)
Keywords: generative
Abstract: By optimizing the rate-distortion-realism trade-off, generative image compression approaches produce detailed, realistic images instead of the merely sharp-looking reconstructions produced by rate-distortion-optimized models. In this paper, we propose a novel deep learning-based generative image compression method injected with diffusion knowledge, obtaining the capacity to recover more realistic textures in practical scenarios. Efforts are made from three perspectives to navigate the rate-distortion-realism trade-off in the generative image compression task. First, recognizing the strong connection between image texture and frequency-domain characteristics, we design a Fractal Frequency-Aware Band Image Compression (FFAB-IC) network to effectively capture the directional frequency components inherent in natural images. This network integrates commonly used fractal band feature operations within a neural non-linear mapping design, enhancing its ability to retain essential given information and filter out unnecessary details. Then, to improve the visual quality of image reconstruction under limited bandwidth, we integrate diffusion knowledge into the encoder and implement diffusion iterations in the decoder process, thus effectively recovering lost texture details. Finally, to fully leverage the spatial and frequency intensity information, we incorporate frequency- and content-aware regularization terms to regularize the training of the generative image compression network. Extensive experiments in quantitative and qualitative evaluations demonstrate the superiority of the proposed method, advancing the boundaries of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before.

Title: PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture

Authors: Xiaokang Wei, Bowen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11368
Pdf URL: https://arxiv.org/pdf/2503.11368
Copy Paste: [[2503.11368]] PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture(https://arxiv.org/abs/2503.11368)
Keywords: generation
Abstract: Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially varying properties of metallicity and roughness. In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates a novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model. Specifically, PBR3DGen leverages vision language models (VLM) to guide multi-view diffusion, precisely capturing the spatial distribution and inherent attributes of reflective-metalness material. Additionally, we incorporate view-dependent illumination-aware conditions as pixel-aware priors to enhance spatially varying material properties. Furthermore, our reconstruction model reconstructs high-quality meshes with PBR materials. Experimental results demonstrate that PBR3DGen significantly outperforms existing methods, achieving new state-of-the-art results for PBR estimation and mesh generation. More results and visualizations can be found on our project page: this https URL.

Title: Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding

Authors: David Gastager, Ghazal Ghazaei, Constantin Patsch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11392
Pdf URL: https://arxiv.org/pdf/2503.11392
Copy Paste: [[2503.11392]] Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding(https://arxiv.org/abs/2503.11392)
Keywords: generation, generative
Abstract: Automated surgical workflow analysis is crucial for education, research, and clinical decision-making, but the lack of annotated datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of annotated training data inspired by the human learning procedure of watching experts and understanding their explanations. Our method leverages a video-language model trained on alignment, denoising, and generative tasks to learn short-term spatio-temporal and multimodal representations. A task-specific temporal model is then used to capture relationships across entire videos. To achieve comprehensive video-language understanding in the surgical domain, we introduce a data collection and filtering strategy to construct a large-scale pretraining dataset from educational YouTube videos. We then utilize parameter-efficient fine-tuning by projecting downstream task annotations from publicly available surgical datasets into the language domain. Extensive experiments in two surgical domains demonstrate the effectiveness of our approach, with performance improvements of up to 7% in phase segmentation tasks, 8% in zero-shot phase segmentation, and comparable capabilities to fully-supervised models in few-shot settings. Harnessing our model's capabilities for long-range temporal localization and text generation, we present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing DVC datasets in the surgical domain. We introduce a novel approach to surgical workflow understanding that leverages video-language pretraining, large-scale video pretraining, and optimized fine-tuning. Our method improves performance over state-of-the-art techniques and enables new downstream tasks for surgical video understanding.
摘要：自动手术工作流程分析对于教育，研究和临床决策至关重要，但是缺乏注释的数据集阻碍了准确，全面的工作流程分析解决方案的发展。我们介绍了一种新颖的方法，以解决由人类学习程序启发的带注释培训数据的稀疏性和异质性，该过程是观察专家和理解其解释的过程。我们的方法利用了一个视频语言模型，该模型训练了对齐，denoing和生成任务，以学习短期时空和多模式表示。然后使用特定于任务的时间模型来捕获整个视频中的关系。为了在手术领域中获得全面的视频理解，我们引入了数据收集和过滤策略，以从教育YouTube视频中构建大规模的预处理数据集。然后，我们通过将公开可用手术数据集的下游任务注释投射到语言域，利用参数有效的微调。在两个手术结构域中进行的广泛实验证明了我们方法的有效性，在相分段任务中的性能提高了7％，在零拍动相分段中的8％，以及在几个少量设置中与完全监督模型的可比性。利用模型的远程时间定位和文本生成的功能，我们提供了第一个针对手术视频的密集视频字幕（DVC）的全面解决方案，尽管没有手术域中的现有DVC数据集，但仍解决了这项任务。我们介绍了一种新颖的方法，以了解手术工作流程，以利用视频语言预处理，大规模视频预处理和优化的微调。我们的方法提高了对最新技术的性能，并实现了新的下游任务以了解手术视频的理解。

Title: A Neural Network Architecture Based on Attention Gate Mechanism for 3D Magnetotelluric Forward Modeling

Authors: Xin Zhong, Weiwei Ling, Kejia Pan, Pinxia Wu, Jiajing Zhang, Zhiliang Zhan, Wenbo Xiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11408
Pdf URL: https://arxiv.org/pdf/2503.11408
Copy Paste: [[2503.11408]] A Neural Network Architecture Based on Attention Gate Mechanism for 3D Magnetotelluric Forward Modeling(https://arxiv.org/abs/2503.11408)
Keywords: generation
Abstract: Traditional three-dimensional magnetotelluric (MT) numerical forward modeling methods, such as the finite element method (FEM) and finite volume method (FVM), suffer from high computational costs and low efficiency due to limitations in mesh refinement and computational resources. We propose a novel neural network architecture named MTAGU-Net, which integrates an attention gating mechanism for 3D MT forward modeling. Specifically, a dual-path attention gating module is designed based on forward response data images and embedded in the skip connections between the encoder and decoder. This module enables the fusion of critical anomaly information from shallow feature maps during the decoding of deep feature maps, significantly enhancing the network's capability to extract features from anomalous regions. Furthermore, we introduce a synthetic model generation method utilizing 3D Gaussian random field (GRF), which accurately replicates the electrical structures of real-world geological scenarios with high fidelity. Numerical experiments demonstrate that MTAGU-Net outperforms conventional 3D U-Net in terms of convergence stability and prediction accuracy, with the structural similarity index (SSIM) of the forward response data consistently exceeding 0.98. Moreover, the network can accurately predict forward response data on previously unseen datasets models, demonstrating its strong generalization ability and validating the feasibility and effectiveness of this method in practical applications.
摘要：传统的三维磁铁（MT）数值向前建模方法，例如有限元方法（FEM）和有限体积方法（FVM），由于网格细化和计算资源的限制，较高的计算成本和低效率。我们提出了一种名为Mtagu-net的新型神经网络结构，该结构集成了用于3D MT正向建模的关注门控机制。具体而言，双路径注意门控模块是基于正向响应数据图像设计的，并嵌入编码器和解码器之间的跳过连接中。该模块可以在深度特征图的解码过程中从浅层特征地图中融合关键异常信息，从而显着增强了网络从异常区域提取特征的能力。此外，我们采用了3D高斯随机场（GRF）引入了一种合成模型生成方法，该方法可以准确地复制具有高忠诚度的现实世界地质场景的电结构。数值实验表明，在收敛稳定性和预测准确性方面，Mtagu-NET的表现优于常规的3D U-NET，而正向响应数据的结构相似性指数（SSIM）始终超过0.98。此外，网络可以准确地预测以前看不见的数据集模型上的远期响应数据，证明其强大的概括能力并验证该方法在实际应用中的可行性和有效性。
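
Editor's sketch: the abstract mentions generating synthetic models from a 3D Gaussian random field (GRF) but does not say how the field is constructed. The snippet below is one common generic approach (spectral filtering of white noise with a power-law spectrum), not the authors' method. The function name, the `alpha` exponent, and the grid size are assumptions made for illustration.

```python
import numpy as np

def gaussian_random_field_3d(shape=(16, 16, 16), alpha=3.0, seed=0):
    """Sample a zero-mean, unit-variance 3D Gaussian random field.

    Assumption: power spectrum P(k) ~ k^(-alpha); this choice is illustrative only.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)            # white Gaussian noise
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    k = np.sqrt(sum(f ** 2 for f in freqs))       # radial frequency magnitude
    k[0, 0, 0] = np.inf                           # makes the DC amplitude exactly 0
    amplitude = k ** (-alpha / 2.0)               # sqrt of the power spectrum
    field = np.fft.ifftn(np.fft.fftn(noise) * amplitude).real
    return (field - field.mean()) / field.std()   # normalise to mean 0, std 1

field = gaussian_random_field_3d()
```

A larger `alpha` yields smoother fields. Mapping the normalised field to physical quantities such as resistivity would need a further, model-specific step that is not shown here.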

Title: Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models

Authors: Xu Liu, Taha Aksu, Juncheng Liu, Qingsong Wen, Yuxuan Liang, Caiming Xiong, Silvio Savarese, Doyen Sahoo, Junnan Li, Chenghao Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11411
Pdf URL: https://arxiv.org/pdf/2503.11411
Copy Paste: [[2503.11411]] Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models(https://arxiv.org/abs/2503.11411)
Keywords: generation
Abstract: Time series analysis is crucial for understanding dynamics of complex systems. Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs), enabling generalized learning and integrating contextual information. However, their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints. Synthetic data emerge as a viable solution, addressing these challenges by offering scalable, unbiased, and high-quality alternatives. This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
摘要：时间序列分析对于理解复杂系统的动态至关重要。基础模型的最新进展导致了任务不合时宜的时间序列基础模型（TSFM）和大型基于语言模型的时间序列模型（TSLLMS），从而实现了广义学习和整合上下文信息。但是，它们的成功取决于大型，多样化和高质量的数据集，这对于监管，多样性，质量和数量限制而构建具有挑战性。合成数据作为一种可行的解决方案出现，通过提供可扩展，无偏和高质量的替代方案来解决这些挑战。该调查对TSFM和TSLLM的合成数据进行了全面的综述，分析了数据生成策略，它们在模型预审，微调和评估中的作用以及确定未来的研究方向。

Title: Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning

Authors: Chen Shu, Mengke Li, Yiqun Zhang, Yang Lu, Bo Han, Yiu-ming Cheung, Hanzi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11414
Pdf URL: https://arxiv.org/pdf/2503.11414
Copy Paste: [[2503.11414]] Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning(https://arxiv.org/abs/2503.11414)
Keywords: generation
Abstract: In real-world datasets, the challenges of long-tailed distributions and noisy labels often coexist, posing obstacles to model training and performance. Existing studies on long-tailed noisy label learning (LTNLL) typically assume that the generation of noisy labels is independent of the long-tailed distribution, which may not be true from a practical perspective. In real-world situations, we observe that tail class samples are more likely to be mislabeled as head, exacerbating the original degree of imbalance. We call this phenomenon "tail-to-head (T2H)" noise. T2H noise severely degrades model performance by polluting the head classes and forcing the model to learn the tail samples as head. To address this challenge, we investigate the dynamic misleading process of the noisy labels and propose a novel method called Disentangling and Unlearning for Long-tailed and Label-noisy data (DULL). It first employs Inner-Feature Disentangling (IFD) to disentangle features internally. Based on this, Inner-Feature Partial Unlearning (IFPU) is then applied to weaken and unlearn incorrect feature regions correlated to wrong classes. This method prevents the model from being misled by noisy labels, enhancing the model's robustness against noise. To provide a controlled experimental environment, we further propose a new noise addition algorithm to simulate T2H noise. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed method.
摘要：在实际数据集中，长尾分布和嘈杂标签的挑战通常并存，对模型培训和性能构成障碍。现有关于长尾嘈杂标签学习（LTNLL）的研究通常假定嘈杂标签的产生独立于长尾分布，从实际角度来看，这可能不是正确的。在现实世界中，我们观察到尾部类样品更有可能被标记为头部，加剧了原始的失衡程度。我们将这种现象称为``尾巴对（T2H）''的噪音。 T2H噪声通过污染头等类别并迫使模型将尾部样本作为头部来严重降低模型性能。为了应对这一挑战，我们研究了通用标签的动态误导过程，并提出了一种新颖的方法，称为长尾和标签 - 噪声数据（DULL），称为“解开和排尿”。它首先采用内部功能解开（IFD）来内部解开特征。基于此，将内部功能的部分学习（IFPU）应用于削弱和不正确的特征区域与错误的类别相关的特征区域。该方法防止模型被嘈杂的标签误导，从而增强了模型对噪声的鲁棒性。为了提供受控的实验环境，我们进一步提出了一种新的噪声添加算法来模拟T2H噪声。对模拟和现实世界数据集进行的广泛实验证明了我们提出的方法的有效性。
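
Editor's sketch: the abstract describes tail-to-head (T2H) noise, where tail-class samples are mislabeled as head classes, and mentions a noise addition algorithm without giving details. The snippet below is a hypothetical simulation of that idea, not the paper's algorithm. The function name, `noise_rate`, and `num_head` are illustrative assumptions.

```python
import numpy as np

def add_tail_to_head_noise(labels, noise_rate=0.3, num_head=1, seed=0):
    """Flip a fraction of non-head labels to a head class (hypothetical T2H simulation)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    order = np.argsort(-counts)                   # most frequent classes first
    head = classes[order[:num_head]]              # treat the top classes as "head"
    noisy = labels.copy()
    is_tail = ~np.isin(labels, head)              # every non-head sample may be flipped
    flip = is_tail & (rng.random(labels.shape[0]) < noise_rate)
    noisy[flip] = rng.choice(head, size=int(flip.sum()))
    return noisy

clean = np.array([0] * 100 + [1] * 20 + [2] * 10)
noisy = add_tail_to_head_noise(clean, noise_rate=0.3)
```

Because flips only move samples from tail classes into head classes, the head class grows and the imbalance worsens, which is the effect the abstract attributes to T2H noise.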

Title: From Generative AI to Innovative AI: An Evolutionary Roadmap

Authors: Seyed Mahmoud Sajjadi Mohammadabadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11419
Pdf URL: https://arxiv.org/pdf/2503.11419
Copy Paste: [[2503.11419]] From Generative AI to Innovative AI: An Evolutionary Roadmap(https://arxiv.org/abs/2503.11419)
Keywords: generative
Abstract: This paper explores the critical transition from Generative Artificial Intelligence (GenAI) to Innovative Artificial Intelligence (InAI). While recent advancements in GenAI have enabled systems to produce high-quality content across various domains, these models often lack the capacity for true innovation. In this context, innovation is defined as the ability to generate novel and useful outputs that go beyond mere replication of learned data. The paper examines this shift and proposes a roadmap for developing AI systems that can generate content and engage in autonomous problem-solving and creative ideation. The work provides both theoretical insights and practical strategies for advancing AI to a stage where it can genuinely innovate, contributing meaningfully to science, technology, and the arts.
摘要：本文探讨了从生成人工智能（Genai）到创新人工智能（INAI）的关键过渡。尽管Genai的最新进展使系统能够在各个领域生产高质量的内容，但这些模型通常缺乏真正的创新能力。在这种情况下，创新被定义为产生新颖和有用的输出的能力，这些输出仅仅是对学习数据的复制。该论文研究了这一转变，并提出了用于开发可以生成内容并参与自动解决问题和创造性构想的AI系统的路线图。这项工作提供了理论上的见解和实践策略，可以将AI推进到一个可以真正创新的阶段，从而有意义地为科学，技术和艺术做出了贡献。

Title: TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

Authors: Hongxiang Zhao, Xingchen Liu, Mutian Xu, Yiming Hao, Weikai Chen, Xiaoguang Han
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11423
Pdf URL: https://arxiv.org/pdf/2503.11423
Copy Paste: [[2503.11423]] TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation(https://arxiv.org/abs/2503.11423)
Keywords: generation
Abstract: We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach of generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. Towards this end, we introduce TASTE-Rob -- a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, resulting in achieving superior generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.
摘要：我们解决了现有数据集中的关键局限性和针对任务的手动相互作用视频生成的模型，这是生成用于机器人模仿学习的视频演示的关键方法。当前的数据集（例如EGO4D）通常会遇到不一致的观点观点和未对准的互动，从而导致视频质量降低并限制其用于精确模仿学习任务的适用性。为此，我们介绍了Taste-Rob，这是一个开创性的大规模数据集，其中包括100,856个以自我为中心的手动互动视频。每个视频都与语言说明进行了精心对齐，并从一致的摄像头观点记录下来，以确保互动的清晰度。通过微调味觉射击上的视频扩散模型（VDM），我们实现了逼真的对象相互作用，尽管我们观察到偶尔手动握住姿势的不一致。为了增强现实主义，我们引入了三阶段的姿势式管道，该管道提高了生成的视频的手部姿势精度。我们的策划数据集，再加上专门的姿势框架框架，可在产生高质量的，面向任务的手动相互作用视频方面取得了显着的性能，从而实现了卓越的可推广机器人操作。 Taste-Rob数据集将在出版物后公开提供，以促进该领域的进一步进步。

Title: D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

Authors: Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11441
Pdf URL: https://arxiv.org/pdf/2503.11441
Copy Paste: [[2503.11441]] D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning(https://arxiv.org/abs/2503.11441)
Keywords: generation
Abstract: Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
摘要：大型语言模型（LLMS）的指导调整的最新进展表明，一个小的高质量数据集可以显着为LLMS配备遵循指导的功能，表现优于大型数据集，通常会受到质量和冗余问题的负担。但是，挑战在于自动从大型数据集中识别有价值的子集，以提高教学调整的有效性和效率。在本文中，我们首先基于数据值的三个不同方面建立数据选择标准：多样性，难度和可靠性，然后提出D3方法，其中包括评分和选择的两个关键步骤。具体而言，在评分步骤中，我们定义了多样性函数来衡量样本独特性，并引入了基于不确定性的预测难度，通过减轻上下文导向的产生多样性的干扰来评估样品难度。此外，我们集成了外部LLM以进行可靠性评估。在选择步骤中，我们制定了D3加权核心物镜，该目标共同优化了数据值的三个方面，以解决最有价值的子集。 D3的两个步骤可以迭代多个回合，并结合反馈以适应选择焦点。在三个数据集上进行的实验证明了D3在使用整个数据集的少于10％的竞争性甚至较高的指导遵循功能中赋予LLM的有效性。

Title: T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation

Authors: Seyed Mohammad Hadi Hosseini, Amir Mohammad Izadi, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11481
Pdf URL: https://arxiv.org/pdf/2503.11481
Copy Paste: [[2503.11481]] T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation(https://arxiv.org/abs/2503.11481)
Keywords: generative
Abstract: Although recent text-to-image generative models have achieved impressive performance, they still often struggle with capturing the compositional complexities of prompts including attribute binding, and spatial relationships between different entities. This misalignment is not revealed by common evaluation metrics such as CLIPScore. Recent works have proposed evaluation metrics that utilize Visual Question Answering (VQA) by decomposing prompts into questions about the generated image for more robust compositional evaluation. Although these methods align better with human evaluations, they still fail to fully cover the compositionality within the image. To address this, we propose a novel metric that breaks down images into components, and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models. Code is available at this https URL T2I-FineEval.
摘要：尽管最近的文本到图像生成模型已经取得了令人印象深刻的性能，但他们仍然经常在捕获提示的组成复杂性（包括属性绑定）和不同实体之间的空间关系方面的组成复杂性。常见的评估指标（例如夹克）没有揭示这种未对准。最近的工作提出了通过将提示分解为有关生成图像的问题，以进行更强大的构图评估，以利用视觉问题答案（VQA）提出了评估指标。尽管这些方法与人类评估更好，但它们仍然无法完全涵盖图像中的组成性。为了解决这个问题，我们提出了一个新颖的指标，将图像分解为组件，并将文本分为有关生成的图像进行评估的细粒度问题。我们的方法的表现优于先前的最先进指标，证明了其在评估文本到图像生成模型方面的有效性。代码可在此HTTPS URL T2i-Fineeval上找到。

Title: HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

Authors: Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11513
Pdf URL: https://arxiv.org/pdf/2503.11513
Copy Paste: [[2503.11513]] HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models(https://arxiv.org/abs/2503.11513)
Keywords: generation
Abstract: Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. It introduces additional redundancy, abrupt variations, and a domain gap between language and vision tokens during generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by the observation in VQ-VAE-2 and workflows of traditional animation, we propose HiTVideo for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information with higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70% compared to baseline tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of highly compressed semantic tokens in text-to-video tasks. HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks, striving for higher compression ratios and simpler LLM modeling under language guidance, offering a scalable and promising framework for advancing text-to-video generation. Demo page: this https URL.
摘要：由于视频数据的固有复杂性，文本到视频的生成构成了巨大的挑战，视频数据既涵盖了时间尺寸和空间维度。它引入了额外的冗余，突然的变化以及一代时语言和视觉令牌之间的域间隙。应对这些挑战需要有效的视频令牌，可以在保留基本语义和时空信息的同时有效地编码视频数据，这是文本和视觉之间的关键桥梁。受到VQ-VAE-2的观察和传统动画的工作流程的启发，我们提出了Hitvideo，以使用层次的引物来进行文本到视频的生成。它利用带有多层离散令牌框架的3D因果关系，将视频内容编码为层次结构化的代码簿。较高的层捕获具有较高压缩的语义信息，而较低的层则集中在细粒的时空细节上，在压缩效率和重建质量之间达到平衡。我们的方法有效地编码了更长的视频序列（例如8秒，64帧），与基线象征器相比，每个像素（BPP）的位降低了约70 \％，同时保持竞争性重建质量。我们探索压缩和重建之间的权衡，同时强调文本到视频任务中高压缩语义令牌的优势。 HITVIDEO旨在解决文本到视频生成任务中现有视频引物的潜在局限性，努力追求更高的压缩比，并在语言指导下简化LLMS建模，为将文本推进到视频生成。演示页面：此HTTPS URL。

Title: Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Kaidi Xu, Mengshu Sun, Jindong Gu, Renjing Xu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11519
Pdf URL: https://arxiv.org/pdf/2503.11519
Copy Paste: [[2503.11519]] Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models(https://arxiv.org/abs/2503.11519)
Keywords: generation, generative
Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, cross-vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to generate disruptive outputs semantically related to those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of VLP tasks when injected into images. In this paper, we comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs. To better observe performance modifications and characteristics of this threat, we also introduce the TVPI Dataset. Through extensive explorations, we deepen the understanding of the underlying causes of the TVPI threat in various GMs and offer valuable insights into its potential origins.
摘要：当前的跨模式生成模型（GMS）在各种生成任务中都表现出显着的功能。鉴于在现实世界中，视觉方式输入的无处不在和信息丰富性，跨视觉，涵盖视觉语言感知（VLP）和图像对图像（I2I），任务引起了极大的关注。大型视觉语言模型（LVLM）和I2I GM分别用于处理VLP和I2I任务。先前的研究表明，将印刷词打印到输入图像中会显着诱导LVLM和I2I GM，以产生与这些单词具有语义相关的破坏性输出。此外，视觉提示是一种更复杂的版式形式，当注入图像时，还揭示了VLP任务的各种应用程序的安全风险。在本文中，我们全面研究了各种LVLM和I2I GMS中印刷视觉快速注射（TVPI）引起的性能影响。为了更好地观察这种威胁的性能修改和特征，我们还介绍了TVPI数据集。通过广泛的探索，我们加深了对各种总经理中TVPI威胁的根本原因的理解，并为其潜在起源提供了宝贵的见解。

Title: AugGen: Synthetic Augmentation Can Improve Discriminative Models

Authors: Parsa Rahimi, Damien Teney, Sebastien Marcel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11544
Pdf URL: https://arxiv.org/pdf/2503.11544
Copy Paste: [[2503.11544]] AugGen: Synthetic Augmentation Can Improve Discriminative Models(https://arxiv.org/abs/2503.11544)
Keywords: generation, generative
Abstract: The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1-12% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance. Project page: this https URL
摘要：机器学习中对大规模数据集的依赖越来越多，引入了重大的隐私和道德挑战。合成数据生成提供了有希望的解决方案；但是，大多数当前方法依赖于外部数据集或预训练的模型，这些模型增加了复杂性并升级了资源需求。在这项工作中，我们介绍了一种新型的独立合成增强技术，该技术从策略性地从目标数据集中训练有素的有条件生成模型进行采样。这种方法消除了对辅助数据源的需求。应用于面部识别数据集，我们的方法在IJB-C和IJB-B基准测试中获得了1--12 \％的性能改进。它的表现优于仅根据真实数据训练的模型，并且超过了最先进的合成数据生成基线的性能。值得注意的是，这些增强功能通常超过了通过建筑改进实现的增强，强调了合成增强在数据砂环境中的重大影响。这些发现表明，经过精心整合的合成数据不仅解决了隐私和资源约束，而且还大大提高了模型性能。项目页面此HTTPS URL

Title: From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting

Authors: Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2503.11615
Pdf URL: https://arxiv.org/pdf/2503.11615
Copy Paste: [[2503.11615]] From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting(https://arxiv.org/abs/2503.11615)
Keywords: generative
Abstract: Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first estimating the score function (the gradient of a smoothed log-distribution) and then applying a gradient-based sampling algorithm. The resulting distribution's correctness can be impacted by several factors: the generalization error due to a finite number of initial samples, the error in score matching, and the diffusion error introduced by the sampling algorithm. In this paper, we analyze the sampling process in a simple yet representative setting: sampling from Gaussian distributions using a Langevin diffusion sampler. We provide a sharp analysis of the Wasserstein sampling error that arises from the multiple sources of error throughout the pipeline. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the noise amplitude, the step sizes in both score matching and diffusion, and the number of initial samples. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy, such as adapting the noise amplitude to the choice of step sizes.
摘要：仅通过离散样本可访问的未知分布采样是生成AI核心的基本问题。当前的最新方法遵循两个步骤的过程：首先估计分数函数（平滑的对数分布的梯度），然后应用基于梯度的采样算法。所得分布的正确性可能会受到几个因素的影响：由于初始样本数量有限，得分匹配的误差以及采样算法引入的扩散误差引起的概括误差。在本文中，我们使用langevin扩散采样器从高斯分布中分析了简单但代表性的设置采样中的采样过程。我们提供了对Wasserstein采样误差的尖锐分析，该误差来自整个管道中多个错误源。这使我们能够严格跟踪数据分布的各向异性（由其功率谱编码）如何与端到端采样方法的关键参数相互作用，包括噪声振幅，分数匹配和扩散的步骤大小以及初始样品的数量。值得注意的是，我们表明，Wasserstein采样误差可以表示为数据功率谱的内核型规范，其中特定的内核取决于方法参数。该结果为进一步分析优化采样准确性所涉及的权衡的基础为基础，例如将噪声幅度调整为步骤尺寸的选择。
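
Editor's sketch: the setting in the abstract (Langevin sampling from a Gaussian) can be illustrated with the unadjusted Langevin algorithm, $x_{k+1} = x_k + h\,\nabla \log p(x_k) + \sqrt{2h}\,\xi_k$. The snippet below assumes the exact score of a diagonal Gaussian is known, so it shows only the discretisation error of the sampler, not the score-matching or finite-sample errors the paper analyses. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def langevin_sample_gaussian(mu, var, step=0.01, n_steps=1000, n_chains=5000, seed=0):
    """Run unadjusted Langevin dynamics targeting N(mu, diag(var)) with the exact score."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    x = np.zeros((n_chains, mu.size))             # all chains start at the origin
    for _ in range(n_steps):
        score = -(x - mu) / var                   # gradient of the Gaussian log-density
        x = x + step * score + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

samples = langevin_sample_gaussian(mu=[1.0, -2.0], var=[1.0, 0.25])
print(samples.mean(axis=0), samples.var(axis=0))
```

For this linear recursion the stationary variance per coordinate is $\sigma^2 / (1 - h/(2\sigma^2))$, slightly larger than $\sigma^2$, so the bias depends on the step size and on each direction's variance. That is a small instance of the anisotropy and step-size interaction the abstract refers to.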

Title: ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Authors: Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11647
Pdf URL: https://arxiv.org/pdf/2503.11647
Copy Paste: [[2503.11647]] ReCamMaster: Camera-Controlled Generative Rendering from A Single Video(https://arxiv.org/abs/2503.11647)
Keywords: super-resolution, generation, generative
Abstract: Camera control has been actively studied in text- or image-conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a simple yet powerful video conditioning mechanism, a capability often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Project page: this https URL
摘要：相机控制已在文本或图像条件的视频生成任务中积极研究。但是，尽管在视频创建领域的重要性，但给定视频的改变相机轨迹仍未探索。由于维持多框外观和动态同步的额外约束，这是非平凡的。为了解决这个问题，我们提出了Recammaster，这是一个由摄像机控制的生成视频重新渲染框架，可在新颖的相机轨迹上重现输入视频的动态场景。核心创新在于通过一种简单而强大的视频调节机制来利用预训练的文本对视频模型的生成能力 - 其能力在当前的研究中经常被忽略。为了克服合格的培训数据的稀缺性，我们使用虚幻引擎5构建了一个全面的多相机同步视频数据集，该数据集经过精心策划，以遵循现实世界的拍摄特征，涵盖了各种场景和相机的动作。它可以帮助模型推广到野外视频。最后，我们通过精心设计的培训策略进一步提高了各种投入的鲁棒性。广泛的实验表明，我们的方法基本上优于现有的最新方法和强大的基准。我们的方法还发现了在视频稳定，超分辨率和高架上的有希望的应用。项目页面：此HTTPS URL