2025-01-17

Title: Do generative video models learn physical principles from watching videos?

Authors: Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.09038
Pdf URL: https://arxiv.org/pdf/2501.09038
Copy Paste: [[2501.09038]] Do generative video models learn physical principles from watching videos?(https://arxiv.org/abs/2501.09038)
Keywords: generation, generative
Abstract: AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at this https URL; code at this https URL.

Title: Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing

Authors: Fan Yuan, Xiaoyuan Fang, Rong Quan, Jing Li, Wei Bi, Xiaogang Xu, Piji Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.09041
Pdf URL: https://arxiv.org/pdf/2501.09041
Copy Paste: [[2501.09041]] Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing(https://arxiv.org/abs/2501.09041)
Keywords: generation, generative
Abstract: Visual Commonsense Reasoning, which is regarded as a challenging task in the pursuit of advanced visual scene comprehension, has been used to diagnose the reasoning ability of AI systems. However, reliable reasoning requires a good grasp of the scene's details. Existing work fails to effectively exploit the real-world object relationship information present within the scene, and instead overly relies on knowledge from training memory. Based on these observations, we propose a novel scene-graph-enhanced visual commonsense reasoning generation method named G2, which first utilizes the image patches and LLMs to construct a location-free scene graph, and then answers and explains based on the scene graph's information. We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training. Extensive experiments are conducted on the tasks and datasets of scene graph constructing and visual commonsense answering and explaining, respectively. Experimental results and ablation analysis demonstrate the effectiveness of our proposed framework.

Title: CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion

Authors: Yuan Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.09042
Pdf URL: https://arxiv.org/pdf/2501.09042
Copy Paste: [[2501.09042]] CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion(https://arxiv.org/abs/2501.09042)
Keywords: generation
Abstract: Recent advancements in text-to-image generation models have excelled in creating diverse and realistic images. This success extends to food imagery, where various conditional inputs like cooking styles, ingredients, and recipes are utilized. However, a yet-unexplored challenge is generating a sequence of procedural images based on cooking steps from a recipe. This could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called cooking procedural image generation. This task is inherently demanding, as it strives to create photo-realistic images that align with cooking steps while preserving sequential consistency. To collectively tackle these challenges, we present CookingDiffusion, a novel approach that leverages Stable Diffusion and three innovative Memory Nets to model procedural prompts. These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images. To validate the effectiveness of our approach, we preprocess the YouCookII dataset, establishing a new benchmark. Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images with remarkable consistency across sequential cooking steps, as measured by both the FID and the proposed Average Procedure Consistency metrics. Furthermore, CookingDiffusion demonstrates the ability to manipulate ingredients and cooking methods in a recipe. We will make our code, models, and dataset publicly accessible.

Title: Generating Realistic Synthetic Head Rotation Data for Extended Reality using Deep Learning

Authors: Jakob Struye, Filip Lemic, Jeroen Famaey
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09050
Pdf URL: https://arxiv.org/pdf/2501.09050
Copy Paste: [[2501.09050]] Generating Realistic Synthetic Head Rotation Data for Extended Reality using Deep Learning(https://arxiv.org/abs/2501.09050)
Keywords: generation, generative
Abstract: Extended Reality is a revolutionary method of delivering multimedia content to users. A large contributor to its popularity is the sense of immersion and interactivity enabled by having real-world motion reflected in the virtual experience accurately and immediately. This user motion, mainly caused by head rotations, induces several technical challenges. For instance, which content is generated and transmitted depends heavily on where the user is looking. Seamless systems, taking user motion into account proactively, will therefore require accurate predictions of upcoming rotations. Training and evaluating such predictors requires vast amounts of orientational input data, which is expensive to gather, as it requires human test subjects. A more feasible approach is to gather a modest dataset through test subjects, and then extend it to a more sizeable set using synthetic data generation methods. In this work, we present a head rotation time series generator based on TimeGAN, an extension of the well-known Generative Adversarial Network, designed specifically for generating time series. This approach is able to extend a dataset of head rotations with new samples closely matching the distribution of the measured time series.

Title: SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation

Authors: Tianxiang Xia, Lin Xiao, Yannick Montorfani, Francesco Pavia, Enis Simsar, Thomas Hofmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09055
Pdf URL: https://arxiv.org/pdf/2501.09055
Copy Paste: [[2501.09055]] SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation(https://arxiv.org/abs/2501.09055)
Keywords: generation
Abstract: In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. To this end, we build on the CONFORM framework, which uses contrastive learning to improve the accuracy of the generated image for multiple objects. However, the depiction of actions that involve multiple different objects still has considerable room for improvement. To improve it, we employ semantically hypergraphic contrastive adjacency learning, a comprehension of enhanced contrastive structure and a "contrast but link" technique. We further amend Stable Diffusion's understanding of actions with InteractDiffusion. As evaluation metrics we use the image-text similarity measures CLIP and TIFA. In addition, we conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands only moderately well. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: this https URL

Title: Generative Medical Image Anonymization Based on Latent Code Projection and Optimization

Authors: Huiyu Li, Nicholas Ayache, Hervé Delingette
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09114
Pdf URL: https://arxiv.org/pdf/2501.09114
Copy Paste: [[2501.09114]] Generative Medical Image Anonymization Based on Latent Code Projection and Optimization(https://arxiv.org/abs/2501.09114)
Keywords: generative
Abstract: Medical image anonymization aims to protect patient privacy by removing identifying information, while preserving the data utility to solve downstream tasks. In this paper, we address the medical image anonymization problem with a two-stage solution: latent code projection and optimization. In the projection stage, we design a streamlined encoder to project input images into a latent space and propose a co-training scheme to enhance the projection process. In the optimization stage, we refine the latent code using two deep loss functions designed to address the trade-off between identity protection and data utility dedicated to medical images. Through a comprehensive set of qualitative and quantitative experiments, we showcase the effectiveness of our approach on the MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that can serve as a training set for detecting lung pathologies. Source code is available at this https URL.

Title: Attention is All You Need Until You Need Retention

Authors: M. Murat Yaslioglu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09166
Pdf URL: https://arxiv.org/pdf/2501.09166
Copy Paste: [[2501.09166]] Attention is All You Need Until You Need Retention(https://arxiv.org/abs/2501.09166)
Keywords: generation, generative
Abstract: This work introduces a novel Retention Layer mechanism for Transformer-based architectures, addressing their lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real-time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context-sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real-time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real-world challenges effectively. By emulating key aspects of human learning, this retention-enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session-aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.

Title: Grounding Text-To-Image Diffusion Models For Controlled High-Quality Image Generation

Authors: Ahmad Süleyman, Göksel Biricik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09194
Pdf URL: https://arxiv.org/pdf/2501.09194
Copy Paste: [[2501.09194]] Grounding Text-To-Image Diffusion Models For Controlled High-Quality Image Generation(https://arxiv.org/abs/2501.09194)
Keywords: generation, generative
Abstract: Large-scale text-to-image (T2I) diffusion models have demonstrated outstanding performance in synthesizing diverse high-quality visuals from natural language text captions. Multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this work, we present ObjectDiffusion, a model that takes inspiration from the top cutting-edge image generative frameworks to seamlessly condition T2I models with new bounding-box capabilities. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pretraining parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP$_{50}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets on all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple objects of different sizes and locations.
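The AP$_{50}$ figure in the ObjectDiffusion abstract relies on matching generated objects to ground-truth bounding boxes at an intersection-over-union (IoU) threshold of 0.5. The sketch below shows only that matching criterion, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function names `box_iou` and `is_match_at_50` are hypothetical, and this is not the paper's evaluation code; a full AP computation also ranks detections by confidence and integrates a precision-recall curve.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred, gt, threshold=0.5):
    """A prediction counts as a hit for AP50 when IoU >= 0.5 (assumed convention)."""
    return box_iou(pred, gt) >= threshold
```

For example, a predicted box (0, 0, 10, 10) against a ground-truth box (0, 0, 10, 8) has an IoU of 0.8 and would count as a match under this criterion.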

Title: Unified Few-shot Crack Segmentation and its Precise 3D Automatic Measurement in Concrete Structures

Authors: Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He, Takashi Matsumoto
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.09203
Pdf URL: https://arxiv.org/pdf/2501.09203
Copy Paste: [[2501.09203]] Unified Few-shot Crack Segmentation and its Precise 3D Automatic Measurement in Concrete Structures(https://arxiv.org/abs/2501.09203)
Keywords: generation
Abstract: Visual-Spatial Systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement was proposed in this study by integrating computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). Firstly, building on a base DeepLabv3+ segmentation model, and incorporating specific refinements utilizing the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of cracks were measured automatically and directly within 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

Title: Knowledge Distillation for Image Restoration : Simultaneous Learning from Degraded and Clean Images

Authors: Yongheng Zhang, Danfeng Yan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.09268
Pdf URL: https://arxiv.org/pdf/2501.09268
Copy Paste: [[2501.09268]] Knowledge Distillation for Image Restoration : Simultaneous Learning from Degraded and Clean Images(https://arxiv.org/abs/2501.09268)
Keywords: restoration
Abstract: Model compression through knowledge distillation has seen extensive application in classification and segmentation tasks. However, its potential in image-to-image translation, particularly in image restoration, remains underexplored. To address this gap, we propose a Simultaneous Learning Knowledge Distillation (SLKD) framework tailored for model compression in image restoration tasks. SLKD employs a dual-teacher, single-student architecture with two distinct learning strategies applied simultaneously: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL). In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, with the assistance of a proposed PIQE extractor. These strategies enable the student to learn from degraded and clean images simultaneously, ensuring high-quality compression of image restoration models. Experimental results across five datasets and three tasks demonstrate that SLKD achieves substantial reductions in FLOPs and parameters, exceeding 80%, while maintaining strong image restoration performance.

Title: Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding

Authors: Kohei Torimi, Ryosuke Yamada, Daichi Otsuka, Kensho Hara, Yuki M. Asano, Hirokatsu Kataoka, Yoshimitsu Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09278
Pdf URL: https://arxiv.org/pdf/2501.09278
Copy Paste: [[2501.09278]] Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding(https://arxiv.org/abs/2501.09278)
Keywords: generative
Abstract: Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential for using generated data as training data. This naturally raises the question: Can synthetic 3D data generated by generative models be used to expand limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves SoTA in zero-shot 3D classification, and uses a generative text-to-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples where semantics and geometric shapes do not match the text. In the experiment to double the original dataset size using TeGA, our approach demonstrates improvements over the baselines, achieving zero-shot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision applications.

Title: Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression

Authors: Yongheng Zhang, Danfeng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09321
Pdf URL: https://arxiv.org/pdf/2501.09321
Copy Paste: [[2501.09321]] Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression(https://arxiv.org/abs/2501.09321)
Keywords: restoration
Abstract: Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity (manifested in elevated FLOPs and parameter counts) limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
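The SKD abstract mentions measuring the distance between student and teacher features in kernel space with a Gaussian kernel, without giving the formula. One common way to do this, shown here purely as an assumption, uses the identity ||phi(s) - phi(t)||^2 = k(s, s) - 2 k(s, t) + k(t, t), where k is the Gaussian (RBF) kernel. The function names and the bandwidth parameter `sigma` are hypothetical, and the paper's actual loss may differ.

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel k(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_space_sq_distance(student, teacher, sigma=1.0):
    """Squared distance between two feature vectors in the kernel's feature space.

    Uses k(s, s) - 2 * k(s, t) + k(t, t); for a Gaussian kernel this
    simplifies to 2 - 2 * k(s, t), so the value is bounded in [0, 2].
    """
    return (gaussian_kernel(student, student, sigma)
            - 2.0 * gaussian_kernel(student, teacher, sigma)
            + gaussian_kernel(teacher, teacher, sigma))
```

Because the result is bounded between 0 and 2, a very large mismatch between features cannot blow up the loss, which is one plausible reading of the "stable" feature learning the abstract refers to.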

Title: SVIA: A Street View Image Anonymization Framework for Self-Driving Applications

Authors: Dongyu Liu, Xuhong Wang, Cen Chen, Yanhao Wang, Shengyue Yao, Yilun Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09393
Pdf URL: https://arxiv.org/pdf/2501.09393
Copy Paste: [[2501.09393]] SVIA: A Street View Image Anonymization Framework for Self-Driving Applications(https://arxiv.org/abs/2501.09393)
Keywords: generation
Abstract: In recent years, there has been an increasing interest in image anonymization, particularly focusing on the de-identification of faces and individuals. However, for self-driving applications, merely de-identifying faces and individuals might not provide sufficient privacy protection since street views like vehicles and buildings can still disclose locations, trajectories, and other sensitive information. Therefore, it remains crucial to extend anonymization techniques to street view images to fully preserve the privacy of users, pedestrians, and vehicles. In this paper, we propose a Street View Image Anonymization (SVIA) framework for self-driving applications. The SVIA framework consists of three integral components: a semantic segmenter to segment an input image into functional regions, an inpainter to generate alternatives to privacy-sensitive regions, and a harmonizer to seamlessly stitch modified regions to guarantee visual coherence. Compared to existing methods, SVIA achieves a much better trade-off between image generation quality and privacy protection, as evidenced by experimental results for five common metrics on two widely used public datasets.

Title: FASP: Fast and Accurate Structured Pruning of Large Language Models

Authors: Hanyu Hu, Pengxiang Zhao, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.09412
Pdf URL: https://arxiv.org/pdf/2501.09412
Copy Paste: [[2501.09412]] FASP: Fast and Accurate Structured Pruning of Large Language Models(https://arxiv.org/abs/2501.09412)
Keywords: restoration
Abstract: The rapid increase in the size of large language models (LLMs) has significantly escalated their computational and memory demands, posing challenges for efficient deployment, especially on resource-constrained devices. Structured pruning has emerged as an effective model compression method that can reduce these demands while preserving performance. In this paper, we introduce FASP (Fast and Accurate Structured Pruning), a novel structured pruning framework for LLMs that emphasizes both speed and accuracy. FASP employs a distinctive pruning structure that interlinks sequential layers, allowing for the removal of columns in one layer while simultaneously eliminating corresponding rows in the preceding layer without incurring additional performance loss. The pruning metric, inspired by Wanda, is computationally efficient and effectively selects components to prune. Additionally, we propose a restoration mechanism that enhances model fidelity by adjusting the remaining weights post-pruning. We evaluate FASP on the OPT and LLaMA model families, demonstrating superior performance in terms of perplexity and accuracy on downstream tasks compared to state-of-the-art methods. Our approach achieves significant speed-ups, pruning models such as OPT-125M in 17 seconds and LLaMA-30B in 15 minutes on a single NVIDIA RTX 4090 GPU, making it a highly practical solution for optimizing LLMs.
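The coupled pruning structure described in the FASP abstract (removing columns in one layer together with the corresponding rows in the preceding layer) can be illustrated on a toy two-layer network. The sketch below is an illustration under stated assumptions, not FASP itself: the layers are plain lists of rows, the nonlinearity is ReLU, and the per-unit score aggregates a Wanda-style weight-magnitude-times-activation-norm product over each column, which is my own simplification. All function names are hypothetical, and the restoration step from the abstract is not shown.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def forward(W1, W2, x):
    """Toy two-layer network: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))


def wanda_like_scores(W2, hidden_acts):
    """Score hidden unit j as (sum_i |W2[i][j]|) * ||activations of unit j||_2.

    Assumption: the column-wise sum is an illustrative aggregation only.
    """
    scores = []
    for j in range(len(W2[0])):
        act_norm = sum(h[j] ** 2 for h in hidden_acts) ** 0.5
        col_mass = sum(abs(row[j]) for row in W2)
        scores.append(col_mass * act_norm)
    return scores


def select_keep(scores, k):
    """Indices of the k highest-scoring hidden units, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def prune_coupled(W1, W2, keep):
    """Drop columns of W2 and the matching rows of W1 for units not in `keep`."""
    W1_small = [W1[j] for j in keep]
    W2_small = [[row[j] for j in keep] for row in W2]
    return W1_small, W2_small
```

Dropping a row of the preceding layer only removes a hidden unit whose outgoing column has already been removed, so the compact network computes exactly what the original network would compute with those columns zeroed. That is the sense in which the row removal adds no further loss in this toy setting.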

Title: Dynamic Neural Style Transfer for Artistic Image Generation using VGG19

Authors: Kapil Kashyap, Mehak Garg, Sean Fargose, Sindhu Nair
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.09420
Pdf URL: https://arxiv.org/pdf/2501.09420
Copy Paste: [[2501.09420]] Dynamic Neural Style Transfer for Artistic Image Generation using VGG19(https://arxiv.org/abs/2501.09420)
Keywords: generation
Abstract: Throughout history, humans have created remarkable works of art, but artificial intelligence has only recently started to make strides in generating visually compelling art. Breakthroughs in the past few years have focused on using convolutional neural networks (CNNs) to separate and manipulate the content and style of images, applying texture synthesis techniques. Nevertheless, a number of current techniques continue to encounter obstacles, including lengthy processing times, restricted choices of style images, and the inability to modify the weight ratio of styles. To address these constraints, we propose a neural style transfer system that can add various artistic styles to a desired image, allowing flexible adjustments to style weight ratios and reducing processing time. The system uses the VGG19 model for feature extraction, ensuring high-quality, flexible stylization without compromising content integrity.
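Adjustable style weight ratios can be sketched with the standard Gram-matrix style loss. This is a minimal sketch in plain Python, assuming feature maps have already been extracted (for example from VGG19 layers) and flattened to one list per channel. The function names and the normalization of the weights are illustrative assumptions, not the system described in the abstract.

```python
def gram_matrix(features):
    """Channel-by-channel correlation matrix of a feature map.

    features: list of C channels, each a flat list of N activations.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

def style_loss(gen_feats, style_feats):
    """Mean squared difference between the two Gram matrices."""
    g, s = gram_matrix(gen_feats), gram_matrix(style_feats)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)

def blended_style_loss(gen_feats, styles, weights):
    """Combine several style losses using user-chosen weight ratios."""
    total = sum(weights)
    return sum(w / total * style_loss(gen_feats, s)
               for w, s in zip(weights, styles))
```

With weights of 3 and 1, for instance, the first style contributes three quarters of the style term, so changing the ratio shifts the stylization without touching the content loss.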

Title: CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation

Authors: Hwan Heo, Jangyeong Kim, Seongyeong Lee, Jeong A Wi, Junyoung Choi, Sangjun Ahn
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2501.09433
Pdf URL: https://arxiv.org/pdf/2501.09433
Copy Paste: [[2501.09433]] CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation(https://arxiv.org/abs/2501.09433)
Keywords: generation, generative
Abstract: The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce CaPa, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.

Title: Pruning for Sparse Diffusion Models based on Gradient Flow

Authors: Ben Wan, Tianyi Zheng, Zhaoyu Chen, Yuxiao Wang, Jia Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.09464
Pdf URL: https://arxiv.org/pdf/2501.09464
Copy Paste: [[2501.09464]] Pruning for Sparse Diffusion Models based on Gradient Flow(https://arxiv.org/abs/2501.09464)
Keywords: generation
Abstract: Diffusion Models (DMs) have impressive capabilities among generation models, but are limited by slow inference speeds and high computational costs. Previous works utilize one-shot structure pruning to derive lightweight DMs from pre-trained ones, but this approach often leads to a significant drop in generation quality and may result in the removal of crucial weights. Thus we propose an iterative pruning method based on gradient flow, including the gradient flow pruning process and the gradient flow pruning criterion. We employ a progressive soft pruning strategy to maintain the continuity of the mask matrix and guide it along the gradient flow of the energy function based on the pruning criterion in sparse space, thereby avoiding the sudden information loss typically caused by one-shot pruning. The gradient-flow-based criterion prunes parameters whose removal increases the gradient norm of the loss function and can enable fast convergence for a pruned model in the iterative pruning stage. Our extensive experiments on widely used datasets demonstrate that our method achieves superior performance in efficiency and consistency with pre-trained models.

Title: VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization

Authors: Zixun Fang, Zhiheng Liu, Kai Zhu, Yu Liu, Ka Leong Cheng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09499
Pdf URL: https://arxiv.org/pdf/2501.09499
Copy Paste: [[2501.09499]] VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization(https://arxiv.org/abs/2501.09499)
Keywords: generation
Abstract: Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color fidelity. Project page: this https URL.

Title: AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation

Authors: Junjie He, Yuxiang Tuo, Binghui Chen, Chongyang Zhong, Yifeng Geng, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09503
Pdf URL: https://arxiv.org/pdf/2501.09503
Copy Paste: [[2501.09503]] AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation(https://arxiv.org/abs/2501.09503)
Keywords: generation, generative
Abstract: Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory achieves high-fidelity personalization not only for single subjects but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory utilizes a universal and powerful image encoder, i.e., ReferenceNet, in conjunction with a CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory utilizes a decoupled instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning text descriptions, and personalizing for multiple subjects. The project page is at this https URL.

Title: Confidence Estimation for Error Detection in Text-to-SQL Systems

Authors: Oleg Somov, Elena Tutubalina
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2501.09527
Pdf URL: https://arxiv.org/pdf/2501.09527
Copy Paste: [[2501.09527]] Confidence Estimation for Error Detection in Text-to-SQL Systems(https://arxiv.org/abs/2501.09527)
Keywords: generation
Abstract: Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better model alignment between confidence and accuracy. Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3; thus, the designated external entropy-based selective classifier has better performance. The study also reveals that, in terms of error detection, the selective classifier is more likely to detect errors associated with irrelevant questions than errors from incorrect query generations.
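The coverage-risk trade-off of an entropy-based selective classifier can be sketched as follows. This is a minimal illustration, assuming per-token probability distributions are available for each generated query and that a query is accepted when its mean token entropy falls below a threshold. The function names and the thresholding rule are assumptions, not the paper's exact setup.

```python
import math

def sequence_entropy(token_dists):
    """Mean Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_dists]
    return sum(entropies) / len(entropies)

def coverage_and_risk(entropies, correct, threshold):
    """Accept predictions whose entropy is at most `threshold`.

    coverage: fraction of queries that are answered rather than rejected.
    risk: error rate among the answered queries (0.0 if none are answered).
    """
    accepted = [c for e, c in zip(entropies, correct) if e <= threshold]
    coverage = len(accepted) / len(entropies)
    risk = (1 - sum(accepted) / len(accepted)) if accepted else 0.0
    return coverage, risk
```

Sweeping the threshold traces out a coverage-risk curve: a lower threshold rejects more queries but, for a well-calibrated model, leaves fewer errors among the accepted ones.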

Title: Intra-day Solar and Power Forecast for Optimization of Intraday Market Participation

Authors: Nelson Salazar-Peña, Adolfo Palma-Vergara, Mateo Montes, María Alejandra Vargas-Torres, Adriana Salinas, Andrés Velasco, Alejandra Tabares, Andrés González-Mancera
Subjects: cs.LG, eess.SP, eess.SY
Abstract URL: https://arxiv.org/abs/2501.09551
Pdf URL: https://arxiv.org/pdf/2501.09551
Copy Paste: [[2501.09551]] Intra-day Solar and Power Forecast for Optimization of Intraday Market Participation(https://arxiv.org/abs/2501.09551)
Keywords: generation
Abstract: The prediction of solar irradiance enhances reliability in photovoltaic (PV) solar plant generation and grid integration. In Colombia, PV plants face penalties if energy production deviates beyond governmental thresholds from intraday market offers. This research employs Long Short-Term Memory (LSTM) and Bidirectional-LSTM (Bi-LSTM) models, utilizing meteorological data from a PV plant in El Paso, Cesar, Colombia, to predict solar irradiance with a 6-hour horizon and 10-minute resolution. While Bi-LSTM showed superior performance, the LSTM model achieved comparable results with significantly reduced training time (6 hours versus 18 hours), making it computationally advantageous. The LSTM predictions were averaged to create an hourly resolution model, evaluated using Mean Absolute Error, Root-Mean-Square Error, Normalized Root-Mean-Square Error, and Mean Absolute Percentage Error metrics. Comparison with the Global Forecast System (GFS) revealed similar performance, with both models effectively capturing daily solar irradiance patterns. The forecast model integrates with an Object-Oriented power production model, enabling accurate energy offers in the intraday market while minimizing penalty costs.
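The aggregation and evaluation steps named in the abstract can be sketched in a few lines. This is a minimal sketch, assuming six consecutive 10-minute predictions are averaged into one hourly value and that NRMSE is normalized by the range of the observations. The abstract does not state the normalization, so that choice, like the function names, is an assumption.

```python
import math

def hourly_means(ten_min_values):
    """Average each complete group of six 10-minute values into one hourly value."""
    return [sum(ten_min_values[i:i + 6]) / 6
            for i in range(0, len(ten_min_values) - 5, 6)]

def error_metrics(actual, predicted):
    """Return MAE, RMSE, range-normalized RMSE, and MAPE (in percent)."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    span = max(actual) - min(actual)
    nrmse = rmse / span if span else float("nan")
    # MAPE is undefined where the observation is zero (e.g. at night),
    # so those samples are skipped here.
    ratios = [abs(e) / abs(a) for e, a in zip(errors, actual) if a != 0]
    mape = 100 * sum(ratios) / len(ratios) if ratios else float("nan")
    return {"mae": mae, "rmse": rmse, "nrmse": nrmse, "mape": mape}
```

Skipping zero observations in MAPE matters for irradiance data, where night-time values would otherwise cause a division by zero.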

Title: Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis

Authors: Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09555
Pdf URL: https://arxiv.org/pdf/2501.09555
Copy Paste: [[2501.09555]] Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis(https://arxiv.org/abs/2501.09555)
Keywords: generative
Abstract: Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.

Title: Sequential PatchCore: Anomaly Detection for Surface Inspection using Synthetic Impurities

Authors: Runzhou Mao, Juraj Fulir, Christoph Garth, Petra Gospodnetić
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.09579
Pdf URL: https://arxiv.org/pdf/2501.09579
Copy Paste: [[2501.09579]] Sequential PatchCore: Anomaly Detection for Surface Inspection using Synthetic Impurities(https://arxiv.org/abs/2501.09579)
Keywords: generation
Abstract: The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that causes degradation of automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore - a method to build coresets sequentially and make training on large images using consumer-grade hardware tractable. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the extended performance benefits of finetuning the coreset using real data. We observed how the impurities and labelling ambiguity lower the model performance and have additionally reported the defect-wise recall to provide an industrially relevant perspective on model performance.
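One way to read "building coresets sequentially" is sketched below: patch features arrive in batches, and after each batch the current coreset is merged with the new features and reduced again by greedy k-center selection, so at most the coreset plus one batch is held in memory. The merge-and-reselect scheme and all names are assumptions for illustration. The abstract does not specify the actual procedure.

```python
def greedy_coreset(points, k):
    """Greedy k-center selection: repeatedly add the point farthest
    from the points selected so far."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    if len(points) <= k:
        return list(points)
    selected = [points[0]]
    min_d = [dist2(p, points[0]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=lambda i: min_d[i])
        selected.append(points[idx])
        min_d = [min(d, dist2(p, points[idx])) for d, p in zip(min_d, points)]
    return selected

def sequential_coreset(batches, k):
    """Fold feature batches into a coreset of at most k points, one batch at a time."""
    coreset = []
    for batch in batches:
        coreset = greedy_coreset(coreset + list(batch), k)
    return coreset
```

A coreset built this way on one dataset version can also serve as the starting point when folding in batches from another version, which is the kind of reuse the abstract describes as transfer learning.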

Title: WMamba: Wavelet-based Mamba for Face Forgery Detection

Authors: Siran Peng, Tianshuo Zhang, Li Gao, Xiangyu Zhu, Haoyuan Zhang, Kai Pang, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09617
Pdf URL: https://arxiv.org/pdf/2501.09617
Copy Paste: [[2501.09617]] WMamba: Wavelet-based Mamba for Face Forgery Detection(https://arxiv.org/abs/2501.09617)
Keywords: generation
Abstract: With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.

Title: Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

Authors: Yushen Lin, Ruichen Zhang, Wenqi Huang, Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Dusit Niyato
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.09631
Pdf URL: https://arxiv.org/pdf/2501.09631
Copy Paste: [[2501.09631]] Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework(https://arxiv.org/abs/2501.09631)
Keywords: generation
Abstract: In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) specifically for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data, with performance boosts of 2.24% and 1.31% for different models compared to baselines, respectively. To demonstrate the effectiveness of the fine-tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non-orthogonal multiple access (NOMA), which are generated by using the proposed multi-agent framework. Simulation results show a significant performance gain of 20.9% in the ROUGE-L metric on summarization tasks. We also study the scaling laws of fine-tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. This dataset and fine-tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications.
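Pointwise V-information compares how likely a model finds the gold label with and without the input: PVI(x → y) = −log2 g′[∅](y) + log2 g[x](y), where g is fine-tuned on real inputs and g′ on empty inputs. The sketch below assumes both probabilities are already available. How the paper uses PVI during fine-tuning is not stated in the abstract, so the ranking helper is only a hypothetical illustration.

```python
import math

def pvi(p_label_given_input, p_label_given_null):
    """Pointwise V-information in bits.

    p_label_given_input: probability of the gold label from the model
        fine-tuned with real inputs.
    p_label_given_null: probability of the gold label from the model
        fine-tuned with empty (null) inputs.
    """
    return math.log2(p_label_given_input) - math.log2(p_label_given_null)

def rank_by_pvi(examples):
    """Sort (example_id, p_input, p_null) tuples from most to least informative."""
    return [ex_id for ex_id, p_in, p_null in
            sorted(examples, key=lambda ex: pvi(ex[1], ex[2]), reverse=True)]
```

A high PVI means the input carries usable information about its label for that model family; a negative PVI suggests the example is harder than guessing from the label prior alone.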

Title: A Simple Aerial Detection Baseline of Multimodal Language Models

Authors: Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09720
Pdf URL: https://arxiv.org/pdf/2501.09720
Copy Paste: [[2501.09720]] A Simple Aerial Detection Baseline of Multimodal Language Models(https://arxiv.org/abs/2501.09720)
Keywords: generative
Abstract: The multimodal language models (MLMs) based on generative pre-trained Transformer are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.

Title: Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09732
Pdf URL: https://arxiv.org/pdf/2501.09732
Copy Paste: [[2501.09732]] Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps(https://arxiv.org/abs/2501.09732)
Keywords: generation, generative
Abstract: Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and, given the complicated nature of images, combinations of the components in the framework can be specifically chosen to suit different application scenarios.
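The two axes described in the abstract, a verifier that scores samples and an algorithm that proposes noise candidates, can be illustrated with the simplest possible pairing: random search that keeps the best-scoring candidate. This is a minimal sketch. `generate` and `verifier` are hypothetical callables standing in for a diffusion sampler and a scoring model, and random search is used here only as an example algorithm.

```python
import random

def random_search(generate, verifier, n_candidates, noise_dim, seed=0):
    """Draw n_candidates Gaussian noise vectors and keep the one whose
    generated sample receives the highest verifier score."""
    rng = random.Random(seed)
    best_noise, best_sample, best_score = None, None, float("-inf")
    for _ in range(n_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        sample = generate(noise)   # stand-in for running the diffusion sampler
        score = verifier(sample)   # stand-in for a learned or rule-based verifier
        if score > best_score:
            best_noise, best_sample, best_score = noise, sample, score
    return best_noise, best_sample, best_score
```

Inference-time compute grows linearly with `n_candidates`, since each candidate requires one full sampling run, and the best score can only stay the same or improve as more candidates are drawn from the same stream.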

Title: Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.09755
Pdf URL: https://arxiv.org/pdf/2501.09755
Copy Paste: [[2501.09755]] Learnings from Scaling Visual Tokenizers for Reconstruction and Generation(https://arxiv.org/abs/2501.09755)
Keywords: generation, generative
Abstract: Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill in this blank. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.