2025-01-16

Title: SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Authors: Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08347
Pdf URL: https://arxiv.org/pdf/2501.08347
Copy Paste: [[2501.08347]] SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval(https://arxiv.org/abs/2501.08347)
Keywords: generative
Abstract: Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
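Sketch: The proxy-target idea in the SCOT abstract above can be illustrated with an InfoNCE-style contrastive loss, where a composed (image + modification text) embedding is pulled toward a text embedding that stands in for the target image. This is only a minimal sketch under stated assumptions, not the paper's implementation: the random arrays replace real encoder outputs, the linear map `W` is a hypothetical composition network, and the temperature value is an arbitrary choice.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(queries, targets, temperature=0.07):
    # Row i of `queries` is the positive for row i of `targets`;
    # every other row in the batch acts as a negative.
    q = l2_normalize(queries)
    t = l2_normalize(targets)
    logits = q @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch, dim = 8, 32
image_emb = rng.normal(size=(batch, dim))        # stand-in for query-image embeddings
mod_text_emb = rng.normal(size=(batch, dim))     # stand-in for modification-text embeddings
proxy_target_emb = rng.normal(size=(batch, dim)) # stand-in for text embeddings used as proxy targets

# Hypothetical composition network: one linear map over the concatenated inputs.
W = rng.normal(scale=0.1, size=(2 * dim, dim))
composed = np.concatenate([image_emb, mod_text_emb], axis=1) @ W

loss_untrained = info_nce(composed, proxy_target_emb)
loss_perfect = info_nce(proxy_target_emb, proxy_target_emb)
print(loss_untrained, loss_perfect)
```

Training would adjust the composition network to push the first loss toward the second; no target image embedding appears anywhere in the loss.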

Title: Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics

Authors: Georgii Gotin, Ekaterina Shumitskaya, Anastasia Antsiferova, Dmitriy Vatolin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08415
Pdf URL: https://arxiv.org/pdf/2501.08415
Copy Paste: [[2501.08415]] Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics(https://arxiv.org/abs/2501.08415)
Keywords: quality assessment
Abstract: Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.

Title: Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Authors: Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.08453
Pdf URL: https://arxiv.org/pdf/2501.08453
Copy Paste: [[2501.08453]] Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models(https://arxiv.org/abs/2501.08453)
Keywords: generation
Abstract: We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.

Title: Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time

Authors: Mihai Masala, Marius Leordeanu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.08460
Pdf URL: https://arxiv.org/pdf/2501.08460
Copy Paste: [[2501.08460]] Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time(https://arxiv.org/abs/2501.08460)
Keywords: generation
Abstract: In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. BLEU, ROUGE) and the modern LLM-as-a-Jury approach.

Title: Time series forecasting for multidimensional telemetry data using GAN and BiLSTM in a Digital Twin

Authors: Joao Carmo de Almeida Neto, Claudio Miceli de Farias, Leandro Santiago de Araujo, Leopoldo Andre Dutra Lusquino Filho
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2501.08464
Pdf URL: https://arxiv.org/pdf/2501.08464
Copy Paste: [[2501.08464]] Time series forecasting for multidimensional telemetry data using GAN and BiLSTM in a Digital Twin(https://arxiv.org/abs/2501.08464)
Keywords: generation, generative
Abstract: Research on digital twins has been increasing in recent years. Besides mirroring the physical world in the digital one, there is a need to provide services related to the data collected and transferred to the virtual world. One of these services is forecasting the future behavior of physical parts, which could lead to applications such as preventing harmful events or designing improvements to get better performance. One strategy used to predict system operation is the use of time series models such as ARIMA or LSTM, and improvements have been implemented using these algorithms. Recently, deep learning techniques based on generative models such as Generative Adversarial Networks (GANs) have been proposed to create time series, and the use of LSTM has gained more relevance in time series forecasting, but both have limitations that restrict the forecasting results. Another issue found in the literature is the challenge of handling multivariate environments/applications in time series generation. Therefore, new methods need to be studied in order to fill these gaps and, consequently, provide better resources for creating useful digital twins. This proposal will study the integration of a BiLSTM layer with a time series obtained by a GAN in order to improve the forecasting accuracy for all the features provided by the dataset and, consequently, to improve behaviour prediction.

Title: Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition

Authors: Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08471
Pdf URL: https://arxiv.org/pdf/2501.08471
Copy Paste: [[2501.08471]] Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition(https://arxiv.org/abs/2501.08471)
Keywords: generative
Abstract: Human Activity Recognition (HAR) has gained significant importance with the growing use of sensor-equipped devices and large datasets. This paper evaluates the performance of three categories of models: classical machine learning, deep learning architectures, and Restricted Boltzmann Machines (RBMs) using five key benchmark datasets of HAR (UCI-HAR, OPPORTUNITY, PAMAP2, WISDM, and Berkeley MHAD). We assess various models, including Decision Trees, Random Forests, Convolutional Neural Networks (CNN), and Deep Belief Networks (DBNs), using metrics such as accuracy, precision, recall, and F1-score for a comprehensive comparison. The results show that CNN models offer superior performance across all datasets, especially on the Berkeley MHAD. Classical models like Random Forest do well on smaller datasets but face challenges with larger, more complex data. RBM-based models also show notable potential, particularly for feature learning. This paper offers a detailed comparison to help researchers choose the most suitable model for HAR tasks.

Title: Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images

Authors: Zhenyu Yu, Chee Seng Chan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.08505
Pdf URL: https://arxiv.org/pdf/2501.08505
Copy Paste: [[2501.08505]] Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images(https://arxiv.org/abs/2501.08505)
Keywords: generative
Abstract: Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce Yuan, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. Yuan uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention -- a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, Yuan demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore Yuan's potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.

Title: Score-based 3D molecule generation with neural fields

Authors: Matthieu Kirchmeyer, Pedro O. Pinheiro, Saeed Saremi
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2501.08508
Pdf URL: https://arxiv.org/pdf/2501.08508
Copy Paste: [[2501.08508]] Score-based 3D molecule generation with neural fields(https://arxiv.org/abs/2501.08508)
Keywords: generation
Abstract: We introduce a new representation for 3D molecules based on their continuous atomic density fields. Using this representation, we propose a new model based on walk-jump sampling for unconditional 3D molecule generation in the continuous space using neural fields. Our model, FuncMol, encodes molecular fields into latent codes using a conditional neural field, samples noisy codes from a Gaussian-smoothed distribution with Langevin MCMC (walk), denoises these samples in a single step (jump), and finally decodes them into molecular fields. FuncMol performs all-atom generation of 3D molecules without assumptions on the molecular structure and scales well with the size of molecules, unlike most approaches. Our method achieves competitive results on drug-like molecules and easily scales to macro-cyclic peptides, with at least one order of magnitude faster sampling. The code is available at this https URL.
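Sketch: The walk and jump steps named in the FuncMol abstract above can be illustrated on a toy problem. This is a minimal sketch under stated assumptions, not the paper's method: the "data" is a 1D Gaussian so the score of the Gaussian-smoothed density is known in closed form (the paper would use a learned denoiser over latent codes), the walk uses plain overdamped Langevin updates, and all constants (`MU`, `S`, `SIGMA`, step size) are hypothetical.

```python
import numpy as np

MU, S = 2.0, 0.5   # toy clean-data distribution: x ~ N(MU, S^2)
SIGMA = 1.0        # smoothing noise level: y = x + SIGMA * eps

def smoothed_score(y):
    # Score of the smoothed density, which for this toy case is N(MU, S^2 + SIGMA^2).
    return -(y - MU) / (S**2 + SIGMA**2)

def walk(y0, n_steps, step, rng):
    # Walk: unadjusted Langevin MCMC on the smoothed (noisy) density.
    y = y0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=y.shape)
        y = y + step * smoothed_score(y) + np.sqrt(2.0 * step) * noise
    return y

def jump(y):
    # Jump: single-step denoising via the empirical Bayes (Tweedie) estimate of E[x | y].
    return y + SIGMA**2 * smoothed_score(y)

rng = np.random.default_rng(0)
y = walk(rng.normal(size=5000), n_steps=500, step=0.05, rng=rng)
x_hat = jump(y)
print(y.mean(), y.std(), x_hat.mean(), x_hat.std())
```

The walk only ever samples the smoother, easier noisy density; the jump then maps each noisy sample to a denoised estimate in one step.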

Title: Multimodal Fake News Video Explanation Generation

Authors: Lizhi Chen, Zhong Qian, Peifeng Li, Qiaoming Zhu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2501.08514
Pdf URL: https://arxiv.org/pdf/2501.08514
Copy Paste: [[2501.08514]] Multimodal Fake News Video Explanation Generation(https://arxiv.org/abs/2501.08514)
Keywords: generation
Abstract: Multi-modal explanation involves assessing the veracity of a variety of content, and relies on multiple information modalities to comprehensively consider the relevance and consistency between modalities. Most existing fake news video detection methods focus on improving accuracy while ignoring the importance of providing explanations. In this paper, we propose a novel problem, Fake News Video Explanation (FNVE): given a multimodal news post containing both video and caption text, we aim to generate natural language explanations to reveal the truth of predictions. To this end, we develop FakeNVE, a new dataset of explanations for the truthfulness of multimodal posts, where each explanation is a natural language (English) sentence describing the attribution of a news thread. We benchmark FakeNVE by using a multimodal transformer-based architecture. Subsequently, a BART-based autoregressive decoder is used as the generator. Empirical results are compelling for various baselines (applicable to FNVE) across multiple evaluation metrics. We also perform human evaluation on explanation generation, achieving high scores for both adequacy and fluency.

Title: Comprehensive Subjective and Objective Evaluation Method for Text-generated Video

Authors: Zelu Qi, Ping Shi, Shuqi Wang, Zhaoyang Zhang, Zefeng Ying, Da Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08545
Pdf URL: https://arxiv.org/pdf/2501.08545
Copy Paste: [[2501.08545]] Comprehensive Subjective and Objective Evaluation Method for Text-generated Video(https://arxiv.org/abs/2501.08545)
Keywords: generation, quality assessment
Abstract: Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen3, Pika, and Sora, have significantly broadened its applicability and popularity. This progress has created a growing demand for accurate quality assessment metrics to evaluate the perceptual quality of text-generated videos and optimize video generation models. However, assessing the quality of text-generated videos remains challenging due to the presence of highly complex distortions, such as unnatural actions and phenomena that defy human cognition. To address these challenges, we constructed a large-scale benchmark dataset for Text-generated Video evaluation, T2VEval-Bench, comprising 148 textual words and 1,783 videos generated by 12 models. During the subjective evaluation, we collected five key scores: overall impression, video quality, aesthetic quality, realness, and text-video consistency. For objective evaluation, we developed the T2VEval model, which assesses videos across three branches: quality, authenticity, and consistency. Using an attention-based fusion module, T2VEval effectively integrates features from each branch and predicts scores with the aid of a large oracle model. Additionally, we implemented a progressive training strategy, enabling each branch to learn targeted knowledge while maintaining synergy with the others. Experimental results demonstrate that T2VEval achieves state-of-the-art performance across multiple metrics. The dataset and code will be open-sourced upon completion of the follow-up work.

Title: Molecular Graph Contrastive Learning with Line Graph

Authors: Xueyuan Chen, Shangzhe Li, Ruomei Liu, Bowen Shi, Jiaheng Liu, Junran Wu, Ke Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08589
Pdf URL: https://arxiv.org/pdf/2501.08589
Copy Paste: [[2501.08589]] Molecular Graph Contrastive Learning with Line Graph(https://arxiv.org/abs/2501.08589)
Keywords: generation
Abstract: Motivated by label scarcity in molecular property prediction and drug design, graph contrastive learning (GCL) has come to the fore. Leading contrastive learning works show two kinds of view generators, that is, random or learnable data corruption and domain knowledge incorporation. While effective, the two ways also lead to altered molecular semantics and limited generalization capability, respectively. To this end, we relate the LinE graph with MOlecular graph coNtrastive learning and propose a novel method termed LEMON. Specifically, by contrasting the given graph with the corresponding line graph, the graph encoder can freely encode the molecular semantics without omission. Furthermore, we present a new patch with edge attribute fusion and two local contrastive losses that enhance information transmission and tackle hard negative samples. Compared with state-of-the-art (SOTA) methods for view generation, superior performance on molecular property prediction suggests the effectiveness of our proposed framework.
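Sketch: The line graph that the LEMON abstract above contrasts with the molecular graph has one node per bond, with two nodes adjacent when the corresponding bonds share an atom. The following minimal sketch builds it from an edge list. The function name `line_graph` and the toy bond lists are illustrative assumptions; the paper's encoder, edge attribute fusion, and losses are not shown.

```python
from itertools import combinations

def line_graph(bonds):
    # Each bond (edge) of the molecular graph becomes a node of the line graph.
    nodes = [tuple(sorted(bond)) for bond in bonds]
    # Two line-graph nodes are connected when their bonds share an atom.
    edges = [(a, b) for a, b in combinations(nodes, 2) if set(a) & set(b)]
    return nodes, edges

# Heavy-atom skeleton of ethanol: C0 - C1 - O2 (atom indices are arbitrary labels).
chain_nodes, chain_edges = line_graph([(0, 1), (1, 2)])

# Star-shaped skeleton (e.g. isobutane): central atom 0 bonded to atoms 1, 2, 3.
star_nodes, star_edges = line_graph([(0, 1), (0, 2), (0, 3)])

print(chain_nodes, chain_edges)
print(star_nodes, star_edges)
```

Because every bond survives as a node, this view keeps the whole molecule, unlike views built by randomly dropping atoms or bonds.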

Title: Watermarking in Diffusion Model: Gaussian Shading with Exact Diffusion Inversion via Coupled Transformations (EDICT)

Authors: Krishna Panthi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08604
Pdf URL: https://arxiv.org/pdf/2501.08604
Copy Paste: [[2501.08604]] Watermarking in Diffusion Model: Gaussian Shading with Exact Diffusion Inversion via Coupled Transformations (EDICT)(https://arxiv.org/abs/2501.08604)
Keywords: generation
Abstract: This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT's ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.

Title: RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.08617
Pdf URL: https://arxiv.org/pdf/2501.08617
Copy Paste: [[2501.08617]] RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation(https://arxiv.org/abs/2501.08617)
Keywords: generative
Abstract: Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods -- Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) -- and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.

Title: CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting

Authors: Menghao Huo, Kuan Lu, Yuxiao Li, Qiang Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08620
Pdf URL: https://arxiv.org/pdf/2501.08620
Copy Paste: [[2501.08620]] CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting(https://arxiv.org/abs/2501.08620)
Keywords: generation
Abstract: Accurately predicting renewable energy output is crucial for the efficient integration of solar and wind power into modern energy systems. This study develops and evaluates an advanced deep learning model, Channel-Time Patch Time-Series Transformer (CT-PatchTST), to forecast the power output of photovoltaic and wind energy systems using annual offshore wind power, onshore wind power, and solar power generation data from Denmark. While the original Patch Time-Series Transformer (PatchTST) model employs a channel-independent (CI) approach, it tends to overlook inter-channel relationships during training, potentially leading to a loss of critical information. To address this limitation and further leverage the benefits of increased data granularity brought by CI, we propose CT-PatchTST. This enhanced model improves the processing of inter-channel information while maintaining the advantages of the channel-independent approach. The predictive performance of CT-PatchTST is rigorously analyzed, demonstrating its ability to provide precise and reliable energy forecasts. This work contributes to improving the predictability of renewable energy systems, supporting their broader adoption and integration into energy grids.

Title: SWSC: Shared Weight for Similar Channel in LLM

Authors: Binrui Zeng, Yongtao Tang, Xiaodong Liu, Xiaopeng Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2501.08631
Pdf URL: https://arxiv.org/pdf/2501.08631
Copy Paste: [[2501.08631]] SWSC: Shared Weight for Similar Channel in LLM(https://arxiv.org/abs/2501.08631)
Keywords: restoration
Abstract: Large language models (LLMs) have spurred development in multiple industries. However, the growing number of their parameters brings substantial storage and computing burdens, making it essential to explore model compression techniques for parameter reduction and easier deployment. We propose SWSC, an LLM compression method based on the concept of Shared Weight for Similar Channel. It uses the K-Means clustering algorithm to cluster model weights channel-by-channel, generating clusters with highly similar vectors within each. A representative vector from each cluster is selected to approximately replace all vectors in the cluster, significantly reducing the number of model weight parameters. However, approximate restoration will inevitably cause damage to the performance of the model. To tackle this issue, we perform singular value decomposition on the weight error values before and after compression and retain the larger singular values and their corresponding singular vectors to compensate for the accuracy. The experimental results show that our method can effectively ensure the performance of the compressed LLM even under low-precision conditions.
摘要：大型语言模型（LLM）在多个行业中得到了广泛的应用。然而，其参数数量的不断增长带来了巨大的存储和计算负担，因此探索模型压缩技术以减少参数并简化部署至关重要。我们提出了一种基于相似通道共享权重概念的LLM压缩方法SWSC。它使用K均值聚类算法对模型权重逐通道进行聚类，生成每个聚类中向量高度相似的聚类。从每个聚类中选择一个代表向量来近似替换聚类中的所有向量，从而大大减少了模型权重参数的数量。然而，近似恢复不可避免地会对模型的性能造成损害。为了解决这个问题，我们对压缩前后的权重误差值进行奇异值分解，并保留较大的奇异值及其对应的奇异向量来补偿精度。实验结果表明，即使在低精度条件下，我们的方法也能有效保证压缩LLM的性能。
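
The two steps described in the abstract (cluster similar channels and share one vector per cluster, then compensate the error with a truncated SVD) can be sketched as follows. This is a rough sketch under stated assumptions, not the authors' code: rows of the weight matrix are treated as "channels", the cluster mean is used as the representative vector, and the small `kmeans` helper, `compress`, `k`, and `rank` are all hypothetical names and values.

```python
import numpy as np

def kmeans(rows, k, iters=20, seed=0):
    """Plain Lloyd's k-means over the rows of a matrix (illustrative helper)."""
    rng = np.random.default_rng(seed)
    centers = rows[rng.choice(len(rows), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = rows[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    dist = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dist.argmin(axis=1)

def compress(W, k, rank):
    """Return (shared-weight approximation, SVD-compensated approximation)."""
    centers, labels = kmeans(W, k)
    W_shared = centers[labels]            # every channel replaced by its cluster vector
    U, s, Vt = np.linalg.svd(W - W_shared, full_matrices=False)
    correction = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # largest singular components
    return W_shared, W_shared + correction

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 16))             # stand-in for one weight matrix
W_shared, W_comp = compress(W, k=4, rank=4)
```

Storing `k` cluster vectors, one label per channel, and the rank-`rank` factors takes fewer numbers than the full matrix when `k` and `rank` are small. The truncated SVD can never increase the Frobenius error of the shared-weight approximation.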

Title: Joint Learning of Depth and Appearance for Portrait Image Animation

Authors: Xinya Ji, Gaspard Zoss, Prashanth Chandran, Lingchen Yang, Xun Cao, Barbara Solenthaler, Derek Bradley
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.08649
Pdf URL: https://arxiv.org/pdf/2501.08649
Copy Paste: [[2501.08649]] Joint Learning of Depth and Appearance for Portrait Image Animation(https://arxiv.org/abs/2501.08649)
Keywords: generation, generative
Abstract: 2D portrait animation has experienced significant advancements in recent years. Much research has utilized the prior knowledge embedded in large generative diffusion models to enhance high-quality image manipulation. However, most methods only focus on generating RGB images as output, and the co-generation of consistent visual plus 3D output remains largely under-explored. In our work, we propose to jointly learn the visual appearance and depth simultaneously in a diffusion-based portrait image generator. Our method embraces the end-to-end diffusion paradigm and introduces a new architecture suitable for learning this conditional joint distribution, consisting of a reference network and a channel-expanded diffusion backbone. Once trained, our framework can be efficiently adapted to various downstream applications, such as facial depth-to-image and image-to-depth generation, portrait relighting, and audio-driven talking head animation with consistent 3D output.
摘要：近年来，2D 肖像动画取得了重大进展。许多研究利用大型生成扩散模型中嵌入的先验知识来增强高质量图像处理。然而，大多数方法仅侧重于生成 RGB 图像作为输出，而一致的视觉和 3D 输出的共同生成仍然在很大程度上尚未得到充分探索。在我们的工作中，我们建议在基于扩散的肖像图像生成器中同时联合学习视觉外观和深度。我们的方法采用端到端扩散范式，并引入了一种适合学习这种条件联合分布的新架构，由参考网络和通道扩展扩散主干组成。经过训练后，我们的框架可以有效地适应各种下游应用，例如面部深度到图像和图像到深度生成、肖像重新照明以及具有一致 3D 输出的音频驱动的说话头部动画。

Title: StereoGen: High-quality Stereo Image Generation from a Single Image

Authors: Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08654
Pdf URL: https://arxiv.org/pdf/2501.08654
Copy Paste: [[2501.08654]] StereoGen: High-quality Stereo Image Generation from a Single Image(https://arxiv.org/abs/2501.08654)
Keywords: generation
Abstract: State-of-the-art supervised stereo matching methods have achieved amazing results on various benchmarks. However, these data-driven methods generalize poorly to real-world scenarios due to the lack of real-world annotated data. In this paper, we propose StereoGen, a novel pipeline for high-quality stereo image generation. This pipeline utilizes arbitrary single images as left images and pseudo disparities generated by a monocular depth estimation model to synthesize high-quality corresponding right images. Unlike previous methods that fill the occluded area in warped right images using random backgrounds or using convolutions to take nearby pixels selectively, we fine-tune a diffusion inpainting model to recover the background. Images generated by our model possess better details and undamaged semantic structures. Besides, we propose Training-free Confidence Generation and Adaptive Disparity Selection. The former suppresses the negative effect of harmful pseudo ground truth during stereo training, while the latter helps generate a wider disparity distribution and better synthetic images. Experiments show that models trained under our pipeline achieve state-of-the-art zero-shot generalization results among all published methods. The code will be available upon publication of the paper.
摘要：最先进的监督立体匹配方法在各种基准上都取得了惊人的效果。然而，由于缺乏现实世界的注释数据，这些数据驱动的方法无法推广到现实世界的场景。在本文中，我们提出了一种用于高质量立体图像生成的新型管道 StereoGen。该管道利用任意单幅图像作为左图，利用单目深度估计模型生成的伪视差来合成高质量的对应右图。与以前使用随机背景填充扭曲的右图中的遮挡区域或使用卷积有选择地取附近像素的方法不同，我们微调了一个扩散修复模型来恢复背景。我们的模型生成的图像具有更好的细节和未受损的语义结构。此外，我们提出了无训练置信度生成和自适应视差选择。前者抑制了立体训练过程中有害伪地面实况的负面影响，而后者有助于生成更宽的视差分布和更好的合成图像。实验表明，在我们的流程下训练的模型在所有已发布的方法中实现了最先进的零样本泛化结果。代码将在论文发布后提供。

Title: Investigating Parameter-Efficiency of Hybrid QuGANs Based on Geometric Properties of Generated Sea Route Graphs

Authors: Tobias Rohe, Florian Burger, Michael Kölle, Sebastian Wölckert, Maximilian Zorn, Claudia Linnhoff-Popien
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2501.08678
Pdf URL: https://arxiv.org/pdf/2501.08678
Copy Paste: [[2501.08678]] Investigating Parameter-Efficiency of Hybrid QuGANs Based on Geometric Properties of Generated Sea Route Graphs(https://arxiv.org/abs/2501.08678)
Keywords: generation, generative
Abstract: The demand for artificially generated data for the development, training and testing of new algorithms is omnipresent. Quantum computing (QC) offers the hope that its inherent probabilistic functionality can be utilised in this field of generative artificial intelligence. In this study, we use quantum-classical hybrid generative adversarial networks (QuGANs) to artificially generate graphs of shipping routes. We create a training dataset based on real shipping data and investigate to what extent QuGANs are able to learn and reproduce inherent distributions and geometric features of this data. We compare hybrid QuGANs with classical Generative Adversarial Networks (GANs), with a special focus on their parameter efficiency. Our results indicate that QuGANs are indeed able to quickly learn and represent underlying geometric properties and distributions, although they seem to have difficulties in introducing variance into the sampled data. Compared to classical GANs of greater size, measured in the number of parameters used, some QuGANs show similar result quality. Our reference to concrete use cases, such as the generation of shipping data, provides an illustrative example and demonstrates the potential and diversity of ways in which QC can be used.
摘要：用于开发、训练和测试新算法的人工生成数据的需求无处不在。量子计算 (QC) 确实带来了希望，即其固有的概率功能可以应用于生成人工智能领域。在本研究中，我们使用量子-经典混合生成对抗网络 (QuGAN) 来人工生成航运路线图。我们根据真实航运数据创建了一个训练数据集，并研究 QuGAN 能够在多大程度上学习和重现这些数据的固有分布和几何特征。我们将混合 QuGAN 与经典生成对抗网络 (GAN) 进行了比较，特别关注它们的参数效率。我们的结果表明，QuGAN 确实能够快速学习和表示底层几何属性和分布，尽管它们似乎难以将方差引入采样数据。与规模更大的经典 GAN 相比（以使用的参数数量来衡量），一些 QuGAN 显示出相似的结果质量。我们参考了具体的用例，例如航运数据的生成，提供了一个说明性示例，并展示了 QC 的用途的潜力和多样性。

Title: Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

Authors: Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08727
Pdf URL: https://arxiv.org/pdf/2501.08727
Copy Paste: [[2501.08727]] Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models(https://arxiv.org/abs/2501.08727)
Keywords: generation
Abstract: Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method can achieve better performance and parameter efficiency compared to LoRA and several baselines.
摘要：文本到图像模型的参数高效微调 (PEFT) 已成为一种越来越流行的技术，具有许多应用。在各种 PEFT 方法中，低秩自适应 (LoRA) 及其变体因其有效性而备受关注，使用户能够使用有限的计算资源对模型进行微调。然而，低秩假设和所需微调权重之间的近似差距阻碍了同时获得超参数效率和更好的性能。为了缩小这一差距并进一步提高 LoRA 的功能，我们提出了一种新的 PEFT 方法，该方法结合了两类自适应，即变换和残差自适应。具体而言，我们首先对预训练权重应用全秩和密集变换。这种可学习的变换有望使预训练权重尽可能接近所需权重，从而降低残差权重的秩。然后，残差部分可以通过更紧凑和参数高效的结构有效地近似，近似误差更小。为了在实践中实现超参数效率，我们为变换和残差自适应设计了高度灵活且有效的张量分解。此外，流行的 PEFT 方法（例如 DoRA）可以归纳为这种变换加残差自适应方案。在主题驱动和可控生成中对微调稳定扩散模型进行了实验。结果表明，与 LoRA 和几个基线相比，我们的方法可以实现更好的性能和参数效率。
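
The "transform plus residual" scheme in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the abstract does not say on which side the transform is applied or which tensor decompositions are used, so the sketch applies the transform on the right and uses a Kronecker product as one simple example of a dense, compactly parameterized transform. The names `kron_transform` and `adapted_weight` are hypothetical.

```python
import numpy as np

def kron_transform(T1, T2):
    """Dense (a*b, a*b) transform built from two small factors (illustrative choice)."""
    return np.kron(T1, T2)

def adapted_weight(W0, T, A, B):
    """Transformed pre-trained weight plus a low-rank residual: W0 @ T + A @ B."""
    return W0 @ T + A @ B

rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4))                     # stand-in for a frozen pre-trained weight
T = kron_transform(np.eye(2), np.eye(2))         # identity transform at initialization
A = rng.normal(size=(6, 2))
B = np.zeros((2, 4))                             # zero residual at initialization
W_init = adapted_weight(W0, T, A, B)             # equals W0 before any training

B_trained = rng.normal(size=(2, 4))              # pretend the residual was trained
delta = adapted_weight(W0, T, A, B_trained) - W0
```

With an identity transform and a zero residual the adapted weight equals the pre-trained one, which is the usual starting point for adapter-style fine-tuning. In this toy setup the Kronecker factors hold 8 numbers instead of the 16 in a full 4x4 transform.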

Title: Few-Shot Learner Generalizes Across AI-Generated Image Detection

Authors: Shiyu Wu, Jing Liu, Jing Li, Yequan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08763
Pdf URL: https://arxiv.org/pdf/2501.08763
Copy Paste: [[2501.08763]] Few-Shot Learner Generalizes Across AI-Generated Image Detection(https://arxiv.org/abs/2501.08763)
Keywords: generative
Abstract: Current fake image detectors trained on large synthetic image datasets perform satisfactorily on the limited set of generative models studied. However, they suffer a notable performance decline on unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images by utilizing very few samples. Experiments show FSD achieves state-of-the-art performance, with a $+7.4\%$ gain in average ACC on the GenImage dataset. More importantly, our method is better capable of capturing the intra-category common features in unseen images without further training.
摘要：目前，在大型合成图像数据集上训练的假图像检测器在有限的研究生成模型上表现令人满意。然而，与看不见的模型相比，它们的性能明显下降。此外，从在线生成模型收集足够的训练数据通常成本高昂或不可行。为了克服这些问题，我们提出了少样本检测器 (FSD)，这是一种新型的人工智能生成图像检测器，它学习一个专门的度量空间，利用极少的样本有效区分看不见的假图像。实验表明，FSD 在 GenImage 数据集上的平均 ACC 达到了最佳性能，为 $+7.4\%$。更重要的是，我们的方法能够更好地捕捉看不见的图像中的类别内共同特征，而无需进一步训练。

Title: Deep learning for temporal super-resolution 4D Flow MRI

Authors: Pia Callmer, Mia Bonini, Edward Ferdian, David Nordsletten, Daniel Giese, Alistair A. Young, Alexander Fyrdahl, David Marlevi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08780
Pdf URL: https://arxiv.org/pdf/2501.08780
Copy Paste: [[2501.08780]] Deep learning for temporal super-resolution 4D Flow MRI(https://arxiv.org/abs/2501.08780)
Keywords: super-resolution
Abstract: 4D Flow Magnetic Resonance Imaging (4D Flow MRI) is a non-invasive technique for volumetric, time-resolved blood flow quantification. However, apparent trade-offs between acquisition time, image noise, and resolution limit clinical applicability. In particular, in regions of highly transient flow, coarse temporal resolution can hinder accurate capture of physiologically relevant flow variations. To overcome these issues, post-processing techniques using deep learning have shown promising results to enhance resolution post-scan using so-called super-resolution networks. However, while super-resolution has been focusing on spatial upsampling, temporal super-resolution remains largely unexplored. The aim of this study was therefore to implement and evaluate a residual network for temporal super-resolution 4D Flow MRI. To achieve this, an existing spatial network (4DFlowNet) was re-designed for temporal upsampling, adapting input dimensions, and optimizing internal layer structures. Training and testing were performed using synthetic 4D Flow MRI data originating from patient-specific in-silico models, as well as using in-vivo datasets. Overall, excellent performance was achieved with input velocities effectively denoised and temporally upsampled, with a mean absolute error (MAE) of 1.0 cm/s in an unseen in-silico setting, outperforming deterministic alternatives (linear interpolation MAE = 2.3 cm/s, sinc interpolation MAE = 2.6 cm/s). Further, the network synthesized high-resolution temporal information from unseen low-resolution in-vivo data, with strong correlation observed at peak flow frames. As such, our results highlight the potential of utilizing data-driven neural networks for temporal super-resolution 4D Flow MRI, enabling high-frame-rate flow quantification without extending acquisition times beyond clinically acceptable limits.
摘要：4D 流磁共振成像 (4D Flow MRI) 是一种非侵入性技术，用于体积、时间分辨的血流量化。然而，采集时间、图像噪声和分辨率之间的明显权衡限制了临床适用性。特别是在高度瞬态流动的区域，粗糙的时间分辨率会阻碍准确捕捉生理相关的流动变化。为了克服这些问题，使用深度学习的后处理技术已显示出有希望的结果，可以使用所谓的超分辨率网络来提高扫描后的分辨率。然而，虽然超分辨率一直专注于空间上采样，但时间超分辨率仍然在很大程度上尚未被探索。因此，本研究的目的是实现和评估时间超分辨率 4D 流 MRI 的残差网络。为了实现这一点，重新设计了现有的空间网络 (4DFlowNet)，以实现时间上采样、调整输入维度和优化内部层结构。使用来自患者特定的计算机模型的合成 4D 流 MRI 数据以及体内数据集进行训练和测试。总体而言，在输入速度有效去噪和时间上采样的情况下，取得了出色的性能，在看不见的计算机模拟设置中平均绝对误差 (MAE) 为 1.0 cm/s，优于确定性替代方案（线性插值 MAE = 2.3 cm/s，sinc 插值 MAE = 2.6 cm/s）。此外，该网络从未见过的低分辨率体内数据中合成了高分辨率时间信息，在峰值流量帧处观察到了强相关性。因此，我们的结果凸显了利用数据驱动的神经网络进行时间超分辨率 4D 流量 MRI 的潜力，从而实现高帧率流量量化，而无需将采集时间延长到临床可接受的限度之外。
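
The deterministic baseline named in the abstract (linear interpolation in time, scored by mean absolute error) can be sketched as follows. This is only a sketch of that baseline, not of the residual network: the function names, the upsampling factor, and the toy velocity values are assumptions.

```python
import numpy as np

def upsample_linear(frames, factor):
    """Linearly interpolate along the time axis (axis 0) by an integer factor.

    frames has shape (time, ...) with at least two time frames; the output has
    (time - 1) * factor + 1 frames and keeps the original frames unchanged.
    """
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[0]
    t_high = np.arange((n - 1) * factor + 1) / factor   # new time stamps in frame units
    idx = np.minimum(np.floor(t_high).astype(int), n - 2)
    w = (t_high - idx).reshape(-1, *([1] * (frames.ndim - 1)))
    return (1 - w) * frames[idx] + w * frames[idx + 1]

def mean_absolute_error(pred, target):
    """Mean absolute error over all elements, in the units of the input (e.g. cm/s)."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

# Toy example: 3 low-rate frames of a 2-voxel velocity field, upsampled 2x.
low = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]])
high = upsample_linear(low, factor=2)
truth = np.array([[0, 0], [1, 2], [2, 4], [3, 6], [4, 8]], dtype=float)
mae = mean_absolute_error(high, truth)
```

Because the toy signal is itself linear in time, the interpolation error here is zero. On highly transient flow a linear baseline misses peaks between frames, which is the gap a learned temporal super-resolution model aims to close.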

Title: Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving

Authors: Tengpeng Li, Hanli Wang, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08861
Pdf URL: https://arxiv.org/pdf/2501.08861
Copy Paste: [[2501.08861]] Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving(https://arxiv.org/abs/2501.08861)
Keywords: generative
Abstract: Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods are still facing the challenges of vision understanding, decision reasoning and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems. Code is available at this https URL
摘要：自动驾驶是一项具有挑战性的任务，需要感知和理解周围环境以进行安全的轨迹规划。虽然现有的基于视觉的端到端模型已经取得了有希望的结果，但这些方法仍然面临着视觉理解、决策推理和场景泛化的挑战。为了解决这些问题，提出了一种用于端到端自动驾驶的具有 3D 视觉语言预训练模型（GPVL）的生成规划。提出的范式有两个重要方面。一方面，设计了一个 3D 视觉语言预训练模块来弥合鸟瞰图中视觉感知和语言理解之间的差距。另一方面，引入了跨模态语言模型，以自回归的方式生成具有感知和导航信息的整体驾驶决策和细粒度轨迹。在具有挑战性的 nuScenes 数据集上的实验表明，与最先进的方法相比，所提出的方案取得了优异的性能。此外，所提出的 GPVL 在处理各种场景中的高级命令时表现出强大的泛化能力和实时潜力。人们认为，GPVL 的有效、稳健和高效性能对于未来自动驾驶系统的实际应用至关重要。代码可在此 https URL 上找到

Title: ARMOR: Shielding Unlearnable Examples against Data Augmentation

Authors: Xueluan Gong, Yuji Wang, Yanjiao Chen, Haocheng Dong, Yiming Li, Mengyuan Sun, Shuaike Li, Qian Wang, Chen Chen
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2501.08862
Pdf URL: https://arxiv.org/pdf/2501.08862
Copy Paste: [[2501.08862]] ARMOR: Shielding Unlearnable Examples against Data Augmentation(https://arxiv.org/abs/2501.08862)
Keywords: generation
Abstract: Private data, when published online, may be collected by unauthorized parties to train deep neural networks (DNNs). To protect privacy, defensive noises can be added to original samples to degrade their learnability by DNNs. Recently, unlearnable examples have been proposed to minimize the training loss such that the model learns almost nothing. However, raw data are often pre-processed before being used for training, which may restore the private information of protected data. In this paper, we reveal the data privacy violation induced by data augmentation, a commonly used data pre-processing technique for improving model generalization capability; to the best of our knowledge, this is the first such finding. We demonstrate that data augmentation can significantly raise the accuracy of the model trained on unlearnable examples from 21.3% to 66.1%. To address this issue, we propose a defense framework, dubbed ARMOR, to protect data privacy from potential breaches of data augmentation. To overcome the difficulty of having no access to the model training process, we design a non-local module-assisted surrogate model that better captures the effect of data augmentation. In addition, we design a surrogate augmentation selection strategy that maximizes distribution alignment between augmented and non-augmented samples, to choose the optimal augmentation strategy for each class. We also use a dynamic step size adjustment algorithm to enhance the defensive noise generation process. Extensive experiments are conducted on 4 datasets and 5 data augmentation methods to verify the performance of ARMOR. Comparisons with 6 state-of-the-art defense methods have demonstrated that ARMOR can preserve the unlearnability of protected private data under data augmentation. ARMOR reduces the test accuracy of the model trained on augmented protected samples by as much as 60% more than baselines.
摘要：私人数据在网上发布后，可能会被未经授权的各方收集来训练深度神经网络 (DNN)。为了保护隐私，可以在原始样本中添加防御性噪声，以降低 DNN 的可学习性。最近，提出了不可学习的示例来最小化训练损失，使得模型几乎不学习任何东西。然而，原始数据通常在用于训练之前进行预处理，这可能会恢复受保护数据的私人信息。在本文中，我们揭示了数据增强引起的数据隐私侵犯，数据增强是一种常用的数据预处理技术，用于提高模型的泛化能力，就我们而言，这是同类技术中的首创。我们证明，数据增强可以显著提高在不可学习的示例上训练的模型的准确率，从 21.3% 提高到 66.1%。为了解决这个问题，我们提出了一个防御框架，称为 ARMOR，以保护数据隐私免受数据增强的潜在侵犯。为了克服无法访问模型训练过程的困难，我们设计了一个非本地模块辅助代理模型，可以更好地捕捉数据增强的效果。此外，我们设计了一种替代增强选择策略，该策略最大化增强样本和非增强样本之间的分布一致，从而为每个类选择最优的增强策略。我们还使用动态步长调整算法来增强防御噪声生成过程。在 4 个数据集和 5 种数据增强方法上进行了广泛的实验来验证 ARMOR 的性能。与 6 种最先进的防御方法的比较表明，ARMOR 可以在数据增强下保持受保护的私人数据的不可学习性。ARMOR 使在增强保护样本上训练的模型的测试准确率比基线降低了多达 60%。

Title: Enhanced Multi-Scale Cross-Attention for Person Image Generation

Authors: Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08900
Pdf URL: https://arxiv.org/pdf/2501.08900
Copy Paste: [[2501.08900]] Enhanced Multi-Scale Cross-Attention for Person Image Generation(https://arxiv.org/abs/2501.08900)
Keywords: generation, generative
Abstract: In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
摘要：在本文中，我们针对具有挑战性的人物图像生成任务提出了一种基于交叉注意的新型生成对抗网络 (GAN)。交叉注意是一种新颖且直观的多模态融合方法，其中计算不同模态的两个特征图之间的注意/相关矩阵。具体来说，我们提出了新颖的 XingGAN（或 CrossingGAN），它由两个分别捕捉人物外观和形状的生成分支组成。此外，我们提出了两个新颖的交叉注意块，以有效地传输和更新人物的形状和外观嵌入以实现相互改进。任何其他现有的基于 GAN 的图像生成工作都没有考虑到这一点。为了进一步了解不同尺度和子区域的不同人物姿势之间的长距离相关性，我们提出了两个新颖的多尺度交叉注意块。为了解决交叉注意机制中独立的相关性计算导致注意力权重嘈杂和模糊的问题，从而阻碍性能改进，我们提出了一个名为增强注意力 (EA) 的模块。最后，我们引入了一种新颖的密集连接共同注意模块，以有效地融合不同阶段的外观和形状特征。在两个公共数据集上进行的大量实验表明，所提出的方法优于当前基于 GAN 的方法，并且与基于扩散的方法性能相当。但是，我们的方法在训练和推理方面都比基于扩散的方法快得多。

Title: CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation

Authors: Qi Ma, Runyi Yang, Bin Ren, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08982
Pdf URL: https://arxiv.org/pdf/2501.08982
Copy Paste: [[2501.08982]] CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation(https://arxiv.org/abs/2501.08982)
Keywords: generation
Abstract: Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task. This ambiguity nonetheless arises when describing general concepts, e.g., all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of a distribution is required. In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using pre-trained text encoders. The connection between text descriptions and pose distribution is established through a pretrained vision-language model, i.e., CLIP. Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available.
摘要：在大型 3D 场景中定位文本描述本质上是一项模棱两可的任务。然而，在描述一般概念（例如城市中的所有交通信号灯）时，这种情况仍然会出现。为了便于基于此类概念进行推理，需要以分布的形式进行文本定位。在本文中，我们根据文本描述生成相机姿势的分布。为了促进这种生成，我们提出了一种基于扩散的架构，该架构有条件地将嘈杂的 6DoF 相机姿势扩散到其合理位置。条件信号是使用预训练的文本编码器从文本描述中得出的。文本描述和姿势分布之间的联系是通过预训练的视觉语言模型（即 CLIP）建立的。此外，我们证明可以通过使用 3D 高斯分层渲染潜在姿势来进一步细化分布的候选姿势，通过视觉推理将姿势不正确的样本引导到更符合文本描述的位置。我们通过将我们的方法与标准检索方法和基于学习的方法进行比较来证明其有效性。我们提出的方法在所有五个大型数据集上的表现始终优于这些基线。我们的源代码和数据集将公开发布。

Title: CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

Authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08983
Pdf URL: https://arxiv.org/pdf/2501.08983
Copy Paste: [[2501.08983]] CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities(https://arxiv.org/abs/2501.08983)
Keywords: generation, generative
Abstract: 3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.
摘要：近年来，3D 场景生成越来越受到关注，并取得了重大进展。由于存在结构复杂、视觉多样的物体（如建筑物和车辆），以及人类对城市环境中扭曲的高度敏感性，因此生成 4D 城市比 3D 场景更具挑战性。为了解决这些问题，我们提出了 CityDreamer4D，这是一种专门用于生成无界 4D 城市的组合生成模型。我们的主要见解是 1) 4D 城市生成应将动态物体（例如车辆）与静态场景（例如建筑物和道路）分开，2) 4D 场景中的所有物体都应由不同类型的建筑物、车辆和背景物质的神经场组成。具体而言，我们提出了交通场景生成器和无界布局生成器，使用高度紧凑的 BEV 表示来生成动态交通场景和静态城市布局。4D 城市中的对象是通过结合面向物质和面向实例的背景物质、建筑物和车辆神经场生成的。为了适应背景内容和实例的独特特征，神经场采用定制的生成哈希网格和周期性位置嵌入作为场景参数化。此外，我们还提供了一套全面的城市生成数据集，包括 OSM、GoogleEarth 和 CityTopia。OSM 数据集提供了各种现实世界的城市布局，而 Google Earth 和 CityTopia 数据集则提供了大规模、高质量的城市图像，并附带 3D 实例注释。CityDreamer4D 利用其组合设计，支持一系列下游应用，例如实例编辑、城市风格化和城市模拟，同时在生成逼真的 4D 城市方面提供一流的性能。

Title: RepVideo: Rethinking Cross-Layer Representation for Video Generation

Authors: Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08994
Pdf URL: https://arxiv.org/pdf/2501.08994
Copy Paste: [[2501.08994]] RepVideo: Rethinking Cross-Layer Representation for Video Generation(https://arxiv.org/abs/2501.08994)
Keywords: generation
Abstract: Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
摘要：随着扩散模型的引入，视频生成取得了显著的进步，生成的视频质量得到了显著提高。然而，最近的研究主要集中在扩大模型训练规模上，而对表征对视频生成过程的直接影响的见解有限。在本文中，我们首先研究了中间层特征的特性，发现不同层之间的注意力图存在很大差异。这些变化导致语义表征不稳定，并导致特征之间的累积差异，最终降低相邻帧之间的相似性并对时间连贯性产生负面影响。为了解决这个问题，我们提出了 RepVideo，一种用于文本到视频扩散模型的增强表征框架。通过积累来自相邻层的特征以形成丰富的表征，这种方法可以捕获更稳定的语义信息。然后将这些增强的表征用作注意力机制的输入，从而提高语义表达能力，同时确保相邻帧之间的特征一致性。大量实验表明，我们的 RepVideo 不仅显著增强了生成准确空间外观的能力，例如捕捉多个对象之间的复杂空间关系，而且还提高了视频生成中的时间一致性。

Title: VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science

Authors: Youssef Abdalla, Marrisa Taub, Eleanor Hilton, Priya Akkaraju, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T Cook, Tapabrata Chakraborty, David Shorthouse
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08995
Pdf URL: https://arxiv.org/pdf/2501.08995
Copy Paste: [[2501.08995]] VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science(https://arxiv.org/abs/2501.08995)
Keywords: generative
Abstract: Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial and error approaches for development rather than data driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state of the art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, which is an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT GAN pretrained on ChEMBL available as a pip package.
摘要：制药研究中的数据稀缺导致人们依赖于劳动密集型的反复试验方法进行开发，而不是数据驱动方法。虽然机器学习提供了一种解决方案，但现有数据集通常很小且噪声很大，限制了它们的实用性。为了解决这个问题，我们开发了变分编码条件表格生成对抗网络 (VECT GAN)，这是一种专为增强小型、噪声数据集而设计的新型生成模型。我们引入了一个在回归模型开发之前增强数据的流程，并证明这比其他最先进的表格生成模型持续且显著地提高了性能。我们将这个流程应用于六个制药数据集，并通过开发具有医学上理想的粘膜粘附特性的新型聚合物来强调其在现实世界中的适用性，这些聚合物是我们制造并通过实验表征的。此外，我们在类药物分子的 ChEMBL 数据库上对模型进行了预训练，利用知识提炼来增强其通用性，使其可随时用于包含小分子的制药数据集，这是一项极为常见的制药任务。我们展示了合成数据对于规范小型表格数据集的强大功能，强调了其成为药物模型开发标准实践的潜力，并使我们的方法（包括在 ChEMBL 上预训练的 VECT GAN）作为 pip 包提供。

Title: SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation

Authors: Aditya Bhat, Rupak Bose, Chinedu Innocent Nwoye, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09008
Pdf URL: https://arxiv.org/pdf/2501.09008
Copy Paste: [[2501.09008]] SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation(https://arxiv.org/abs/2501.09008)
Keywords: generation, generative
Abstract: Acquiring and annotating surgical data is often resource-intensive, ethically constrained, and requires significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest that generated image-mask pairs are usable if regulations limit human data release for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.
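Summary: Acquiring and annotating surgical data typically demands substantial resources, is subject to ethical constraints, and requires extensive expert involvement. Although generative AI models such as text-to-image can ease data scarcity, incorporating spatial annotations (for example, segmentation masks) is essential for precision-driven surgical applications, simulation, and education. This study introduces a new task and method, SimGen, for generating images and masks simultaneously. SimGen is a diffusion model based on the DDPM framework and a Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model uses cross-correlation priors to capture dependencies between the continuous image distribution and the discrete mask distribution. In addition, a Canonical Fibonacci Lattice (CFL) is used to improve class separability and uniformity in the RGB space of the masks. SimGen produces high-fidelity images and accurate segmentation masks and outperforms baselines on six public datasets evaluated with image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest that the generated image-mask pairs are usable when regulations restrict the release of human data for research. This work provides a cost-effective way to generate paired surgical images and complex labels, advancing surgical AI by reducing the need for expensive manual annotation.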
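
The SimGen abstract says a Canonical Fibonacci Lattice is used to separate classes in the RGB space of the masks but does not give the construction. The following is only an illustrative sketch, assuming the standard Fibonacci sphere lattice rescaled into the RGB cube; the function names are hypothetical and this is not the authors' implementation.

```python
import math


def fibonacci_lattice_colors(num_classes):
    """Place num_classes points on a Fibonacci sphere lattice and map them to RGB.

    Assumption: one color per class, taken from the unit sphere and rescaled
    from [-1, 1] to [0, 255] per channel. SimGen's actual mapping may differ.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    colors = []
    for i in range(num_classes):
        # z is spread evenly over (-1, 1); the angle advances by the golden angle.
        z = 1.0 - (2.0 * i + 1.0) / num_classes
        radius = math.sqrt(1.0 - z * z)
        theta = golden_angle * i
        x, y = radius * math.cos(theta), radius * math.sin(theta)
        colors.append(tuple(int(round((c + 1.0) * 127.5)) for c in (x, y, z)))
    return colors


def min_pairwise_distance(colors):
    """Smallest Euclidean distance between any two colors (larger is more separable)."""
    return min(
        math.dist(a, b)
        for idx, a in enumerate(colors)
        for b in colors[idx + 1:]
    )


palette = fibonacci_lattice_colors(8)
print(palette)
print(min_pairwise_distance(palette))
```

Spreading class colors roughly evenly keeps the minimum pairwise distance large, which is one plausible reading of "class separability and uniformity" when generated mask pixels must be decoded back to discrete labels.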

Title: Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

Authors: Ruixiang Jiang, Changwen Chen
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2501.09012
Pdf URL: https://arxiv.org/pdf/2501.09012
Copy Paste: [[2501.09012]] Multimodal LLMs Can Reason about Aesthetics in Zero-Shot(https://arxiv.org/abs/2501.09012)
Keywords: generation
Abstract: We present the first study on how the reasoning ability of Multimodal LLMs (MLLMs) should be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. We propose ArtCoT and demonstrate that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code available at this https URL.
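Summary: We present the first study of how the reasoning ability of multimodal LLMs (MLLMs) can be used to evaluate the aesthetics of artworks. To support this study, we construct MM-StyleBench, a new high-quality dataset for benchmarking artistic stylization. We then develop a principled method for modeling human preference and carry out a systematic correlation analysis between MLLM responses and human preference. Our experiments reveal an inherent hallucination problem of MLLMs in art evaluation that is associated with response subjectivity. We propose ArtCoT and show that art-specific task decomposition and the use of concrete language improve MLLMs' reasoning about aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code is available at this https URL.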

Title: Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09019
Pdf URL: https://arxiv.org/pdf/2501.09019
Copy Paste: [[2501.09019]] Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion(https://arxiv.org/abs/2501.09019)
Keywords: generation
Abstract: The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
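Summary: First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach to tuning-free long video generation. The technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the head of the queue while Gaussian noise is enqueued at the tail. However, because it lacks cross-frame correspondence modeling, FIFO-Diffusion often struggles to maintain long-range temporal consistency in the generated videos. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency and ensure perceptually smooth transitions between frames. To enhance subject consistency, we design a Subject-Aware Cross-Frame Attention (SACFA) mechanism that aligns subjects across frames within short segments for better visual coherence. In addition, we introduce self-recurrent guidance, which uses information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich, contextual global information interaction. Extensive long video generation experiments on the VBench benchmark demonstrate the superiority of Ouroboros-Diffusion, particularly in subject consistency, motion smoothness, and temporal consistency.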
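
The FIFO queue mechanics described in the abstract above can be illustrated with a toy sketch. It assumes a hypothetical `toy_denoise_one_level` stand-in in place of a real text-to-video diffusion model and covers only the baseline FIFO scheduling (denoise every queued frame by one level, emit the clean head, enqueue fresh Gaussian noise at the tail), not Ouroboros-Diffusion's latent sampling, SACFA, or self-recurrent guidance.

```python
import random
from collections import deque


def toy_denoise_one_level(frame):
    """Hypothetical stand-in for one denoising step of a video diffusion model."""
    return [0.5 * v for v in frame]


def fifo_generate(num_frames, queue_len=4, frame_dim=3, seed=0):
    """Toy FIFO scheduling: noise level rises from the head (1) to the tail (queue_len)."""
    rng = random.Random(seed)

    def gaussian_frame():
        return [rng.gauss(0.0, 1.0) for _ in range(frame_dim)]

    # Initialize the queue so that position k holds a frame at noise level k.
    queue = deque()
    for level in range(1, queue_len + 1):
        frame = gaussian_frame()
        for _ in range(queue_len - level):
            frame = toy_denoise_one_level(frame)
        queue.append((frame, level))

    outputs = []
    for _ in range(num_frames):
        # Denoise every queued frame by one level in a single pass.
        queue = deque((toy_denoise_one_level(f), lvl - 1) for f, lvl in queue)
        # The head has reached level 0, so it is clean: dequeue and emit it.
        clean_frame, level = queue.popleft()
        assert level == 0
        outputs.append(clean_frame)
        # Enqueue fresh Gaussian noise at the tail at the highest noise level.
        queue.append((gaussian_frame(), queue_len))

    return outputs, [lvl for _, lvl in queue]


frames, levels = fifo_generate(num_frames=5)
print(len(frames), levels)
```

Because the queue length stays fixed while frames stream out of the head, this scheduling can in principle run for any number of frames, which is the property that makes the approach suitable for long videos.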