2024-01-11

diffusion

Title: Diffusion-based Pose Refinement and Multi-hypothesis Generation for 3D Human Pose Estimation. (arXiv:2401.04921v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.04921
Code URL: https://github.com/KHB1698/DRPose
Copy Paste: [[2401.04921]] Diffusion-based Pose Refinement and Multi-hypothesis Generation for 3D Human Pose Estimation(http://arxiv.org/abs/2401.04921)
Summary:
Previous probabilistic models for 3D Human Pose Estimation (3DHPE) aimed to enhance pose accuracy by generating multiple hypotheses. However, most of the hypotheses generated deviate substantially from the true pose. Compared to deterministic models, the excessive uncertainty in probabilistic models leads to weaker performance in single-hypothesis prediction. To address these two challenges, we propose a diffusion-based refinement framework called DRPose, which refines the output of deterministic models by reverse diffusion and achieves more suitable multi-hypothesis prediction for the current pose benchmark by multi-step refinement with multiple noises. To this end, we propose a Scalable Graph Convolution Transformer (SGCT) and a Pose Refinement Module (PRM) for denoising and refining. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate that our method achieves state-of-the-art performance on both single and multi-hypothesis 3DHPE. Code is available at https://github.com/KHB1698/DRPose.
摘要：
之前的 3D 人体姿势估计 (3DHPE) 概率模型旨在通过生成多个假设来提高姿势准确性。然而，大多数生成的假设与真实姿势有很大偏差。与确定性模型相比，概率模型过多的不确定性导致单假设预测的性能较差。为了解决这两个挑战，我们提出了一种名为 DRPose 的基于扩散的细化框架，该框架通过反向扩散细化确定性模型的输出，并通过使用多个噪声的多步细化实现对当前姿态基准更合适的多假设预测。为此，我们提出了可扩展图卷积变换器（SGCT）和用于去噪和细化的姿势细化模块（PRM）。在 Human3.6M 和 MPI-INF-3DHP 数据集上进行的大量实验表明，我们的方法在单假设和多假设 3DHPE 上均实现了最先进的性能。代码可在 https://github.com/KHB1698/DRPose 获取。

Title: SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion Constraint for Remote Sensing Image. (arXiv:2401.05093v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05093
Code URL: null
Copy Paste: [[2401.05093]] SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion Constraint for Remote Sensing Image(http://arxiv.org/abs/2401.05093)
Summary:
With recent advancements in aerospace technology, the volume of unlabeled remote sensing image (RSI) data has increased dramatically. Effectively leveraging this data through self-supervised learning (SSL) is vital in the field of remote sensing. However, current methodologies, particularly contrastive learning (CL), a leading SSL method, encounter specific challenges in this domain. Firstly, CL often mistakenly identifies geographically adjacent samples with similar semantic content as negative pairs, leading to confusion during model training. Secondly, as an instance-level discriminative task, it tends to neglect the essential fine-grained features and complex details inherent in unstructured RSIs. To overcome these obstacles, we introduce SwiMDiff, a novel self-supervised pre-training framework designed for RSIs. SwiMDiff employs a scene-wide matching approach that effectively recalibrates labels to recognize data from the same scene as false negatives. This adjustment makes CL more applicable to the nuances of remote sensing. Additionally, SwiMDiff seamlessly integrates CL with a diffusion model. Through the implementation of pixel-level diffusion constraints, we enhance the encoder's ability to capture both the global semantic information and the fine-grained features of the images more comprehensively. Our proposed framework significantly enriches the information available for downstream tasks in remote sensing. Demonstrating exceptional performance in change detection and land-cover classification tasks, SwiMDiff proves its substantial utility and value in the field of remote sensing.
摘要：
随着航空航天技术的最新进步，未标记的遥感图像 (RSI) 数据量急剧增加。通过自我监督学习（SSL）有效利用这些数据在遥感领域至关重要。然而，当前的方法，特别是对比学习（CL）（一种领先的 SSL 方法），在该领域遇到了特定的挑战。首先，CL经常错误地将具有相似语义内容的地理上相邻的样本识别为负对，导致模型训练过程中的混乱。其次，作为实例级判别任务，它往往忽略非结构化 RSI 固有的细粒度特征和复杂细节。为了克服这些障碍，我们引入了 SwiMDiff，这是一种专为 RSI 设计的新型自监督预训练框架。 SwiMDiff 采用场景范围匹配方法，可以有效地重新校准标签，以将来自同一场景的数据识别为漏报。这一调整使 CL 更适用于遥感的细微差别。此外，SwiMDiff 将 CL 与扩散模型无缝集成。通过实施像素级扩散约束，我们增强了编码器更全面地捕获图像的全局语义信息和细粒度特征的能力。我们提出的框架显着丰富了遥感下游任务可用的信息。 SwiMDiff 在变化检测和土地覆盖分类任务中表现出卓越的性能，证明了其在遥感领域的巨大实用性和价值。
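
The false-negative issue described above can be illustrated with a small sketch. It assumes each sample in a batch carries a hypothetical scene identifier (`scene_ids` is an illustrative name, not from the paper) and builds a mask saying which pairs may be used as negatives in a contrastive loss. This is only one way the scene-wide matching idea could look, not SwiMDiff's actual implementation.

```python
def negative_mask(scene_ids):
    """Return mask[i][j] = True where sample j may serve as a negative for anchor i.

    Samples from the same scene are excluded, so geographically adjacent
    crops with similar content are never pushed apart as negatives.
    """
    n = len(scene_ids)
    return [
        [scene_ids[i] != scene_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical batch: the first two crops come from the same scene.
mask = negative_mask(["scene_a", "scene_a", "scene_b"])
for row in mask:
    print(row)
```

In a real training loop the mask would typically be applied to the similarity matrix before the softmax, so that masked pairs contribute nothing to the denominator.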

Title: CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model. (arXiv:2401.05153v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05153
Code URL: null
Copy Paste: [[2401.05153]] CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model(http://arxiv.org/abs/2401.05153)
Summary:
Fusion of a panchromatic (PAN) image and corresponding multispectral (MS) image is also known as pansharpening, which aims to combine abundant spatial details of PAN and spectral information of MS. Due to the absence of high-resolution MS images, available deep-learning-based methods usually follow the paradigm of training at reduced resolution and testing at both reduced and full resolution. When taking original MS and PAN images as inputs, they always obtain sub-optimal results due to the scale variation. In this paper, we propose to explore the self-supervised representation of pansharpening by designing a cross-predictive diffusion model, named CrossDiff. It has two-stage training. In the first stage, we introduce a cross-predictive pretext task to pre-train the UNet structure based on conditional DDPM, while in the second stage, the encoders of the UNets are frozen to directly extract spatial and spectral features from PAN and MS, and only the fusion head is trained to adapt to the pansharpening task. Extensive experiments show the effectiveness and superiority of the proposed model compared with state-of-the-art supervised and unsupervised methods. Besides, the cross-sensor experiments also verify the generalization ability of the proposed self-supervised representation learners for other satellites' datasets. We will release our code for reproducibility.
摘要：
全色（PAN）图像和相应的多光谱（MS）图像的融合也称为全色锐化，其目的是将PAN的丰富空间细节和MS的光谱信息结合起来。由于缺乏高分辨率 MS 图像，可用的基于深度学习的方法通常遵循降低分辨率训练以及降低分辨率和全分辨率测试的范式。当以原始 MS 和 PAN 图像作为输入时，由于尺度变化，它们总是获得次优结果。在本文中，我们建议通过设计一个名为 CrossDiff 的交叉预测扩散模型来探索全色锐化的自监督表示。它有两个阶段的训练。在第一阶段，我们引入交叉预测借口任务来基于条件DDPM预训练UNet结构，而在第二阶段，UNet的编码器被冻结以直接从PAN和MS中提取空间和光谱特征，并且只有融合头经过训练以适应全色锐化任务。大量的实验表明，与最先进的监督和无监督方法相比，所提出的模型的有效性和优越性。此外，跨传感器实验还验证了所提出的自监督表示学习器对其他卫星数据集的泛化能力。我们将发布我们的代码以实现可重复性。

Title: Derm-T2IM: Harnessing Synthetic Skin Lesion Data via Stable Diffusion Models for Enhanced Skin Disease Classification using ViT and CNN. (arXiv:2401.05159v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05159
Code URL: null
Copy Paste: [[2401.05159]] Derm-T2IM: Harnessing Synthetic Skin Lesion Data via Stable Diffusion Models for Enhanced Skin Disease Classification using ViT and CNN(http://arxiv.org/abs/2401.05159)
Summary:
This study explores the utilization of dermatoscopic synthetic data generated through stable diffusion models as a strategy for enhancing the robustness of machine learning model training. Synthetic data generation plays a pivotal role in mitigating challenges associated with limited labeled datasets, thereby facilitating more effective model training. In this context, we aim to incorporate enhanced data transformation techniques by extending the recent success of few-shot learning and a small amount of data representation in text-to-image latent diffusion models. The optimally tuned model is further used for rendering high-quality skin lesion synthetic data with diverse and realistic characteristics, providing a valuable supplement and diversity to the existing training data. We investigate the impact of incorporating newly generated synthetic data into the training pipeline of state-of-the-art machine learning models, assessing its effectiveness in enhancing model performance and generalization to unseen real-world data. Our experimental results demonstrate that synthetic data generated through stable diffusion models helps improve the robustness and adaptability of end-to-end CNN and vision transformer models on two different real-world skin lesion datasets.
摘要：
本研究探索利用通过稳定扩散模型生成的皮肤镜合成数据作为增强机器学习模型训练稳健性的策略。合成数据生成在缓解与有限标记数据集相关的挑战方面发挥着关键作用，从而促进更有效的模型训练。在这种背景下，我们的目标是通过扩展最近成功的少样本学习和文本到图像潜在扩散模型中的少量数据表示来整合增强的数据转换技术。经过优化调整的模型进一步用于渲染具有多样化和真实特征的高质量皮肤病变合成数据，为现有训练数据提供了有价值的补充和多样性。我们研究了将新生成的合成数据纳入最先进的机器学习模型的训练流程中的影响，评估其在增强模型性能和对未见过的现实世界数据的泛化方面的有效性。我们的实验结果表明，通过稳定扩散模型生成的合成数据的有效性有助于提高端到端 CNN 和视觉变换器模型在两个不同的真实皮肤病变数据集上的鲁棒性和适应性。

Title: PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models. (arXiv:2401.05252v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05252
Code URL: null
Copy Paste: [[2401.05252]] PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models(http://arxiv.org/abs/2401.05252)
Summary:
This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over PIXART-α. Additionally, PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-δ can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.
摘要：
本技术报告介绍了 PIXART-δ，这是一种文本到图像合成框架，它将潜在一致性模型 (LCM) 和 ControlNet 集成到先进的 PIXART-α 模型中。 PIXART-α 因其通过非常高效的训练过程生成 1024px 分辨率的高质量图像的能力而受到认可。 PIXART-δ 中 LCM 的集成显着加快了推理速度，只需 2-4 个步骤即可生成高质量图像。值得注意的是，PIXART-δ 在生成 1024x1024 像素图像方面突破了 0.5 秒，比 PIXART-α 提高了 7 倍。此外，PIXART-δ 设计为可在一天内在 32GB V100 GPU 上进行高效训练。凭借其 8 位推理能力（von Platen 等人，2023），PIXART-δ 可以在 8GB GPU 内存限制内合成 1024px 图像，大大增强了其可用性和可访问性。此外，结合类似 ControlNet 的模块可以对文本到图像扩散模型进行细粒度控制。我们引入了一种新颖的 ControlNet-Transformer 架构，专为 Transformer 量身定制，可在生成高质量图像的同时实现明确的可控性。作为最先进的开源图像生成模型，PIXART-δ 为稳定扩散模型系列提供了一种有前途的替代方案，为文本到图像的合成做出了重大贡献。

Title: Score Distillation Sampling with Learned Manifold Corrective. (arXiv:2401.05293v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05293
Code URL: null
Copy Paste: [[2401.05293]] Score Distillation Sampling with Learned Manifold Corrective(http://arxiv.org/abs/2401.05293)
Summary:
Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects. Instead, we train a shallow network mimicking the timestep-dependent denoising deficiency of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through several qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.
摘要：
分数蒸馏采样（SDS）是一种最近但已经广泛流行的方法，它依赖于图像扩散模型来使用文本提示来控制优化问题。在本文中，我们对 SDS 损失函数进行了深入分析，确定了其公式的固有问题，并提出了一个非常简单但有效的解决方案。具体来说，我们将损失分解为不同的因素，并隔离导致噪声梯度的成分。在最初的配方中，使用高文本指导来解决噪音，从而导致不必要的副作用。相反，我们训练一个浅层网络来模仿图像扩散模型的时间步相关的去噪缺陷，以便有效地将其分解出来。我们通过几个定性和定量实验证明了我们新颖的损失公式的多功能性和有效性，包括基于优化的图像合成和编辑、零样本图像翻译网络训练以及文本到 3D 合成。
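
For orientation, the SDS gradient that the summary analyzes is commonly written as below. The summary itself gives no formulas, so the notation is an assumption following the usual convention: $x = g(\theta)$ is the rendered image, $x_t = \alpha_t x + \sigma_t \epsilon$ its noised version, $y$ the text prompt, $w(t)$ a timestep weighting, and $s$ the classifier-free guidance scale. The paper's own decomposition of the loss may use different symbols and is not reproduced here.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
\hat{\epsilon}_\phi(x_t; y, t)
  = \epsilon_\phi(x_t; \varnothing, t)
  + s\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr)
```

The "high text guidance" mentioned in the summary corresponds to a large value of $s$ in this notation.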

self-supervised

Title: Source-Free Cross-Modal Knowledge Transfer by Unleashing the Potential of Task-Irrelevant Data. (arXiv:2401.05014v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05014
Code URL: null
Copy Paste: [[2401.05014]] Source-Free Cross-Modal Knowledge Transfer by Unleashing the Potential of Task-Irrelevant Data(http://arxiv.org/abs/2401.05014)
Summary:
Source-free cross-modal knowledge transfer is a crucial yet challenging task, which aims to transfer knowledge from one source modality (e.g., RGB) to the target modality (e.g., depth or infrared) with no access to the task-relevant (TR) source data due to memory and privacy concerns. A recent attempt leverages the paired task-irrelevant (TI) data and directly matches the features from them to eliminate the modality gap. However, it ignores a pivotal clue that the paired TI data could be utilized to effectively estimate the source data distribution and better facilitate knowledge transfer to the target modality. To this end, we propose a novel yet concise framework to unlock the potential of paired TI data for enhancing source-free cross-modal knowledge transfer. Our work is buttressed by two key technical components. Firstly, to better estimate the source data distribution, we introduce a Task-irrelevant data-Guided Modality Bridging (TGMB) module. It translates the target modality data (e.g., infrared) into the source-like RGB images based on paired TI data and the guidance of the available source model to alleviate two key gaps: 1) inter-modality gap between the paired TI data; 2) intra-modality gap between TI and TR target data. We then propose a Task-irrelevant data-Guided Knowledge Transfer (TGKT) module that transfers knowledge from the source model to the target model by leveraging the paired TI data. Notably, due to the unavailability of labels for the TR target data and its less reliable prediction from the source model, our TGKT model incorporates a self-supervised pseudo-labeling approach to enable the target model to learn from its predictions. Extensive experiments show that our method achieves state-of-the-art performance on three datasets (RGB-to-depth and RGB-to-infrared).
摘要：
无源跨模态知识转移是一项至关重要但具有挑战性的任务，其目的是将知识从一种源模态（例如 RGB）转移到目标模态（例如深度或红外），而由于内存和隐私问题，无法访问任务相关（TR）源数据。最近的一项尝试利用配对的任务无关（TI）数据并直接匹配其中的特征以消除模态差距。然而，它忽略了一个关键线索，即配对的 TI 数据可用于有效估计源数据分布并更好地促进知识向目标模态的迁移。为此，我们提出了一个新颖而简洁的框架，以释放配对 TI 数据的潜力，以增强无源跨模式知识转移。我们的工作由两个关键技术组成部分支撑。首先，为了更好地估计源数据分布，我们引入了任务无关数据引导模态桥接（TGMB）模块。它基于配对 TI 数据和可用源模型的指导，将目标模态数据（例如红外）转换为类源 RGB 图像，以缩小两个关键差距：1）配对 TI 数据之间的模态间差距； 2）TI和TR目标数据之间的模态内差距。然后，我们提出了一个与任务无关的数据引导知识转移（TGKT）模块，该模块通过利用配对的 TI 数据将知识从源模型转移到目标模型。值得注意的是，由于 TR 目标数据的标签不可用，并且源模型的预测不太可靠，我们的 TGKT 模型采用了自我监督的伪标签方法，使目标模型能够从其预测中学习。大量实验表明，我们的方法在三个数据集（RGB 到深度和 RGB 到红外）上实现了最先进的性能。

Title: Toward distortion-aware change detection in realistic scenarios. (arXiv:2401.05157v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05157
Code URL: null
Copy Paste: [[2401.05157]] Toward distortion-aware change detection in realistic scenarios(http://arxiv.org/abs/2401.05157)
Summary:
In the conventional change detection (CD) pipeline, two manually registered and labeled remote sensing datasets serve as the input of the model for training and prediction. However, in realistic scenarios, data from different periods or sensors could fail to be aligned as a result of various coordinate systems. Geometric distortion caused by coordinate shifting remains a thorny issue for CD algorithms. In this paper, we propose a reusable self-supervised framework for bitemporal geometric distortion in CD tasks. The whole framework is composed of Pretext Representation Pre-training, Bitemporal Image Alignment, and Down-stream Decoder Fine-Tuning. With only single-stage pre-training, the key components of the framework can be reused for assistance in the bitemporal image alignment, while simultaneously enhancing the performance of the CD decoder. Experimental results in 2 large-scale realistic scenarios demonstrate that our proposed method can alleviate the bitemporal geometric distortion in CD tasks.
摘要：
在传统的变化检测（CD）管道中，两个手动注册和标记的遥感数据集作为训练和预测模型的输入。然而，在现实场景中，由于坐标系不同，来自不同时期或传感器的数据可能无法对齐。坐标偏移引起的几何失真仍然是 CD 算法的棘手问题。在本文中，我们提出了一种可重用的自我监督框架，用于 CD 任务中的双时态几何失真。整个框架由借口表示预训练、双时图像对齐和下游解码器微调组成。只需单阶段预训练，框架的关键组件就可以重复使用，以帮助双时图像对齐，同时增强 CD 解码器的性能。 2个大规模现实场景的实验结果表明，我们提出的方法可以减轻CD任务中的双时态几何失真。

Title: HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling for Long-Term Forecasting. (arXiv:2401.05012v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.05012
Code URL: null
Copy Paste: [[2401.05012]] HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling for Long-Term Forecasting(http://arxiv.org/abs/2401.05012)
Summary:
Time series forecasting is crucial and challenging in the real world. The recent surge in interest regarding time series foundation models, which cater to a diverse array of downstream tasks, is noteworthy. However, existing methods often overlook the multi-scale nature of time series, an aspect crucial for precise forecasting. To bridge this gap, we propose HiMTM, a hierarchical multi-scale masked time series modeling method designed for long-term forecasting. Specifically, it comprises four integral components: (1) a hierarchical multi-scale transformer (HMT) that captures temporal information at different scales; (2) a decoupled encoder-decoder (DED) that forces the encoder to focus on feature extraction and the decoder to focus on pretext tasks; (3) multi-scale masked reconstruction (MMR) that provides multi-stage supervision signals for pre-training; (4) cross-scale attention fine-tuning (CSA-FT) that captures dependencies between different scales for forecasting. Collectively, these components enhance multi-scale feature extraction capabilities in masked time series modeling and contribute to improved prediction accuracy. We conduct extensive experiments on 7 mainstream datasets to prove that HiMTM has obvious advantages over contemporary self-supervised and end-to-end learning methods. The effectiveness of HiMTM is further showcased by its application in the industry of natural gas demand forecasting.
摘要：
时间序列预测在现实世界中至关重要且具有挑战性。值得注意的是，最近人们对时间序列基础模型的兴趣激增，这些模型可以满足各种下游任务的需求。然而，现有的方法常常忽视时间序列的多尺度性质，而这对于精确预测至关重要。为了弥补这一差距，我们提出了 HiMTM，一种专为长期预测而设计的分层多尺度屏蔽时间序列建模方法。具体来说，它包含四个组成部分：（1）分层多尺度变换器（HMT），用于捕获不同尺度的时间信息；（2）解耦编码器-解码器（DED）迫使编码器专注于特征提取，而解码器专注于借口任务；（3）多尺度掩蔽重建（MMR）为预训练提供多级监督信号；（4）跨尺度注意力微调（CSA-FT）以捕获不同尺度之间的依赖关系进行预测。总的来说，这些组件增强了屏蔽时间序列建模中的多尺度特征提取能力，并有助于提高预测精度。我们在 7 个主流数据集上进行了广泛的实验，证明 HiMTM 相对于当代的自监督和端到端学习方法具有明显的优势。 HiMTM在天然气需求预测行业的应用进一步证明了其有效性。
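
As a rough illustration of masked time series modeling at more than one scale, the sketch below splits a series into non-overlapping patches and hides a random subset of them, once per patch length. The function name, patch lengths, and mask ratio are all hypothetical choices for illustration; the summary does not state HiMTM's actual patching or masking scheme.

```python
import random


def mask_patches(series, patch_len, mask_ratio, seed=0):
    """Split a series into non-overlapping patches and hide a random subset.

    Returns the full patches, the visible view (masked patches replaced by
    None), and the sorted indices of the masked patches.
    """
    patches = [series[i:i + patch_len] for i in range(0, len(series), patch_len)]
    n_masked = int(len(patches) * mask_ratio)
    rng = random.Random(seed)
    masked_idx = set(rng.sample(range(len(patches)), n_masked))
    visible = [None if i in masked_idx else p for i, p in enumerate(patches)]
    return patches, visible, sorted(masked_idx)


# Multi-scale view: the same toy series patched at two hypothetical scales.
series = list(range(16))
for patch_len in (2, 4):
    patches, visible, masked_idx = mask_patches(series, patch_len, mask_ratio=0.5)
    print(patch_len, masked_idx, visible)
```

A reconstruction objective would then ask a model to predict the original contents of the masked patches from the visible ones at each scale.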

foundation model

generative

Title: Content-Conditioned Generation of Stylized Free hand Sketches. (arXiv:2401.04739v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.04739
Code URL: null
Copy Paste: [[2401.04739]] Content-Conditioned Generation of Stylized Free hand Sketches(http://arxiv.org/abs/2401.04739)
Summary:
In recent years, the recognition of free-hand sketches has remained a popular task. However, in some special fields such as the military field, free-hand sketches are difficult to sample on a large scale. Common data augmentation and image generation techniques struggle to produce images with various free-hand sketching styles. Therefore, the recognition and segmentation tasks in related fields are limited. In this paper, we propose a novel adversarial generative network that can accurately generate realistic free-hand sketches with various styles. We explore the performance of the model, including using styles randomly sampled from a prior normal distribution to generate images with various free-hand sketching styles, disentangling the painters' styles from known free-hand sketches to generate images with specific styles, and generating images of unknown classes that are not in the training set. We further demonstrate with qualitative and quantitative evaluations our advantages in visual quality, content accuracy, and style imitation on SketchIME.
摘要：
近年来，手绘草图的识别仍然是一项热门任务。但在军事领域等一些特殊领域，徒手草图很难大规模采样。常见的数据增强和图像生成技术很难生成具有各种手绘草图风格的图像。因此，相关领域的识别和分割任务受到限制。在本文中，我们提出了一种新颖的对抗性生成网络，可以准确生成各种风格的逼真手绘草图。我们探索了模型的性能，包括使用从先验正态分布中随机采样的样式来生成具有各种手绘草图风格的图像，将画家的风格与已知的手绘草图分离以生成具有特定风格的图像，以及生成图像不在训练集中的未知类。我们通过定性和定量评估进一步证明了我们在 SketchIME 上的视觉质量、内容准确性和风格模仿方面的优势。

Title: AdvMT: Adversarial Motion Transformer for Long-term Human Motion Prediction. (arXiv:2401.05018v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05018
Code URL: null
Copy Paste: [[2401.05018]] AdvMT: Adversarial Motion Transformer for Long-term Human Motion Prediction(http://arxiv.org/abs/2401.05018)
Summary:
To achieve seamless collaboration between robots and humans in a shared environment, accurately predicting future human movements is essential. Human motion prediction has traditionally been approached as a sequence prediction problem, leveraging historical human motion data to estimate future poses. Beginning with vanilla recurrent networks, the research community has investigated a variety of methods for learning human motion dynamics, encompassing graph-based and generative approaches. Despite these efforts, achieving accurate long-term predictions continues to be a significant challenge. In this regard, we present the Adversarial Motion Transformer (AdvMT), a novel model that integrates a transformer-based motion encoder and a temporal continuity discriminator. This combination effectively captures spatial and temporal dependencies simultaneously within frames. With adversarial training, our method effectively reduces the unwanted artifacts in predictions, thereby ensuring the learning of more realistic and fluid human motions. The evaluation results indicate that AdvMT greatly enhances the accuracy of long-term predictions while also delivering robust short-term predictions.
摘要：
为了在共享环境中实现机器人与人类之间的无缝协作，准确预测人类未来的运动至关重要。传统上，人体运动预测被视为序列预测问题，利用历史人体运动数据来估计未来的姿势。从普通的循环网络开始，研究界研究了各种学习人体运动动力学的方法，包括基于图的方法和生成方法。尽管做出了这些努力，实现准确的长期预测仍然是一项重大挑战。在这方面，我们提出了对抗运动变换器（AdvMT），这是一种集成了基于变换器的运动编码器和时间连续性鉴别器的新颖模型。这种组合有效地在帧内同时捕获空间和时间依赖性。通过对抗性训练，我们的方法有效地减少了预测中不需要的伪影，从而确保学习更真实、更流畅的人体动作。评估结果表明，AdvMT 极大地提高了长期预测的准确性，同时也提供了稳健的短期预测

Title: Application of Deep Learning in Blind Motion Deblurring: Current Status and Future Prospects. (arXiv:2401.05055v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05055
Code URL: https://github.com/visionverse/blind-motion-deblurring-survey
Copy Paste: [[2401.05055]] Application of Deep Learning in Blind Motion Deblurring: Current Status and Future Prospects(http://arxiv.org/abs/2401.05055)
Summary:
Motion deblurring is one of the fundamental problems of computer vision and has received continuous attention. The variability in blur, both within and across images, imposes limitations on non-blind deblurring techniques that rely on estimating the blur kernel. As a response, blind motion deblurring has emerged, aiming to restore clear and detailed images without prior knowledge of the blur type, fueled by the advancements in deep learning methodologies. Despite strides in this field, a comprehensive synthesis of recent progress in deep learning-based blind motion deblurring is notably absent. This paper fills that gap by providing an exhaustive overview of the role of deep learning in blind motion deblurring, encompassing datasets, evaluation metrics, and methods developed over the last six years. Specifically, we first introduce the types of motion blur and the fundamental principles of deblurring. Next, we outline the shortcomings of traditional non-blind deblurring algorithms, emphasizing the advantages of employing deep learning techniques for deblurring tasks. Following this, we categorize and summarize existing blind motion deblurring methods based on different backbone networks, including convolutional neural networks, generative adversarial networks, recurrent neural networks, and Transformer networks. Subsequently, we not only elaborate on the fundamental principles of these different categories but also provide a comprehensive summary and comparison of their advantages and limitations. Qualitative and quantitative experiments conducted on four widely used datasets further compare the performance of SOTA methods. Finally, we analyze present challenges and future pathways. All collected models, benchmark datasets, source code links, and codes for evaluation have been made publicly available at https://github.com/VisionVerse/Blind-Motion-Deblurring-Survey
摘要：
运动去模糊是计算机视觉的基本问题之一，受到持续关注。图像内部和图像之间的模糊变化对依赖于估计模糊内核的非盲去模糊技术施加了限制。作为回应，盲运动去模糊应运而生，旨在在深度学习方法的进步的推动下，在不事先了解模糊类型的情况下恢复清晰详细的图像。尽管该领域取得了长足的进步，但对基于深度学习的盲运动去模糊的最新进展的全面综合仍然明显缺乏。本文通过详尽概述深度学习在盲运动去模糊中的作用（包括过去六年开发的数据集、评估指标和方法）来填补这一空白。具体来说，我们首先介绍运动模糊的类型和去模糊的基本原理。接下来，我们概述了传统非盲去模糊算法的缺点，强调了采用深度学习技术进行去模糊任务的优势。接下来，我们根据不同的骨干网络对现有的盲运动去模糊方法进行分类和总结，包括卷积神经网络、生成对抗网络、循环神经网络和 Transformer 网络。随后，我们不仅阐述了这些不同类别的基本原理，还对它们的优点和局限性进行了全面的总结和比较。在四个广泛使用的数据集上进行的定性和定量实验结果进一步比较了 SOTA 方法的性能。最后，分析当前的挑战和未来的路径。所有收集的模型、基准数据集、源代码链接和评估代码均已在 https://github.com/VisionVerse/Blind-Motion-Deblurring-Survey 上公开发布

Title: MISS: A Generative Pretraining and Finetuning Approach for Med-VQA. (arXiv:2401.05163v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05163
Code URL: null
Copy Paste: [[2401.05163]] MISS: A Generative Pretraining and Finetuning Approach for Med-VQA(http://arxiv.org/abs/2401.05163)
Summary:
Medical visual question answering (VQA) is a challenging multimodal task, where Vision-Language Pre-training (VLP) models can effectively improve the generalization performance. However, most methods in the medical field treat VQA as an answer classification task which is difficult to transfer to practical application scenarios. Additionally, due to the privacy of medical images and the expensive annotation process, large-scale medical image-text pairs datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using large language models (LLMs), enabling those traditional medical vision field task data to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The code and model weights will be released upon the paper's acceptance.
摘要：
医学视觉问答（VQA）是一项具有挑战性的多模态任务，其中视觉语言预训练（VLP）模型可以有效提高泛化性能。然而，医学领域的大多数方法将VQA视为答案分类任务，很难迁移到实际应用场景。此外，由于医学图像的隐私性和昂贵的注释过程，严重缺乏用于预训练的大规模医学图像-文本对数据集。在本文中，我们提出了一种用于医疗 VQA 任务的基于大规模多任务自监督学习的框架（MISS）。与现有方法不同，我们将医学 VQA 视为一项生成任务。我们统一文本编码器和多模态编码器，并通过多任务学习对齐图像文本特征。此外，我们提出了一种传输和标题方法，该方法使用大语言模型（LLM）扩展单模态图像数据集的特征空间，使这些传统的医学视觉领域任务数据能够应用于 VLP。实验表明，我们的方法用更少的多模态数据集取得了优异的结果，并展示了生成式 VQA 模型的优势。代码和模型权重将在论文被接受后发布。

Title: InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes. (arXiv:2401.05335v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.05335
Code URL: null
Copy Paste: [[2401.05335]] InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes(http://arxiv.org/abs/2401.05335)
Summary:
We introduce InseRF, a novel method for generative object insertion in the NeRF reconstructions of 3D scenes. Based on a user-provided textual description and a 2D bounding box in a reference viewpoint, InseRF generates new objects in 3D scenes. Recently, methods for 3D scene editing have been profoundly transformed, owing to the use of strong priors of text-to-image diffusion models in 3D generative modeling. Existing methods are mostly effective in editing 3D scenes via style and appearance changes or removing existing objects. Generating new objects, however, remains a challenge for such methods, which we address in this study. Specifically, we propose grounding the 3D object insertion to a 2D object insertion in a reference view of the scene. The 2D edit is then lifted to 3D using a single-view object reconstruction method. The reconstructed object is then inserted into the scene, guided by the priors of monocular depth estimation methods. We evaluate our method on various 3D scenes and provide an in-depth analysis of the proposed components. Our experiments with generative insertion of objects in several 3D scenes indicate the effectiveness of our method compared to the existing methods. InseRF is capable of controllable and 3D-consistent object insertion without requiring explicit 3D information as input. Please visit our project page at https://mohamad-shahbazi.github.io/inserf.
摘要：
我们介绍 InseRF，这是一种在 3D 场景的 NeRF 重建中生成对象插入的新方法。基于用户提供的文本描述和参考视点中的 2D 边界框，InseRF 在 3D 场景中生成新对象。最近，由于在 3D 生成建模中使用了文本到图像扩散模型的强先验，3D 场景编辑方法已经发生了深刻的转变。现有方法在通过样式和外观更改或删除现有对象来编辑 3D 场景时最有效。然而，生成新对象仍然是此类方法的一个挑战，我们在本研究中解决了这个问题。具体来说，我们建议将 3D 对象插入基础为场景参考视图中的 2D 对象插入。然后使用单视图对象重建方法将 2D 编辑提升为 3D。然后，在单目深度估计方法的先验指导下，将重建的对象插入场景中。我们在各种 3D 场景上评估我们的方法，并对所提出的组件进行深入分析。我们在多个 3D 场景中生成对象插入的实验表明，与现有方法相比，我们的方法是有效的。 InseRF 能够进行可控且 3D 一致的对象插入，而不需要明确的 3D 信息作为输入。请访问我们的项目页面：https://mohamad-shahbazi.github.io/inserf。

Title: BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation. (arXiv:2401.05125v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05125
Code URL: null
Copy Paste: [[2401.05125]] BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation(http://arxiv.org/abs/2401.05125)
Summary:
Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). A popular approach to the task is name-based methods, i.e. those identifying the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, as these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance, especially for KBs where homonyms account for a large amount of entity mentions (e.g. UMLS and NCBI Gene). We therefore present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. Specifically, BELHD builds upon the BioSyn (Sung et al., 2020) model, introducing two crucial extensions. First, it performs a preprocessing of the KB in which it expands homonyms with an automatically chosen disambiguating string, thus enforcing unique linking decisions. Second, we introduce candidate sharing, a novel strategy to select candidates for contrastive learning that enhances the overall training signal. Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the core prediction model and thus can also improve other methods, which we exemplify for GenBioEL (Yuan et al., 2022), a generative name-based BEL approach. Code is available at: link added upon publication.
摘要：
生物医学实体链接（BEL）是将实体提及基础到知识库（KB）的任务。完成该任务的一种流行方法是基于名称的方法，即通过密集检索或自回归建模来识别知识库中给定提及的最合适名称的方法。然而，由于这些方法直接返回知识库名称，因此它们无法处理同音异义词，即不同的知识库实体共享完全相同的名称。这会显着影响它们的性能，特别是对于同音异义词占大量实体提及的知识库（例如 UMLS 和 NCBI Gene）。因此，我们提出了 BELHD（具有同音词消歧功能的生物医学实体链接），这是一种应对这一挑战的新的基于名称的方法。具体来说，BELHD 建立在 BioSyn（Sung 等人，2020）模型的基础上，引入了两个关键的扩展。首先，它对知识库进行预处理，其中使用自动选择的消歧字符串扩展同音异义词，从而强制执行唯一的链接决策。其次，我们引入候选共享，这是一种选择候选进行对比学习的新颖策略，可以增强整体训练信号。对 10 个语料库和 5 种实体类型的实验表明，BELHD 在最先进的方法的基础上进行了改进，在 10 个语料库中的 6 个中取得了最佳结果，平均提高了 4.55pp 召回率@1。此外，KB 预处理与核心预测模型正交，因此也可以改进其他方法，我们以 GenBioEL（Yuan 等人，2022）为例，这是一种基于名称的生成 BEL 方法。代码可在以下位置获得：发布时添加的链接。
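
The KB preprocessing step described above (expanding homonyms so that every name maps to exactly one entity) can be sketched as follows. The toy KB, the entity IDs, and the `disambiguators` mapping are made up for illustration; in BELHD the disambiguating string is chosen automatically, and the summary does not say how, so here it is simply supplied by hand.

```python
from collections import defaultdict


def expand_homonyms(kb, disambiguators):
    """Append a disambiguating string to every name shared by several entities.

    kb: dict mapping entity ID -> name.
    disambiguators: dict mapping entity ID -> string (hypothetical input;
    the selection rule used by the paper is not given in the summary).
    """
    ids_by_name = defaultdict(list)
    for entity_id, name in kb.items():
        ids_by_name[name].append(entity_id)

    expanded = {}
    for entity_id, name in kb.items():
        if len(ids_by_name[name]) > 1:
            # Homonym: make the name unique for this entity.
            expanded[entity_id] = f"{name} ({disambiguators[entity_id]})"
        else:
            expanded[entity_id] = name
    return expanded


# Made-up toy KB: two entities share the name "TP53".
kb = {"E1": "TP53", "E2": "TP53", "E3": "BRCA1"}
hints = {"E1": "human", "E2": "mouse", "E3": "human"}
expanded = expand_homonyms(kb, hints)
print(expanded)
```

After this step a name-based linker that returns a KB name also returns a unique entity, which is the property the summary calls "unique linking decisions".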

Title: A Good Score Does not Lead to A Good Generative Model. (arXiv:2401.04856v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04856
Code URL: null
Copy Paste: [[2401.04856]] A Good Score Does not Lead to A Good Generative Model(http://arxiv.org/abs/2401.04856)
Summary:
Score-based Generative Models (SGMs) are a leading method in generative modeling, renowned for their ability to generate high-quality samples from complex, high-dimensional data distributions. The method enjoys empirical success and is supported by rigorous theoretical convergence properties. In particular, it has been shown that SGMs can generate samples from a distribution that is close to the ground truth if the underlying score function is learned well, suggesting the success of SGMs as generative models. We provide a counter-example in this paper. Through a sample complexity argument, we provide one specific setting where the score function is learned well. Yet, SGMs in this setting can only output samples that are Gaussian blurrings of training data points, mimicking the effects of kernel density estimation. The finding resonates with a series of recent findings that reveal that SGMs can demonstrate a strong memorization effect and fail to generate.
摘要：
基于分数的生成模型 (SGM) 是生成建模中的一种领先方法，以其从复杂的高维数据分布生成高质量样本的能力而闻名。该方法在实证上取得了成功，并得到了严格的理论收敛特性的支持。特别是，如果底层得分函数学得好，SGM 可以从接近真实值的分布中生成样本，这表明 SGM 作为生成模型是成功的。我们在本文中提供了一个反例。通过样本复杂度论证，我们提供了一种可以很好地学习得分函数的特定设置。然而，在这种设置下，SGM 只能输出训练数据点高斯模糊的样本，模仿核密度估计的效果。这一发现与最近的一系列发现相呼应，这些发现表明 SGM 可以表现出很强的记忆效果，但无法生成。
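
To make the "Gaussian blurring of training data points" remark concrete, the sketch below computes the score of a Gaussian kernel density estimate and draws a sample from it. It is a standard textbook construction, not the paper's counter-example: the function names, the fixed bandwidth `sigma`, and the toy data are assumptions of this illustration.

```python
import math
import random


def kde_score(x, train_points, sigma):
    """Score (gradient of the log-density) of a Gaussian KDE at point x.

    The KDE is an equal-weight mixture of Gaussians N(x_i, sigma^2 I)
    centred on the training points x_i.
    """
    # Log-weight of each mixture component at x (up to a shared constant).
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2)
        for p in train_points
    ]
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Score = sum_i w_i(x) * (x_i - x) / sigma^2.
    return [
        sum(w * (p[d] - x[d]) for w, p in zip(weights, train_points)) / sigma ** 2
        for d in range(len(x))
    ]


def sample_kde(train_points, sigma, rng):
    """Draw one KDE sample: a random training point plus Gaussian noise."""
    p = rng.choice(train_points)
    return [c + rng.gauss(0.0, sigma) for c in p]


if __name__ == "__main__":
    data = [[-1.0], [1.0]]
    print(kde_score([0.0], data, sigma=1.0))
    print(sample_kde(data, sigma=0.1, rng=random.Random(0)))
```

Every sample produced by `sample_kde` is a training point with noise added, so a sampler whose learned score matches `kde_score` reproduces blurred copies of the training set instead of new data.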

Title: Rethinking Test-time Likelihood: The Likelihood Path Principle and Its Application to OOD Detection. (arXiv:2401.04933v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04933
Code URL: https://github.com/XavierXiao/Likelihood-Regret
Copy Paste: [[2401.04933]] Rethinking Test-time Likelihood: The Likelihood Path Principle and Its Application to OOD Detection(http://arxiv.org/abs/2401.04933)
Summary:
While likelihood is attractive in theory, its estimates by deep generative models (DGMs) are often broken in practice and perform poorly for out-of-distribution (OOD) detection. Various recent works have started to consider alternative scores and achieved better performance. However, such recipes do not come with provable guarantees, nor is it clear that their choices extract sufficient information.

We attempt to change this by conducting a case study on variational autoencoders (VAEs). First, we introduce the likelihood path (LPath) principle, generalizing the likelihood principle. This narrows the search for informative summary statistics down to the minimal sufficient statistics of VAEs' conditional likelihoods. Second, introducing new theoretical tools such as nearly essential support, essential distance and co-Lipschitzness, we obtain non-asymptotic provable OOD detection guarantees for a certain distillation of the minimal sufficient statistics. The corresponding LPath algorithm demonstrates SOTA performance, even when using simple and small VAEs with poor likelihood estimates. To the best of our knowledge, this is the first provable unsupervised OOD method that delivers excellent empirical results, better than any other VAE-based techniques. We use the same model as \cite{xiao2020likelihood}, open sourced from: https://github.com/XavierXiao/Likelihood-Regret
摘要：
虽然似然在理论上很有吸引力，但深度生成模型 (DGM) 对它的估计在实践中经常失效，并且在分布外 (OOD) 检测中表现不佳。最近的各种工作开始考虑替代评分并取得了更好的性能。然而，这些方法并没有提供可证明的保证，也不清楚它们的选择是否提取了足够的信息。
我们试图通过对变分自动编码器（VAE）进行案例研究来改变这一点。首先，我们引入似然路径（LPath）原理，推广似然原理。这将信息性汇总统计的搜索范围缩小到 VAE 条件似然的最小充分统计。其次，引入新的理论工具，例如近本质支持、本质距离和共同Lipschitzness，我们获得了对最小充分统计量的某些精炼的非渐进可证明的OOD检测保证。相应的 LPath 算法展示了 SOTA 性能，即使使用似然估计较差的简单且小型 VAE。据我们所知，这是第一个可证明的无监督 OOD 方法，它提供了出色的实证结果，优于任何其他基于 VAE 的技术。我们使用与 \cite{xiao2020likelihood} 相同的模型，开源自：https://github.com/XavierXiao/Likelihood-Regret

anomaly

Title: Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics. (arXiv:2401.04942v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.04942
Code URL: null
Copy Paste: [[2401.04942]] Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics(http://arxiv.org/abs/2401.04942)
Summary:
In the past several years, road anomaly segmentation has been actively explored in academia and has drawn growing attention in industry. The rationale is straightforward: if the autonomous car can brake before hitting an anomalous object, safety is promoted. However, this rationale naturally calls for a temporally informed setting, while existing methods and benchmarks are designed in an unrealistic frame-wise manner. To bridge this gap, we contribute the first video anomaly segmentation dataset for autonomous driving. Since placing various anomalous objects on busy roads and annotating them in every frame is dangerous and expensive, we resort to synthetic data. To improve the relevance of this synthetic dataset to real-world applications, we train a generative adversarial network conditioned on rendering G-buffers for photorealism enhancement. Our dataset consists of 120,000 high-resolution frames at a 60 FPS framerate, as recorded in 7 different towns. As an initial benchmark, we provide baselines using the latest supervised and unsupervised road anomaly segmentation methods. Apart from conventional metrics, we focus on two new ones: temporal consistency and latency-aware streaming accuracy. We believe the latter is valuable as it measures whether an anomaly segmentation algorithm can truly prevent a car from crashing in a temporally informed setting.
摘要：
过去几年，道路异常分割在学术界得到了积极探索，并越来越受到业界的关注。背后的原理很简单：如果自动驾驶汽车能够在撞到异常物体之前刹车，那么安全性就会得到提升。然而，这个基本原理自然需要一个具备时间信息的设置，而现有的方法和基准是以不切实际的逐帧方式设计的。为了弥补这一差距，我们贡献了第一个用于自动驾驶的视频异常分割数据集。由于将各种异常物体放置在繁忙的道路上并在每一帧中对其进行注释既危险又昂贵，因此我们求助于合成数据。为了提高该合成数据集与现实世界应用的相关性，我们训练了一个以渲染 G 缓冲区为条件的生成对抗网络，以增强照片真实感。我们的数据集由 7 个不同城镇记录的 120,000 个 60 FPS 帧速率的高分辨率帧组成。作为初始基准测试，我们使用最新的监督和无监督道路异常分割方法提供基线。除了传统指标之外，我们还关注两个新指标：时间一致性和延迟感知流准确性。我们认为后者很有价值，因为它衡量异常分割算法是否能够真正防止汽车在具备时间信息的环境中发生碰撞。

Title: LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection. (arXiv:2401.04749v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.04749
Code URL: null
Copy Paste: [[2401.04749]] LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection(http://arxiv.org/abs/2401.04749)
Summary:
Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps). Given log data from different domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios. However, previous deep models merely focused on extracting the semantics of log sequences in the same domain, leading to poor generalization on multi-domain logs. To alleviate this issue, we propose a unified Transformer-based framework for log anomaly detection (LogFormer) to improve the generalization ability across different domains, where we establish a two-stage process consisting of a pre-training stage and an adapter-based tuning stage. Specifically, our model is first pre-trained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer such knowledge to the target domain via shared parameters. In addition, the Log-Attention module is proposed to supplement the information ignored by log parsing. The proposed method is evaluated on three public datasets and one real-world dataset. Experimental results on multiple benchmarks demonstrate the effectiveness of our LogFormer with fewer trainable parameters and lower training costs.
摘要：
日志异常检测是 IT 运营人工智能 (AIOps) 领域的关键组成部分。考虑到不同域的日志数据，在实际工业场景中，针对未知域重新训练整个网络效率很低。然而，之前的深度模型仅仅关注于提取同一域内日志序列的语义，导致对多域日志的泛化能力较差。为了缓解这个问题，我们提出了一个基于 Transformer 的统一日志异常检测框架（LogFormer），以提高跨不同领域的泛化能力，其中我们建立了一个两阶段过程，包括预训练和基于适配器的调整阶段。具体来说，我们的模型首先在源域上进行预训练，以获得日志数据的共享语义知识。然后，我们通过共享参数将这些知识转移到目标域。此外，还提出了Log-Attention模块来补充日志解析（log parsing）忽略的信息。所提出的方法在三个公共数据集和一个真实世界数据集上进行评估。多个基准的实验结果证明了我们的 LogFormer 的有效性，可训练参数更少，训练成本更低。

in-context

Title: Leveraging Print Debugging to Improve Code Generation in Large Language Models. (arXiv:2401.05319v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.05319
Code URL: null
Copy Paste: [[2401.05319]] Leveraging Print Debugging to Improve Code Generation in Large Language Models(http://arxiv.org/abs/2401.05319)
Summary:
Large language models (LLMs) have made significant progress in code generation tasks, but their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. To address this issue, we propose an in-context learning approach that guides LLMs to debug by using a "print debugging" method, which involves inserting print statements to trace execution and analysing the resulting logs to fix the bug. We collect a Leetcode problem dataset and evaluate our method using the Leetcode online judging system. Experiments with GPT-4 demonstrate the effectiveness of our approach, outperforming rubber duck debugging in easy and medium-level Leetcode problems by 1.5% and 17.9%, respectively.
摘要：
大型语言模型 (LLM) 在代码生成任务方面取得了重大进展，但它们在处理复杂数据结构和算法的编程问题方面的性能仍然不够理想。为了解决这个问题，我们提出了一种上下文学习方法，引导 LLM 使用“打印调试”方法进行调试，其中包括插入打印语句来跟踪和分析日志以修复错误。我们收集 Leetcode 问题数据集并使用 Leetcode 在线评审系统评估我们的方法。GPT-4 的实验证明了我们方法的有效性，在简单和中等水平的 Leetcode 问题上，其性能比橡皮鸭调试分别高出 1.5% 和 17.9%。
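
The "print debugging" loop in the summary (insert print statements, run, read the log) can be sketched as follows. The abstract does not give the prompt format or the pipeline, so this only illustrates the tracing idea on a hypothetical buggy function; `buggy_max_subarray` and `run_with_trace` are names invented for this sketch.

```python
import contextlib
import io


def buggy_max_subarray(nums):
    """Maximum subarray sum with a deliberate bug and inserted trace prints."""
    best = 0  # BUG: should start at nums[0]; wrong for all-negative input
    current = 0
    for i, x in enumerate(nums):
        current = max(x, current + x)
        best = max(best, current)
        # Inserted print statement: exposes the state after every step.
        print(f"i={i} x={x} current={current} best={best}")
    return best


def run_with_trace(func, *args):
    """Run func and capture everything it prints as a list of log lines."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = func(*args)
    return result, buf.getvalue().splitlines()


if __name__ == "__main__":
    result, log = run_with_trace(buggy_max_subarray, [-3, -1, -2])
    print("result:", result)  # the correct answer would be -1
    for line in log:
        print(line)
```

In the captured log, `best` stays at 0 while `current` is negative on every step, which points at the initial value of `best` as the fault. A log of this kind is what would be handed back to the model, together with the failing test case, when asking for a fix.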