2024-01-03

diffusion

Title: FlashVideo: A Framework for Swift Inference in Text-to-Video Generation. (arXiv:2401.00869v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.00869
Code URL: null
Copy Paste: [[2401.00869]] FlashVideo: A Framework for Swift Inference in Text-to-Video Generation(http://arxiv.org/abs/2401.00869)
Summary:
In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive-based transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed. Additionally, we adopt a redundant-free frame interpolation method, enhancing the efficiency of frame interpolation. Our comprehensive experiments demonstrate that FlashVideo achieves a $\times9.17$ efficiency improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.

Title: TrailBlazer: Trajectory Control for Diffusion-Based Video Generation. (arXiv:2401.00896v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.00896
Code URL: null
Copy Paste: [[2401.00896]] TrailBlazer: Trajectory Control for Diffusion-Based Video Generation(http://arxiv.org/abs/2401.00896)
Summary:
Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is constructed upon a pre-trained (T2V) model, and easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.

Title: Fast Inference Through The Reuse Of Attention Maps In Diffusion Models. (arXiv:2401.01008v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01008
Code URL: null
Copy Paste: [[2401.01008]] Fast Inference Through The Reuse Of Attention Maps In Diffusion Models(http://arxiv.org/abs/2401.01008)
Summary:
Text-to-image diffusion models have demonstrated unprecedented abilities at flexible and realistic image synthesis. However, the iterative process required to produce a single image is costly and incurs a high latency, prompting researchers to further investigate its efficiency. Typically, improvements in latency have been achieved in two ways: (1) training smaller models through knowledge distillation (KD); and (2) adopting techniques from ODE-theory to facilitate larger step sizes. In contrast, we propose a training-free approach that does not alter the step-size of the sampler. Specifically, we find the repeated calculation of attention maps to be both costly and redundant; therefore, we propose a structured reuse of attention maps during sampling. Our initial reuse policy is motivated by rudimentary ODE-theory, which suggests that reuse is most suitable late in the sampling procedure. After noting a number of limitations in this theoretical approach, we empirically search for a better policy. Unlike methods that rely on KD, our reuse policies can easily be adapted to a variety of setups in a plug-and-play manner. Furthermore, when applied to Stable Diffusion-1.5, our reuse policies reduce latency with minimal repercussions on sample quality.

Title: Robust single-particle cryo-EM image denoising and restoration. (arXiv:2401.01097v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01097
Code URL: null
Copy Paste: [[2401.01097]] Robust single-particle cryo-EM image denoising and restoration(http://arxiv.org/abs/2401.01097)
Summary:
Cryo-electron microscopy (cryo-EM) has achieved near-atomic level resolution of biomolecules by reconstructing 2D micrographs. However, the resolution and accuracy of the reconstructed particles are significantly reduced due to the extremely low signal-to-noise ratio (SNR) and complex noise structure of cryo-EM images. In this paper, we introduce a diffusion model with post-processing framework to effectively denoise and restore single particle cryo-EM images. Our method outperforms the state-of-the-art (SOTA) denoising methods by effectively removing structural noise that has not been addressed before. Additionally, more accurate and high-resolution three-dimensional reconstruction structures can be obtained from denoised cryo-EM images.

Title: Joint Generative Modeling of Scene Graphs and Images via Diffusion Models. (arXiv:2401.01130v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01130
Code URL: null
Copy Paste: [[2401.01130]] Joint Generative Modeling of Scene Graphs and Images via Diffusion Models(http://arxiv.org/abs/2401.01130)
Summary:
In this paper, we present a novel generative task: joint scene graph - image generation. While previous works have explored image generation conditioned on scene graphs or layouts, our task is distinctive and important as it involves generating scene graphs themselves unconditionally from noise, enabling efficient and interpretable control for image generation. Our task is challenging, requiring the generation of plausible scene graphs with heterogeneous attributes for nodes (objects) and edges (relations among objects), including continuous object bounding boxes and discrete object and relation categories. We introduce a novel diffusion model, DiffuseSG, that jointly models the adjacency matrix along with heterogeneous node and edge attributes. We explore various types of encodings for the categorical data, relaxing it into a continuous space. With a graph transformer being the denoiser, DiffuseSG successively denoises the scene graph representation in a continuous space and discretizes the final representation to generate the clean scene graph. Additionally, we introduce an IoU regularization to enhance the empirical performance. Our model significantly outperforms existing methods in scene graph generation on the Visual Genome and COCO-Stuff datasets, both on standard and newly introduced metrics that better capture the problem complexity. Moreover, we demonstrate the additional benefits of our model in two downstream applications: 1) excelling in a series of scene graph completion tasks, and 2) improving scene graph detection models by using extra training samples generated from DiffuseSG.

Title: Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation. (arXiv:2401.01207v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01207
Code URL: null
Copy Paste: [[2401.01207]] Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation(http://arxiv.org/abs/2401.01207)
Summary:
In human-centric content generation, the pre-trained text-to-image models struggle to produce user-wanted portrait images, which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework, capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneously face swapping and reenactment. Due to the entanglement of identity and expression, it's nontrivial to separately and precisely control them in one framework, thus has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including balancing identity and expression encoder, improved midpoint sampling, and explicitly background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework, in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.

Title: VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM. (arXiv:2401.01256v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01256
Code URL: null
Copy Paste: [[2401.01256]] VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM(http://arxiv.org/abs/2401.01256)
Summary:
The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to generate multi-scene videos nevertheless is not trivial and necessitates to nicely manage the logic in between while preserving the consistent visual appearance of key content across video scenes. In this paper, we propose a novel framework, namely VideoDrafter, for content-consistent multi-scene video generation. Technically, VideoDrafter leverages Large Language Models (LLM) to convert the input prompt into comprehensive multi-scene script that benefits from the logical knowledge learnt by LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as camera movement. VideoDrafter identifies the common entities throughout the script and asks LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoDrafter outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event and camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoDrafter outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.

self-supervised

Title: PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance Fields. (arXiv:2401.00871v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.00871
Code URL: null
Copy Paste: [[2401.00871]] PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance Fields(http://arxiv.org/abs/2401.00871)
Summary:
Identifying spatially complete planar primitives from visual data is a crucial task in computer vision. Prior methods are largely restricted to either 2D segment recovery or simplifying 3D structures, even with extensive plane annotations. We present PlanarNeRF, a novel framework capable of detecting dense 3D planes through online learning. Drawing upon the neural field representation, PlanarNeRF brings three major contributions. First, it enhances 3D plane detection with concurrent appearance and geometry knowledge. Second, a lightweight plane fitting module is proposed to estimate plane parameters. Third, a novel global memory bank structure with an update mechanism is introduced, ensuring consistent cross-frame correspondence. The flexible architecture of PlanarNeRF allows it to function in both 2D-supervised and self-supervised solutions, in each of which it can effectively learn from sparse training signals, significantly improving training efficiency. Through extensive experiments, we demonstrate the effectiveness of PlanarNeRF in various scenarios and remarkable improvement over existing works.

Title: A Bayesian Unification of Self-Supervised Clustering and Energy-Based Models. (arXiv:2401.00873v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.00873
Code URL: null
Copy Paste: [[2401.00873]] A Bayesian Unification of Self-Supervised Clustering and Energy-Based Models(http://arxiv.org/abs/2401.00873)
Summary:
Self-supervised learning is a popular and powerful method for utilizing large amounts of unlabeled data, for which a wide variety of training objectives have been proposed in the literature. In this study, we perform a Bayesian analysis of state-of-the-art self-supervised learning objectives, elucidating the underlying probabilistic graphical models in each class and presenting a standardized methodology for their derivation from first principles. The analysis also indicates a natural means of integrating self-supervised learning with likelihood-based generative models. We instantiate this concept within the realm of cluster-based self-supervised learning and energy models, introducing a novel lower bound which is proven to reliably penalize the most important failure modes. Furthermore, this newly proposed lower bound enables the training of a standard backbone architecture without the necessity for asymmetric elements such as stop gradients, momentum encoders, or specialized clustering layers - typically introduced to avoid learning trivial solutions. Our theoretical findings are substantiated through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, thus showing that our objective function allows to outperform existing self-supervised learning strategies in terms of clustering, generation and out-of-distribution detection performance by a wide margin. We also demonstrate that GEDI can be integrated into a neural-symbolic framework to mitigate the reasoning shortcut problem and to learn higher quality symbolic representations thanks to the enhanced classification performance.

Title: Masked Modeling for Self-supervised Representation Learning on Vision and Beyond. (arXiv:2401.00897v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.00897
Code URL: https://github.com/lupin1998/awesome-mim
Copy Paste: [[2401.00897]] Masked Modeling for Self-supervised Representation Learning on Vision and Beyond(http://arxiv.org/abs/2401.00897)
Summary:
As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and the low dependence on labeled data. Among these varied self-supervised techniques, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training. This paradigm enables deep models to learn robust representations and has demonstrated exceptional performance in the context of computer vision, natural language processing, and other modalities. In this survey, we present a comprehensive review of the masked modeling framework and its methodology. We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more. Then, we systematically investigate its wide-ranging applications across domains. Furthermore, we also explore the commonalities and differences between masked modeling methods in different fields. Toward the end of this paper, we conclude by discussing the limitations of current techniques and point out several potential avenues for advancing masked modeling research. A paper list project with this survey is available at \url{https://github.com/Lupin1998/Awesome-MIM}.

Title: Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence. (arXiv:2401.00921v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.00921
Code URL: https://github.com/ruizhuo-xu/skeleton2vec
Copy Paste: [[2401.00921]] Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence(http://arxiv.org/abs/2401.00921)
Summary:
Self-supervised pre-training paradigms have been extensively explored in the field of skeleton-based action recognition. In particular, methods based on masked prediction have pushed the performance of pre-training to a new height. However, these methods take low-level features, such as raw joint coordinates or temporal motion, as prediction targets for the masked regions, which is suboptimal. In this paper, we show that using high-level contextualized features as prediction targets can achieve superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework, which utilizes a transformer-based teacher encoder taking unmasked training samples as input to create latent contextualized representations as prediction targets. Benefiting from the self-attention mechanism, the latent representations generated by the teacher encoder can incorporate the global context of the entire training samples, leading to a richer training task. Additionally, considering the high temporal correlations in skeleton sequences, we propose a motion-aware tube masking strategy which divides the skeleton sequence into several tubes and performs persistent masking within each tube based on motion priors, thus forcing the model to build long-range spatio-temporal connections and focus on action-semantic richer regions. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets demonstrate that our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.

Title: Relating Events and Frames Based on Self-Supervised Learning and Uncorrelated Conditioning for Unsupervised Domain Adaptation. (arXiv:2401.01042v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01042
Code URL: null
Copy Paste: [[2401.01042]] Relating Events and Frames Based on Self-Supervised Learning and Uncorrelated Conditioning for Unsupervised Domain Adaptation(http://arxiv.org/abs/2401.01042)
Summary:
Event-based cameras provide accurate and high temporal resolution measurements for performing computer vision tasks in challenging scenarios, such as high-dynamic range environments and fast-motion maneuvers. Despite their advantages, utilizing deep learning for event-based vision encounters a significant obstacle due to the scarcity of annotated data caused by the relatively recent emergence of event-based cameras. To overcome this limitation, leveraging the knowledge available from annotated data obtained with conventional frame-based cameras presents an effective solution based on unsupervised domain adaptation. We propose a new algorithm tailored for adapting a deep neural network trained on annotated frame-based data to generalize well on event-based unannotated data. Our approach incorporates uncorrelated conditioning and self-supervised learning in an adversarial learning scheme to close the gap between the two source and target domains. By applying self-supervised learning, the algorithm learns to align the representations of event-based data with those from frame-based camera data, thereby facilitating knowledge transfer.Furthermore, the inclusion of uncorrelated conditioning ensures that the adapted model effectively distinguishes between event-based and conventional data, enhancing its ability to classify event-based images accurately.Through empirical experimentation and evaluation, we demonstrate that our algorithm surpasses existing approaches designed for the same purpose using two benchmarks. The superior performance of our solution is attributed to its ability to effectively utilize annotated data from frame-based cameras and transfer the acquired knowledge to the event-based vision domain.

Title: Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training. (arXiv:2401.01179v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01179
Code URL: null
Copy Paste: [[2401.01179]] Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training(http://arxiv.org/abs/2401.01179)
Summary:
Modern healthcare often utilises radiographic images alongside textual reports for diagnostics, encouraging the use of Vision-Language Self-Supervised Learning (VL-SSL) with large pre-trained models to learn versatile medical vision representations. However, most existing VL-SSL frameworks are trained end-to-end, which is computation-heavy and can lose vital prior information embedded in pre-trained encoders. To address both issues, we introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen, and employs a lightweight Adaptor module for cross-modal learning. Experiments on medical image classification and segmentation tasks across three datasets reveal that our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches. Notably, when fine-tuned with just 1% of data, Adaptor outperforms several Transformer-based methods trained on full datasets in medical image segmentation.

Title: Boosting Transformer's Robustness and Efficacy in PPG Signal Artifact Detection with Self-Supervised Learning. (arXiv:2401.01013v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.01013
Code URL: null
Copy Paste: [[2401.01013]] Boosting Transformer's Robustness and Efficacy in PPG Signal Artifact Detection with Self-Supervised Learning(http://arxiv.org/abs/2401.01013)
Summary:
Recent research at CHU Sainte Justine's Pediatric Critical Care Unit (PICU) has revealed that traditional machine learning methods, such as semi-supervised label propagation and K-nearest neighbors, outperform Transformer-based models in artifact detection from PPG signals, mainly when data is limited. This study addresses the underutilization of abundant unlabeled data by employing self-supervised learning (SSL) to extract latent features from these data, followed by fine-tuning on labeled data. Our experiments demonstrate that SSL significantly enhances the Transformer model's ability to learn representations, improving its robustness in artifact classification tasks. Among various SSL techniques, including masking, contrastive learning, and DINO (self-distillation with no labels)-contrastive learning exhibited the most stable and superior performance in small PPG datasets. Further, we delve into optimizing contrastive loss functions, which are crucial for contrastive SSL. Inspired by InfoNCE, we introduce a novel contrastive loss function that facilitates smoother training and better convergence, thereby enhancing performance in artifact classification. In summary, this study establishes the efficacy of SSL in leveraging unlabeled data, particularly in enhancing the capabilities of the Transformer model. This approach holds promise for broader applications in PICU environments, where annotated data is often limited.

Title: Deep-ELA: Deep Exploratory Landscape Analysis with Self-Supervised Pretrained Transformers for Single- and Multi-Objective Continuous Optimization Problems. (arXiv:2401.01192v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.01192
Code URL: null
Copy Paste: [[2401.01192]] Deep-ELA: Deep Exploratory Landscape Analysis with Self-Supervised Pretrained Transformers for Single- and Multi-Objective Continuous Optimization Problems(http://arxiv.org/abs/2401.01192)
Summary:
In many recent works, the potential of Exploratory Landscape Analysis (ELA) features to numerically characterize, in particular, single-objective continuous optimization problems has been demonstrated. These numerical features provide the input for all kinds of machine learning tasks on continuous optimization problems, ranging, i.a., from High-level Property Prediction to Automated Algorithm Selection and Automated Algorithm Configuration. Without ELA features, analyzing and understanding the characteristics of single-objective continuous optimization problems would be impossible.

Yet, despite their undisputed usefulness, ELA features suffer from several drawbacks. These include, in particular, (1.) a strong correlation between multiple features, as well as (2.) its very limited applicability to multi-objective continuous optimization problems. As a remedy, recent works proposed deep learning-based approaches as alternatives to ELA. In these works, e.g., point-cloud transformers were used to characterize an optimization problem's fitness landscape. However, these approaches require a large amount of labeled training data.

Within this work, we propose a hybrid approach, Deep-ELA, which combines (the benefits of) deep learning and ELA features. Specifically, we pre-trained four transformers on millions of randomly generated optimization problems to learn deep representations of the landscapes of continuous single- and multi-objective optimization problems. Our proposed framework can either be used out-of-the-box for analyzing single- and multi-objective continuous optimization problems, or subsequently fine-tuned to various tasks focussing on algorithm behavior and problem understanding.

foundation model

generative

Title: Dual Teacher Knowledge Distillation with Domain Alignment for Face Anti-spoofing. (arXiv:2401.01102v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01102
Code URL: null
Copy Paste: [[2401.01102]] Dual Teacher Knowledge Distillation with Domain Alignment for Face Anti-spoofing(http://arxiv.org/abs/2401.01102)
Summary:
Face recognition systems have raised concerns due to their vulnerability to different presentation attacks, and system security has become an increasingly critical concern. Although many face anti-spoofing (FAS) methods perform well in intra-dataset scenarios, their generalization remains a challenge. To address this issue, some methods adopt domain adversarial training (DAT) to extract domain-invariant features. However, the competition between the encoder and the domain discriminator can cause the network to be difficult to train and converge. In this paper, we propose a domain adversarial attack (DAA) method to mitigate the training instability problem by adding perturbations to the input images, which makes them indistinguishable across domains and enables domain alignment. Moreover, since models trained on limited data and types of attacks cannot generalize well to unknown attacks, we propose a dual perceptual and generative knowledge distillation framework for face anti-spoofing that utilizes pre-trained face-related models containing rich face priors. Specifically, we adopt two different face-related models as teachers to transfer knowledge to the target student model. The pre-trained teacher models are not from the task of face anti-spoofing but from perceptual and generative tasks, respectively, which implicitly augment the data. By combining both DAA and dual-teacher knowledge distillation, we develop a dual teacher knowledge distillation with domain alignment framework (DTDA) for face anti-spoofing. The advantage of our proposed method has been verified through extensive ablation studies and comparison with state-of-the-art methods on public datasets across multiple protocols.

Title: En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data. (arXiv:2401.01173v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01173
Code URL: null
Copy Paste: [[2401.01173]] En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data(http://arxiv.org/abs/2401.01173)
Summary:
We present En3D, an enhanced generative scheme for sculpting high-quality 3D human avatars. Unlike previous works that rely on scarce 3D datasets or limited 2D collections with imbalanced viewing angles and imprecise pose priors, our approach aims to develop a zero-shot 3D generative scheme capable of producing visually realistic, geometrically accurate and content-wise diverse 3D humans without relying on pre-existing 3D or 2D assets. To address this challenge, we introduce a meticulously crafted workflow that implements accurate physical modeling to learn the enhanced 3D generative model from synthetic 2D data. During inference, we integrate optimization modules to bridge the gap between realistic appearances and coarse 3D shapes. Specifically, En3D comprises three modules: a 3D generator that accurately models generalizable 3D humans with realistic appearance from synthesized balanced, diverse, and structured human images; a geometry sculptor that enhances shape quality using multi-view normal constraints for intricate human anatomy; and a texturing module that disentangles explicit texture maps with fidelity and editability, leveraging semantical UV partitioning and a differentiable rasterizer. Experimental results show that our approach significantly outperforms prior works in terms of image quality, geometry accuracy and content diversity. We also showcase the applicability of our generated avatars for animation and editing, as well as the scalability of our approach for content-style free adaptation.

Title: MOC-RVQ: Multilevel Codebook-assisted Digital Generative Semantic Communication. (arXiv:2401.01272v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01272
Code URL: null
Copy Paste: [[2401.01272]] MOC-RVQ: Multilevel Codebook-assisted Digital Generative Semantic Communication(http://arxiv.org/abs/2401.01272)
Summary:
Vector quantization-based image semantic communication systems have successfully boosted transmission efficiency, but face a challenge with conflicting requirements between codebook design and digital constellation modulation. Traditional codebooks need a wide index range, while modulation favors few discrete states. To address this, we propose a multilevel generative semantic communication system with a two-stage training framework. In the first stage, we train a high-quality codebook, using a multi-head octonary codebook (MOC) to compress the index range. We also integrate a residual vector quantization (RVQ) mechanism for effective multilevel communication. In the second stage, a noise reduction block (NRB) based on Swin Transformer is introduced, coupled with the multilevel codebook from the first stage, serving as a high-quality semantic knowledge base (SKB) for generative feature restoration. Experimental results highlight MOC-RVQ's superior performance over methods like BPG or JPEG, even without channel error correction coding.

Title: DocLLM: A layout-aware generative language model for multimodal document understanding. (arXiv:2401.00908v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.00908
Code URL: null
Copy Paste: [[2401.00908]] DocLLM: A layout-aware generative language model for multimodal document understanding(http://arxiv.org/abs/2401.00908)
Summary:
Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers to a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.

Title: CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation. (arXiv:2401.01275v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.01275
Code URL: null
Copy Paste: [[2401.01275]] CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation(http://arxiv.org/abs/2401.01275)
Summary:
Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.

Title: An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction. (arXiv:2401.01326v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.01326
Code URL: null
Copy Paste: [[2401.01326]] An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction(http://arxiv.org/abs/2401.01326)
Summary:
In this paper, we propose a novel method for joint entity and relation extraction from unstructured text by framing it as a conditional sequence generation problem. In contrast to conventional generative information extraction models that are left-to-right token-level generators, our approach is \textit{span-based}. It generates a linearized graph where nodes represent text spans and edges represent relation triplets. Our method employs a transformer encoder-decoder architecture with pointing mechanism on a dynamic vocabulary of spans and relation types. Our model can capture the structural characteristics and boundaries of entities and relations through span representations while simultaneously grounding the generated output in the original text thanks to the pointing mechanism. Evaluation on benchmark datasets validates the effectiveness of our approach, demonstrating competitive results. Code is available at https://github.com/urchade/ATG.

Title: Improve Fidelity and Utility of Synthetic Credit Card Transaction Time Series from Data-centric Perspective. (arXiv:2401.00965v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.00965
Code URL: null
Copy Paste: [[2401.00965]] Improve Fidelity and Utility of Synthetic Credit Card Transaction Time Series from Data-centric Perspective(http://arxiv.org/abs/2401.00965)
Summary:
Exploring generative model training for synthetic tabular data, specifically in sequential contexts such as credit card transaction data, presents significant challenges. This paper addresses these challenges, focusing on attaining both high fidelity to actual data and optimal utility for machine learning tasks. We introduce five pre-processing schemas to enhance the training of the Conditional Probabilistic Auto-Regressive Model (CPAR), demonstrating incremental improvements in the synthetic data's fidelity and utility. Upon achieving satisfactory fidelity levels, our attention shifts to training fraud detection models tailored for time-series data, evaluating the utility of the synthetic data. Our findings offer valuable insights and practical guidelines for synthetic data practitioners in the finance sector, transitioning from real to synthetic datasets for training purposes, and illuminating broader methodologies for synthesizing credit card transaction time series.

Title: Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models. (arXiv:2401.00974v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.00974
Code URL: null
Copy Paste: [[2401.00974]] Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models(http://arxiv.org/abs/2401.00974)
Summary:
Devising procedures for downstream task-oriented generative model selections is an unresolved problem of practical importance. Existing studies focused on the utility of a single family of generative models. They provided limited insights on how synthetic data practitioners select the best family generative models for synthetic training tasks given a specific combination of machine learning model class and performance metric. In this paper, we approach the downstream task-oriented generative model selections problem in the case of training fraud detection models and investigate the best practice given different combinations of model interpretability and model performance constraints. Our investigation supports that, while both Neural Network(NN)-based and Bayesian Network(BN)-based generative models are both good to complete synthetic training task under loose model interpretability constrain, the BN-based generative models is better than NN-based when synthetic training fraud detection model under strict model interpretability constrain. Our results provides practical guidance for machine learning practitioner who is interested in replacing their training dataset from real to synthetic, and shed lights on more general downstream task-oriented generative model selection problems.

Title: Motif-aware Riemannian Graph Neural Network with Generative-Contrastive Learning. (arXiv:2401.01232v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.01232
Code URL: https://github.com/riemanngraph/motifrgc
Copy Paste: [[2401.01232]] Motif-aware Riemannian Graph Neural Network with Generative-Contrastive Learning(http://arxiv.org/abs/2401.01232)
Summary:
Graphs are typical non-Euclidean data of complex structures. In recent years, Riemannian graph representation learning has emerged as an exciting alternative to Euclidean ones. However, Riemannian methods are still in an early stage: most of them present a single curvature (radius) regardless of structural complexity, suffer from numerical instability due to the exponential/logarithmic map, and lack the ability to capture motif regularity. In light of the issues above, we propose the problem of \emph{Motif-aware Riemannian Graph Representation Learning}, seeking a numerically stable encoder to capture motif regularity in a diverse-curvature manifold without labels. To this end, we present a novel Motif-aware Riemannian model with Generative-Contrastive learning (MotifRGC), which conducts a minmax game in Riemannian manifold in a self-supervised manner. First, we propose a new type of Riemannian GCN (D-GCN), in which we construct a diverse-curvature manifold by a product layer with the diversified factor, and replace the exponential/logarithmic map by a stable kernel layer. Second, we introduce a motif-aware Riemannian generative-contrastive learning to capture motif regularity in the constructed manifold and learn motif-aware node representation without external labels. Empirical results show the superiority of MofitRGC.

anomaly

Title: Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt. (arXiv:2401.01010v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01010
Code URL: https://github.com/shirowalker/ucad
Copy Paste: [[2401.01010]] Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt(http://arxiv.org/abs/2401.01010)
Summary:
Unsupervised Anomaly Detection (UAD) with incremental training is crucial in industrial manufacturing, as unpredictable defects make obtaining sufficient labeled data infeasible. However, continual learning methods primarily rely on supervised annotations, while the application in UAD is limited due to the absence of supervision. Current UAD methods train separate models for different classes sequentially, leading to catastrophic forgetting and a heavy computational burden. To address this issue, we introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD, which equips the UAD with continual learning capability through contrastively-learned prompts. In the proposed UCAD, we design a Continual Prompting Module (CPM) by utilizing a concise key-prompt-knowledge memory bank to guide task-invariant `anomaly' model predictions using task-specific `normal' knowledge. Moreover, Structure-based Contrastive Learning (SCL) is designed with the Segment Anything Model (SAM) to improve prompt learning and anomaly segmentation results. Specifically, by treating SAM's masks as structure, we draw features within the same mask closer and push others apart for general feature representations. We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation, demonstrating that our method is significantly better than anomaly detection methods, even with rehearsal training. The code will be available at https://github.com/shirowalker/UCAD.

Title: Exploring Hyperspectral Anomaly Detection with Human Vision: A Small Target Aware Detector. (arXiv:2401.01093v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01093
Code URL: null
Copy Paste: [[2401.01093]] Exploring Hyperspectral Anomaly Detection with Human Vision: A Small Target Aware Detector(http://arxiv.org/abs/2401.01093)
Summary:
Hyperspectral anomaly detection (HAD) aims to localize pixel points whose spectral features differ from the background. HAD is essential in scenarios of unknown or camouflaged target features, such as water quality monitoring, crop growth monitoring and camouflaged target detection, where prior information of targets is difficult to obtain. Existing HAD methods aim to objectively detect and distinguish background and anomalous spectra, which can be achieved almost effortlessly by human perception. However, the underlying processes of human visual perception are thought to be quite complex. In this paper, we analyze hyperspectral image (HSI) features under human visual perception, and transfer the solution process of HAD to the more robust feature space for the first time. Specifically, we propose a small target aware detector (STAD), which introduces saliency maps to capture HSI features closer to human visual perception. STAD not only extracts more anomalous representations, but also reduces the impact of low-confidence regions through a proposed small target filter (STF). Furthermore, considering the possibility of HAD algorithms being applied to edge devices, we propose a full connected network to convolutional network knowledge distillation strategy. It can learn the spectral and spatial features of the HSI while lightening the network. We train the network on the HAD100 training set and validate the proposed method on the HAD100 test set. Our method provides a new solution space for HAD that is closer to human visual perception with high confidence. Sufficient experiments on real HSI with multiple method comparisons demonstrate the excellent performance and unique potential of the proposed method. The code is available at https://github.com/majitao-xd/STAD-HAD.

Title: Whole-examination AI estimation of fetal biometrics from 20-week ultrasound scans. (arXiv:2401.01201v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.01201
Code URL: null
Copy Paste: [[2401.01201]] Whole-examination AI estimation of fetal biometrics from 20-week ultrasound scans(http://arxiv.org/abs/2401.01201)
Summary:
The current approach to fetal anomaly screening is based on biometric measurements derived from individually selected ultrasound images. In this paper, we introduce a paradigm shift that attains human-level performance in biometric measurement by aggregating automatically extracted biometrics from every frame across an entire scan, with no need for operator intervention. We use a convolutional neural network to classify each frame of an ultrasound video recording. We then measure fetal biometrics in every frame where appropriate anatomy is visible. We use a Bayesian method to estimate the true value of each biometric from a large number of measurements and probabilistically reject outliers. We performed a retrospective experiment on 1457 recordings (comprising 48 million frames) of 20-week ultrasound scans, estimated fetal biometrics in those scans and compared our estimates to the measurements sonographers took during the scan. Our method achieves human-level performance in estimating fetal biometrics and estimates well-calibrated credible intervals in which the true biometric value is expected to lie.

2024-01-03

diffusion

Title: FlashVideo: A Framework for Swift Inference in Text-to-Video Generation. (arXiv:2401.00869v1 [cs.CV])

Title: TrailBlazer: Trajectory Control for Diffusion-Based Video Generation. (arXiv:2401.00896v1 [cs.CV])

Title: Fast Inference Through The Reuse Of Attention Maps In Diffusion Models. (arXiv:2401.01008v1 [cs.CV])

Title: Robust single-particle cryo-EM image denoising and restoration. (arXiv:2401.01097v1 [cs.CV])

Title: Joint Generative Modeling of Scene Graphs and Images via Diffusion Models. (arXiv:2401.01130v1 [cs.CV])

Title: Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation. (arXiv:2401.01207v1 [cs.CV])

Title: VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM. (arXiv:2401.01256v1 [cs.CV])

self-supervised

Title: PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance Fields. (arXiv:2401.00871v1 [cs.CV])

Title: A Bayesian Unification of Self-Supervised Clustering and Energy-Based Models. (arXiv:2401.00873v1 [cs.LG])

Title: Masked Modeling for Self-supervised Representation Learning on Vision and Beyond. (arXiv:2401.00897v1 [cs.CV])

Title: Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence. (arXiv:2401.00921v1 [cs.CV])

Title: Relating Events and Frames Based on Self-Supervised Learning and Uncorrelated Conditioning for Unsupervised Domain Adaptation. (arXiv:2401.01042v1 [cs.CV])

Title: Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training. (arXiv:2401.01179v1 [cs.CV])

Title: Boosting Transformer's Robustness and Efficacy in PPG Signal Artifact Detection with Self-Supervised Learning. (arXiv:2401.01013v1 [cs.LG])

Title: Deep-ELA: Deep Exploratory Landscape Analysis with Self-Supervised Pretrained Transformers for Single- and Multi-Objective Continuous Optimization Problems. (arXiv:2401.01192v1 [cs.LG])

foundation model

generative

Title: Dual Teacher Knowledge Distillation with Domain Alignment for Face Anti-spoofing. (arXiv:2401.01102v1 [cs.CV])

Title: En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data. (arXiv:2401.01173v1 [cs.CV])

Title: MOC-RVQ: Multilevel Codebook-assisted Digital Generative Semantic Communication. (arXiv:2401.01272v1 [cs.CV])

Title: DocLLM: A layout-aware generative language model for multimodal document understanding. (arXiv:2401.00908v1 [cs.CL])

Title: CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation. (arXiv:2401.01275v1 [cs.CL])

Title: An Autoregressive Text-to-Graph Framework for Joint Entity and Relation Extraction. (arXiv:2401.01326v1 [cs.CL])

Title: Improve Fidelity and Utility of Synthetic Credit Card Transaction Time Series from Data-centric Perspective. (arXiv:2401.00965v1 [cs.LG])

Title: Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models. (arXiv:2401.00974v1 [cs.LG])

Title: Motif-aware Riemannian Graph Neural Network with Generative-Contrastive Learning. (arXiv:2401.01232v1 [cs.LG])

anomaly

Title: Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt. (arXiv:2401.01010v1 [cs.CV])

Title: Exploring Hyperspectral Anomaly Detection with Human Vision: A Small Target Aware Detector. (arXiv:2401.01093v1 [cs.CV])

Title: Whole-examination AI estimation of fetal biometrics from 20-week ultrasound scans. (arXiv:2401.01201v1 [cs.CV])

in-context