Copy Paste: [[2407.18288]] Leveraging Foundation Models via Knowledge Distillation in Multi-Object Tracking: Distilling DINOv2 Features to FairMOT(https://arxiv.org/abs/2407.18288)
Keywords: foundation model
Abstract: Multiple Object Tracking (MOT) is a computer vision task that has been employed in a variety of sectors. Some common limitations in MOT are varying object appearances, occlusions, or crowded scenes. To address these challenges, machine learning methods have been extensively deployed, leveraging large datasets, sophisticated models, and substantial computational resources. Due to practical limitations, access to the above is not always an option. However, with the recent release of foundation models by prominent AI companies, pretrained models have been trained on vast datasets and resources using state-of-the-art methods. This work tries to leverage one such foundation model, called DINOv2, through using knowledge distillation. The proposed method uses a teacher-student architecture, where DINOv2 is the teacher and the FairMOT backbone HRNetv2 W18 is the student. The results imply that although the proposed method shows improvements in certain scenarios, it does not consistently outperform the original FairMOT model. These findings highlight the potential and limitations of applying foundation models in knowledge
Title: MARINE: A Computer Vision Model for Detecting Rare Predator-Prey Interactions in Animal Videos
Copy Paste: [[2407.18289]] MARINE: A Computer Vision Model for Detecting Rare Predator-Prey Interactions in Animal Videos(https://arxiv.org/abs/2407.18289)
Keywords: foundation model
Abstract: Encounters between predator and prey play an essential role in ecosystems, but their rarity makes them difficult to detect in video recordings. Although advances in action recognition (AR) and temporal action detection (AD), especially transformer-based models and vision foundation models, have achieved high performance on human action datasets, animal videos remain relatively under-researched. This thesis addresses this gap by proposing the model MARINE, which utilizes motion-based frame selection designed for fast animal actions and DINOv2 feature extraction with a trainable classification head for action recognition. MARINE outperforms VideoMAE in identifying predator attacks in videos of fish, both on a small and specific coral reef dataset (81.53\% against 52.64\% accuracy), and on a subset of the more extensive Animal Kingdom dataset (94.86\% against 83.14\% accuracy). In a multi-label setting on a representative sample of Animal Kingdom, MARINE achieves 23.79\% mAP, positioning it mid-field among existing benchmarks. Furthermore, in an AD task on the coral reef dataset, MARINE achieves 80.78\% AP (against VideoMAE's 34.89\%) although at a lowered t-IoU threshold of 25\%. Therefore, despite room for improvement, MARINE offers an effective starter framework to apply to AR and AD tasks on animal recordings and thus contribute to the study of natural ecosystems.
Title: Generative AI like ChatGPT in Blockchain Federated Learning: use cases, opportunities and future
Copy Paste: [[2407.18358]] Generative AI like ChatGPT in Blockchain Federated Learning: use cases, opportunities and future(https://arxiv.org/abs/2407.18358)
Keywords: generative
Abstract: Federated learning has become a significant approach for training machine learning models using decentralized data without necessitating the sharing of this data. Recently, the incorporation of generative artificial intelligence (AI) methods has provided new possibilities for improving privacy, augmenting data, and customizing models. This research explores potential integrations of generative AI in federated learning, revealing various opportunities to enhance privacy, data efficiency, and model performance. It particularly emphasizes the importance of generative models like generative adversarial networks (GANs) and variational autoencoders (VAEs) in creating synthetic data that replicates the distribution of real data. Generating synthetic data helps federated learning address challenges related to limited data availability and supports robust model development. Additionally, we examine various applications of generative AI in federated learning that enable more personalized solutions.
Copy Paste: [[2407.18423]] HDL-GPT: High-Quality HDL is All You Need(https://arxiv.org/abs/2407.18423)
Keywords: generative
Abstract: This paper presents Hardware Description Language Generative Pre-trained Transformers (HDL-GPT), a novel approach that leverages the vast repository of open-source High Definition Language (HDL) codes to train superior quality large code models. The core premise of this paper is the hypothesis that high-quality HDL is all you need to create models with exceptional performance and broad zero-shot generalization abilities. The paper elucidates the methods employed for the curation and augmentation of large corpora from open-source HDL code, transforming highly variable quality data into high-quality data through careful prompting and context maintenance. We demonstrate that the careful selection, filtering, and augmentation of data across HDLs can yield powerful models that surpass current state-of-the-art models. We also explore the impact of different fine-tuning methods on the quality of results. We describe experimental results across a range of fine-tuned SOTA LLMs, substantiating our claims. We demonstrate improvements of 50% to 200% over SOTA HDL models on current benchmarks in tasks ranging from HDL circuit explanations, code generation, formal and simulation testbench creation, triaging bugs, and fixing them. HDL-GPT opens new avenues for the development of advanced model training techniques for circuit design tasks.
Title: Impact of Recurrent Neural Networks and Deep Learning Frameworks on Real-time Lightweight Time Series Anomaly Detection
Copy Paste: [[2407.18439]] Impact of Recurrent Neural Networks and Deep Learning Frameworks on Real-time Lightweight Time Series Anomaly Detection(https://arxiv.org/abs/2407.18439)
Keywords: anomaly
Abstract: Real-time lightweight time series anomaly detection has become increasingly crucial in cybersecurity and many other domains. Its ability to adapt to unforeseen pattern changes and swiftly identify anomalies enables prompt responses and critical decision-making. While several such anomaly detection approaches have been introduced in recent years, they primarily utilize a single type of recurrent neural networks (RNNs) and have been implemented in only one deep learning framework. It is unclear how the use of different types of RNNs available in various deep learning frameworks affects the performance of these anomaly detection approaches due to the absence of comprehensive evaluations. Arbitrarily choosing a RNN variant and a deep learning framework to implement an anomaly detection approach may not reflect its true performance and could potentially mislead users into favoring one approach over another. In this paper, we aim to study the influence of various types of RNNs available in popular deep learning frameworks on real-time lightweight time series anomaly detection. We reviewed several state-of-the-art approaches and implemented a representative anomaly detection approach using well-known RNN variants supported by three widely recognized deep learning frameworks. A comprehensive evaluation is then conducted to analyze the performance of each implementation across real-world, open-source time series datasets. The evaluation results provide valuable guidance for selecting the appropriate RNN variant and deep learning framework for real-time, lightweight time series anomaly detection.
Title: Textile Anomaly Detection: Evaluation of the State-of-the-Art for Automated Quality Inspection of Carpet
Authors: Briony Forsberg, Dr Henry Williams, Prof Bruce MacDonald, Tracy Chen, Dr Kirstine Hulse
Copy Paste: [[2407.18450]] Textile Anomaly Detection: Evaluation of the State-of-the-Art for Automated Quality Inspection of Carpet(https://arxiv.org/abs/2407.18450)
Keywords: generative, anomaly
Abstract: In this study, state-of-the-art unsupervised detection models were evaluated for the purpose of automated anomaly inspection of wool carpets. A custom dataset of four unique types of carpet textures was created to thoroughly test the models and their robustness in detecting subtle anomalies in complex textures. Due to the requirements of an inline inspection system in a manufacturing use case, the metrics of importance in this study were accuracy in detecting anomalous areas, the number of false detections, and the inference times of each model for real-time performance. Of the evaluated models, the student-teacher network based methods were found on average to yield the highest detection accuracy and lowest false detection rates. When trained on a multi-class dataset the models were found to yield comparable if not better results than single-class training. Finally, in terms of detection speed, with exception to the generative model, all other evaluated models were found to have comparable inference times on a GPU, with an average of 0.16s per image. On a CPU, most of these models typically produced results between 1.5 to 2 times the respective GPU inference times.
Title: Machine Unlearning using a Multi-GAN based Model
Authors: Amartya Hatua, Trung T. Nguyen, Andrew H. Sung
Copy Paste: [[2407.18467]] Machine Unlearning using a Multi-GAN based Model(https://arxiv.org/abs/2407.18467)
Keywords: generative
Abstract: This article presents a new machine unlearning approach that utilizes multiple Generative Adversarial Network (GAN) based models. The proposed method comprises two phases: i) data reorganization in which synthetic data using the GAN model is introduced with inverted class labels of the forget datasets, and ii) fine-tuning the pre-trained model. The GAN models consist of two pairs of generators and discriminators. The generator discriminator pairs generate synthetic data for the retain and forget datasets. Then, a pre-trained model is utilized to get the class labels of the synthetic datasets. The class labels of synthetic and original forget datasets are inverted. Finally, all combined datasets are used to fine-tune the pre-trained model to get the unlearned model. We have performed the experiments on the CIFAR-10 dataset and tested the unlearned models using Membership Inference Attacks (MIA). The inverted class labels procedure and synthetically generated data help to acquire valuable information that enables the model to outperform state-of-the-art models and other standard unlearning classifiers.
Title: Diffusion-Driven Semantic Communication for Generative Models with Bandwidth Constraints
Authors: Lei Guo, Wei Chen, Yuxuan Sun, Bo Ai, Nikolaos Pappas, Tony Quek
Copy Paste: [[2407.18468]] Diffusion-Driven Semantic Communication for Generative Models with Bandwidth Constraints(https://arxiv.org/abs/2407.18468)
Keywords: diffusion, generative
Abstract: Diffusion models have been extensively utilized in AI-generated content (AIGC) in recent years, thanks to the superior generation capabilities. Combining with semantic communications, diffusion models are used for tasks such as denoising, data reconstruction, and content generation. However, existing diffusion-based generative models do not consider the stringent bandwidth limitation, which limits its application in wireless communication. This paper introduces a diffusion-driven semantic communication framework with advanced VAE-based compression for bandwidth-constrained generative model. Our designed architecture utilizes the diffusion model, where the signal transmission process through the wireless channel acts as the forward process in diffusion. To reduce bandwidth requirements, we incorporate a downsampling module and a paired upsampling module based on a variational auto-encoder with reparameterization at the receiver to ensure that the recovered features conform to the Gaussian distribution. Furthermore, we derive the loss function for our proposed system and evaluate its performance through comprehensive experiments. Our experimental results demonstrate significant improvements in pixel-level metrics such as peak signal to noise ratio (PSNR) and semantic metrics like learned perceptual image patch similarity (LPIPS). These enhancements are more profound regarding the compression rates and SNR compared to deep joint source-channel coding (DJSCC).
Title: Answerability Fields: Answerable Location Estimation via Diffusion Models
Abstract: In an era characterized by advancements in artificial intelligence and robotics, enabling machines to interact with and understand their environment is a critical research endeavor. In this paper, we propose Answerability Fields, a novel approach to predicting answerability within complex indoor environments. Leveraging a 3D question answering dataset, we construct a comprehensive Answerability Fields dataset, encompassing diverse scenes and questions from ScanNet. Using a diffusion model, we successfully infer and evaluate these Answerability Fields, demonstrating the importance of objects and their locations in answering questions within a scene. Our results showcase the efficacy of Answerability Fields in guiding scene-understanding tasks, laying the foundation for their application in enhancing interactions between intelligent agents and their environments.
Title: Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations
Copy Paste: [[2407.18500]] Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations(https://arxiv.org/abs/2407.18500)
Keywords: self-supervised
Abstract: Reconstructing intensity frames from event data while maintaining high temporal resolution and dynamic range is crucial for bridging the gap between event-based and frame-based computer vision. Previous approaches have depended on supervised learning on synthetic data, which lacks interpretability and risk over-fitting to the setting of the event simulator. Recently, self-supervised learning (SSL) based methods, which primarily utilize per-frame optical flow to estimate intensity via photometric constancy, has been actively investigated. However, they are vulnerable to errors in the case of inaccurate optical flow. This paper proposes a novel SSL event-to-video reconstruction approach, dubbed EvINR, which eliminates the need for labeled data or optical flow estimation. Our core idea is to reconstruct intensity frames by directly addressing the event generation model, essentially a partial differential equation (PDE) that describes how events are generated based on the time-varying brightness signals. Specifically, we utilize an implicit neural representation (INR), which takes in spatiotemporal coordinate $(x, y, t)$ and predicts intensity values, to represent the solution of the event generation equation. The INR, parameterized as a fully-connected Multi-layer Perceptron (MLP), can be optimized with its temporal derivatives supervised by events. To make EvINR feasible for online requisites, we propose several acceleration techniques that substantially expedite the training process. Comprehensive experiments demonstrate that our EvINR surpasses previous SSL methods by 38% w.r.t. Mean Squared Error (MSE) and is comparable or superior to SoTA supervised methods. Project page: this https URL.
Title: Is larger always better? Evaluating and prompting large language models for non-generative medical tasks
Copy Paste: [[2407.18525]] Is larger always better? Evaluating and prompting large language models for non-generative medical tasks(https://arxiv.org/abs/2407.18525)
Keywords: generative
Abstract: The use of Large Language Models (LLMs) in medicine is growing, but their ability to handle both structured Electronic Health Record (EHR) data and unstructured clinical notes is not well-studied. This study benchmarks various models, including GPT-based LLMs, BERT-based models, and traditional clinical predictive models, for non-generative medical tasks utilizing renowned datasets. We assessed 14 language models (9 GPT-based and 5 BERT-based) and 7 traditional predictive models using the MIMIC dataset (ICU patient records) and the TJH dataset (early COVID-19 EHR data), focusing on tasks such as mortality and readmission prediction, disease hierarchy reconstruction, and biomedical sentence matching, comparing both zero-shot and finetuned performance. Results indicated that LLMs exhibited robust zero-shot predictive capabilities on structured EHR data when using well-designed prompting strategies, frequently surpassing traditional models. However, for unstructured medical texts, LLMs did not outperform finetuned BERT models, which excelled in both supervised and unsupervised tasks. Consequently, while LLMs are effective for zero-shot learning on structured data, finetuned BERT models are more suitable for unstructured texts, underscoring the importance of selecting models based on specific task requirements and data characteristics to optimize the application of NLP technology in healthcare.
Title: Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers
Authors: Longkun Zou, Wanru Zhu, Ke Chen, Lihua Guo, Kailing Guo, Kui Jia, Yaowei Wang
Copy Paste: [[2407.18534]] Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers(https://arxiv.org/abs/2407.18534)
Keywords: self-supervised
Abstract: Semantic pattern of an object point cloud is determined by its topological configuration of local geometries. Learning discriminative representations can be challenging due to large shape variations of point sets in local regions and incomplete surface in a global perspective, which can be made even more severe in the context of unsupervised domain adaptation (UDA). In specific, traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries, which greatly limits their cross-domain generalization. Recently, the transformer-based models have achieved impressive performance gain in a range of image-based tasks, benefiting from its strong generalization capability and scalability stemming from capturing long range correlation across local patches. Inspired by such successes of visual transformers, we propose a novel Relational Priors Distillation (RPD) method to extract relational priors from the well-trained transformers on massive images, which can significantly empower cross-domain representations with consistent topological priors of objects. To this end, we establish a parameter-frozen pre-trained transformer module shared between 2D teacher and 3D student models, complemented by an online knowledge distillation strategy for semantically regularizing the 3D student model. Furthermore, we introduce a novel self-supervised task centered on reconstructing masked point cloud patches using corresponding masked multi-view image features, thereby empowering the model with incorporating 3D geometric information. Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification. The source code of this work is available at this https URL.
Title: A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models
Authors: Julian Neuberger, Lars Ackermann, Han van der Aa, Stefan Jablonski
Copy Paste: [[2407.18540]] A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models(https://arxiv.org/abs/2407.18540)
Keywords: generative
Abstract: Over the past decade, extensive research efforts have been dedicated to the extraction of information from textual process descriptions. Despite the remarkable progress witnessed in natural language processing (NLP), information extraction within the Business Process Management domain remains predominantly reliant on rule-based systems and machine learning methodologies. Data scarcity has so far prevented the successful application of deep learning techniques. However, the rapid progress in generative large language models (LLMs) makes it possible to solve many NLP tasks with very high quality without the need for extensive data. Therefore, we systematically investigate the potential of LLMs for extracting information from textual process descriptions, targeting the detection of process elements such as activities and actors, and relations between them. Using a heuristic algorithm, we demonstrate the suitability of the extracted information for process model generation. Based on a novel prompting strategy, we show that LLMs are able to outperform state-of-the-art machine learning approaches with absolute performance improvements of up to 8\% $F_1$ score across three different datasets. We evaluate our prompting strategy on eight different LLMs, showing it is universally applicable, while also analyzing the impact of certain prompt parts on extraction quality. The number of example texts, the specificity of definitions, and the rigour of format instructions are identified as key for improving the accuracy of extracted information. Our code, prompts, and data are publicly available.
Title: Learning Spectral-Decomposed Tokens for Domain Generalized Semantic Segmentation
Abstract: The rapid development of Vision Foundation Model (VFM) brings inherent out-domain generalization for a variety of down-stream tasks. Among them, domain generalized semantic segmentation (DGSS) holds unique challenges as the cross-domain images share common pixel-wise content information but vary greatly in terms of the style. In this paper, we present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier. Delving into further than existing fine-tuning token & frozen backbone paradigm, the proposed SET especially focuses on the way learning style-invariant features from these learnable tokens. Particularly, the frozen VFM features are first decomposed into the phase and amplitude components in the frequency space, which mainly contain the information of content and style, respectively, and then separately processed by learnable tokens for task-specific information extraction. After the decomposition, style variation primarily impacts the token-based feature enhancement within the amplitude branch. To address this issue, we further develop an attention optimization method to bridge the gap between style-affected representation and static tokens during inference. Extensive cross-domain experiments show its state-of-the-art performance.
Title: LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement
Authors: Rui Zhang, Yixiao Fang, Zhengnan Lu, Pei Cheng, Zebiao Huang, Bin Fu
Copy Paste: [[2407.18595]] LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement(https://arxiv.org/abs/2407.18595)
Keywords: diffusion
Abstract: This study delves into the intricacies of synchronizing facial dynamics with multilingual audio inputs, focusing on the creation of visually compelling, time-synchronized animations through diffusion-based techniques. Diverging from traditional parametric models for facial animation, our approach, termed LinguaLinker, adopts a holistic diffusion-based framework that integrates audio-driven visual synthesis to enhance the synergy between auditory stimuli and visual responses. We process audio features separately and derive the corresponding control gates, which implicitly govern the movements in the mouth, eyes, and head, irrespective of the portrait's origin. The advanced audio-driven visual synthesis mechanism provides nuanced control but keeps the compatibility of output video and input audio, allowing for a more tailored and effective portrayal of distinct personas across different languages. The significant improvements in the fidelity of animated portraits, the accuracy of lip-syncing, and the appropriate motion variations achieved by our method render it a versatile tool for animating any portrait in any language.
Abstract: Investigating noise distribution beyond Gaussian in diffusion generative models is an open problem. The Gaussian case has seen success experimentally and theoretically, fitting a unified SDE framework for score-based and denoising formulations. Recent studies suggest heavy-tailed noise distributions can address mode collapse and manage datasets with class imbalance, heavy tails, or outliers. Yoon et al. (NeurIPS 2023) introduced the Lévy-Ito model (LIM), extending the SDE framework to heavy-tailed SDEs with $\alpha$-stable noise. Despite its theoretical elegance and performance gains, LIM's complex mathematics may limit its accessibility and broader adoption. This study takes a simpler approach by extending the denoising diffusion probabilistic model (DDPM) with $\alpha$-stable noise, creating the denoising Lévy probabilistic model (DLPM). Using elementary proof techniques, we show DLPM reduces to running vanilla DDPM with minimal changes, allowing the use of existing implementations with minimal changes. DLPM and LIM have different training algorithms and, unlike the Gaussian case, they admit different backward processes and sampling algorithms. Our experiments demonstrate that DLPM achieves better coverage of data distribution tail, improved generation of unbalanced datasets, and faster computation times with fewer backward steps.
Title: Robust VAEs via Generating Process of Noise Augmented Data
Copy Paste: [[2407.18632]] Robust VAEs via Generating Process of Noise Augmented Data(https://arxiv.org/abs/2407.18632)
Keywords: generative
Abstract: Advancing defensive mechanisms against adversarial attacks in generative models is a critical research topic in machine learning. Our study focuses on a specific type of generative models - Variational Auto-Encoders (VAEs). Contrary to common beliefs and existing literature which suggest that noise injection towards training data can make models more robust, our preliminary experiments revealed that naive usage of noise augmentation technique did not substantially improve VAE robustness. In fact, it even degraded the quality of learned representations, making VAEs more susceptible to adversarial perturbations. This paper introduces a novel framework that enhances robustness by regularizing the latent space divergence between original and noise-augmented data. Through incorporating a paired probabilistic prior into the standard variational lower bound, our method significantly boosts defense against adversarial attacks. Our empirical evaluations demonstrate that this approach, termed Robust Augmented Variational Auto-ENcoder (RAVEN), yields superior performance in resisting adversarial inputs on widely-recognized benchmark datasets.
Title: Auto DragGAN: Editing the Generative Image Manifold in an Autoregressive Manner
Copy Paste: [[2407.18656]] Auto DragGAN: Editing the Generative Image Manifold in an Autoregressive Manner(https://arxiv.org/abs/2407.18656)
Keywords: generative
Abstract: Pixel-level fine-grained image editing remains an open challenge. Previous works fail to achieve an ideal trade-off between control granularity and inference speed. They either fail to achieve pixel-level fine-grained control, or their inference speed requires optimization. To address this, this paper for the first time employs a regression-based network to learn the variation patterns of StyleGAN latent codes during the image dragging process. This method enables pixel-level precision in dragging editing with little time cost. Users can specify handle points and their corresponding target points on any GAN-generated images, and our method will move each handle point to its corresponding target point. Through experimental analysis, we discover that a short movement distance from handle points to target points yields a high-fidelity edited image, as the model only needs to predict the movement of a small portion of pixels. To achieve this, we decompose the entire movement process into multiple sub-processes. Specifically, we develop a transformer encoder-decoder based network named 'Latent Predictor' to predict the latent code motion trajectories from handle points to target points in an autoregressive manner. Moreover, to enhance the prediction stability, we introduce a component named 'Latent Regularizer', aimed at constraining the latent code motion within the distribution of natural images. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) inference speed and image editing performance at the pixel-level granularity.
Title: Adversarial Robustification via Text-to-Image Diffusion Models
Copy Paste: [[2407.18658]] Adversarial Robustification via Text-to-Image Diffusion Models(https://arxiv.org/abs/2407.18658)
Keywords: diffusion
Abstract: Adversarial robustness has been conventionally believed as a challenging property to encode for neural networks, requiring plenty of training data. In the recent paradigm of adopting off-the-shelf models, however, access to their training data is often infeasible or not practical, while most of such models are not originally trained concerning adversarial robustness. In this paper, we develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data. Our intuition is to view recent text-to-image diffusion models as "adaptable" denoisers that can be optimized to specify target tasks. Based on this, we propose: (a) to initiate a denoise-and-classify pipeline that offers provable guarantees against adversarial attacks, and (b) to leverage a few synthetic reference images generated from the text-to-image model that enables novel adaptation schemes. Our experiments show that our data-free scheme applied to the pre-trained CLIP could improve the (provable) adversarial robustness of its diverse zero-shot classification derivatives (while maintaining their accuracy), significantly surpassing prior approaches that utilize the full training data. Not only for CLIP, we also demonstrate that our framework is easily applicable for robustifying other visual classifiers efficiently.
Title: Scalable Group Choreography via Variational Phase Manifold Learning
Authors: Nhat Le, Khoa Do, Xuan Bui, Tuong Do, Erman Tjiputra, Quang D.Tran, Anh Nguyen
Copy Paste: [[2407.18839]] Scalable Group Choreography via Variational Phase Manifold Learning(https://arxiv.org/abs/2407.18839)
Keywords: generative
Abstract: Generating group dance motion from the music is a challenging task with several industrial applications. Although several methods have been proposed to tackle this problem, most of them prioritize optimizing the fidelity in dancing movement, constrained by predetermined dancer counts in datasets. This limitation impedes adaptability to real-world applications. Our study addresses the scalability problem in group choreography while preserving naturalness and synchronization. In particular, we propose a phase-based variational generative model for group dance generation on learning a generative manifold. Our method achieves high-fidelity group dance motion and enables the generation with an unlimited number of dancers while consuming only a minimal and constant amount of memory. The intensive experiments on two public datasets show that our proposed method outperforms recent state-of-the-art approaches by a large margin and is scalable to a great number of dancers beyond the training data.
Title: Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment
Copy Paste: [[2407.18854]] Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment(https://arxiv.org/abs/2407.18854)
Keywords: diffusion
Abstract: Image classification models often demonstrate unstable performance in real-world applications due to variations in image information, driven by differing visual perspectives of subject objects and lighting discrepancies. To mitigate these challenges, existing studies commonly incorporate additional modal information matching the visual data to regularize the model's learning process, enabling the extraction of high-quality visual features from complex image regions. Specifically, in the realm of multimodal learning, cross-modal alignment is recognized as an effective strategy, harmonizing different modal information by learning a domain-consistent latent feature space for visual and semantic features. However, this approach may face limitations due to the heterogeneity between multimodal information, such as differences in feature distribution and structure. To address this issue, we introduce a Multimodal Alignment and Reconstruction Network (MARNet), designed to enhance the model's resistance to visual noise. Importantly, MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains. Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model. It is a plug-and-play framework that can be rapidly integrated into various image classification frameworks, boosting model performance.
Title: HADES: Detecting Active Directory Attacks via Whole Network Provenance Analytics
Authors: Qi Liu, Kaibin Bao, Wajih Ul Hassan, Veit Hagenmeyer
Copy Paste: [[2407.18858]] HADES: Detecting Active Directory Attacks via Whole Network Provenance Analytics(https://arxiv.org/abs/2407.18858)
Keywords: anomaly
Abstract: Due to its crucial role in identity and access management in modern enterprise networks, Active Directory (AD) is a top target of Advanced Persistence Threat (APT) actors. Conventional intrusion detection systems (IDS) excel at identifying malicious behaviors caused by malware, but often fail to detect stealthy attacks launched by APT actors. Recent advance in provenance-based IDS (PIDS) shows promises by exposing malicious system activities in causal attack graphs. However, existing approaches are restricted to intra-machine tracing, and unable to reveal the scope of attackers' traversal inside a network. We propose HADES, the first PIDS capable of performing accurate causality-based cross-machine tracing by leveraging a novel concept called logon session based execution partitioning to overcome several challenges in cross-machine tracing. We design HADES as an efficient on-demand tracing system, which performs whole-network tracing only when it first identifies an authentication anomaly signifying an ongoing AD attack, for which we introduce a novel lightweight authentication anomaly detection model rooted in our extensive analysis of AD attacks. To triage attack alerts, we present a new algorithm integrating two key insights we identified in AD attacks. Our evaluations show that HADES outperforms both popular open source detection systems and a prominent commercial AD attack detector.
Title: Generative Adversarial Networks for Imputing Sparse Learning Performance
Authors: Liang Zhang, Mohammed Yeasin, Jionghao Lin, Felix Havugimana, Xiangen Hu
Abstract: Learning performance data, such as correct or incorrect responses to questions in Intelligent Tutoring Systems (ITSs) is crucial for tracking and assessing the learners' progress and mastery of knowledge. However, the issue of data sparsity, characterized by unexplored questions and missing attempts, hampers accurate assessment and the provision of tailored, personalized instruction within ITSs. This paper proposes using the Generative Adversarial Imputation Networks (GAIN) framework to impute sparse learning performance data, reconstructed into a three-dimensional (3D) tensor representation across the dimensions of learners, questions and attempts. Our customized GAIN-based method computational process imputes sparse data in a 3D tensor space, significantly enhanced by convolutional neural networks for its input and output layers. This adaptation also includes the use of a least squares loss function for optimization and aligns the shapes of the input and output with the dimensions of the questions-attempts matrices along the learners' dimension. Through extensive experiments on six datasets from various ITSs, including AutoTutor, ASSISTments and MATHia, we demonstrate that the GAIN approach generally outperforms existing methods such as tensor factorization and other generative adversarial network (GAN) based approaches in terms of imputation accuracy. This finding enhances comprehensive learning data modeling and analytics in AI-based education.
Title: Small Molecule Optimization with Large Language Models
Copy Paste: [[2407.18897]] Small Molecule Optimization with Large Language Models(https://arxiv.org/abs/2407.18897)
Keywords: generative
Abstract: Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
Title: SHIC: Shape-Image Correspondences with no Keypoint Supervision
Authors: Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi
Copy Paste: [[2407.18907]] SHIC: Shape-Image Correspondences with no Keypoint Supervision(https://arxiv.org/abs/2407.18907)
Keywords: diffusion, foundation model
Abstract: Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. Popularised by DensePose for the analysis of humans, authors have since attempted to apply the concept to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.