2024-01-09

diffusion

Title: Latte: Latent Diffusion Transformer for Video Generation. (arXiv:2401.03048v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03048
Code URL: https://github.com/maxin-cn/Latte
Copy Paste: [[2401.03048]] Latte: Latent Diffusion Transformer for Video Generation(http://arxiv.org/abs/2401.03048)
Summary:
We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

Title: SAR Despeckling via Regional Denoising Diffusion Probabilistic Model. (arXiv:2401.03122v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03122
Code URL: null
Copy Paste: [[2401.03122]] SAR Despeckling via Regional Denoising Diffusion Probabilistic Model(http://arxiv.org/abs/2401.03122)
Summary:
Speckle noise poses a significant challenge in maintaining the quality of synthetic aperture radar (SAR) images, so SAR despeckling techniques have drawn increasing attention. Despite the tremendous advancements of deep learning in fixed-scale SAR image despeckling, these methods still struggle to deal with large-scale SAR images. To address this problem, this paper introduces a novel despeckling approach termed Region Denoising Diffusion Probabilistic Model (R-DDPM) based on generative models. R-DDPM enables versatile despeckling of SAR images across various scales, accomplished within a single training session. Moreover, The artifacts in the fused SAR images can be avoided effectively with the utilization of region-guided inverse sampling. Experiments of our proposed R-DDPM on Sentinel-1 data demonstrates superior performance to existing methods.

Title: Controllable Image Synthesis of Industrial Data Using Stable Diffusion. (arXiv:2401.03152v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03152
Code URL: null
Copy Paste: [[2401.03152]] Controllable Image Synthesis of Industrial Data Using Stable Diffusion(http://arxiv.org/abs/2401.03152)
Summary:
Training supervised deep neural networks that perform defect detection and segmentation requires large-scale fully-annotated datasets, which can be hard or even impossible to obtain in industrial environments. Generative AI offers opportunities to enlarge small industrial datasets artificially, thus enabling the usage of state-of-the-art supervised approaches in the industry. Unfortunately, also good generative models need a lot of data to train, while industrial datasets are often tiny. Here, we propose a new approach for reusing general-purpose pre-trained generative models on industrial data, ultimately allowing the generation of self-labelled defective images. First, we let the model learn the new concept, entailing the novel data distribution. Then, we force it to learn to condition the generative process, producing industrial images that satisfy well-defined topological characteristics and show defects with a given geometry and location. To highlight the advantage of our approach, we use the synthetic dataset to optimise a crack segmentor for a real industrial use case. When the available data is small, we observe considerable performance increase under several metrics, showing the method's potential in production environments.

Title: An Event-Oriented Diffusion-Refinement Method for Sparse Events Completion. (arXiv:2401.03153v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03153
Code URL: null
Copy Paste: [[2401.03153]] An Event-Oriented Diffusion-Refinement Method for Sparse Events Completion(http://arxiv.org/abs/2401.03153)
Summary:
Event cameras or dynamic vision sensors (DVS) record asynchronous response to brightness changes instead of conventional intensity frames, and feature ultra-high sensitivity at low bandwidth. The new mechanism demonstrates great advantages in challenging scenarios with fast motion and large dynamic range. However, the recorded events might be highly sparse due to either limited hardware bandwidth or extreme photon starvation in harsh environments. To unlock the full potential of event cameras, we propose an inventive event sequence completion approach conforming to the unique characteristics of event data in both the processing stage and the output form. Specifically, we treat event streams as 3D event clouds in the spatiotemporal domain, develop a diffusion-based generative model to generate dense clouds in a coarse-to-fine manner, and recover exact timestamps to maintain the temporal resolution of raw data successfully. To validate the effectiveness of our method comprehensively, we perform extensive experiments on three widely used public datasets with different spatial resolutions, and additionally collect a novel event dataset covering diverse scenarios with highly dynamic motions and under harsh illumination. Besides generating high-quality dense events, our method can benefit downstream applications such as object classification and intensity frame reconstruction.

Title: PosDiffNet: Positional Neural Diffusion for Point Cloud Registration in a Large Field of View with Perturbations. (arXiv:2401.03167v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03167
Code URL: null
Copy Paste: [[2401.03167]] PosDiffNet: Positional Neural Diffusion for Point Cloud Registration in a Large Field of View with Perturbations(http://arxiv.org/abs/2401.03167)
Summary:
Point cloud registration is a crucial technique in 3D computer vision with a wide range of applications. However, this task can be challenging, particularly in large fields of view with dynamic objects, environmental noise, or other perturbations. To address this challenge, we propose a model called PosDiffNet. Our approach performs hierarchical registration based on window-level, patch-level, and point-level correspondence. We leverage a graph neural partial differential equation (PDE) based on Beltrami flow to obtain high-dimensional features and position embeddings for point clouds. We incorporate position embeddings into a Transformer module based on a neural ordinary differential equation (ODE) to efficiently represent patches within points. We employ the multi-level correspondence derived from the high feature similarity scores to facilitate alignment between point clouds. Subsequently, we use registration methods such as SVD-based algorithms to predict the transformation using corresponding point pairs. We evaluate PosDiffNet on several 3D point cloud datasets, verifying that it achieves state-of-the-art (SOTA) performance for point cloud registration in large fields of view with perturbations. The implementation code of experiments is available at https://github.com/AI-IT-AVs/PosDiffNet.

Title: MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond. (arXiv:2401.03221v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03221
Code URL: null
Copy Paste: [[2401.03221]] MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond(http://arxiv.org/abs/2401.03221)
Summary:
Recently, text-to-image diffusion models become a new paradigm in image processing fields, including content generation, image restoration and image-to-image translation. Given a target prompt, Denoising Diffusion Probabilistic Models (DDPM) are able to generate realistic yet eligible images. With this appealing property, the image translation task has the potential to be free from target image samples for supervision. By using a target text prompt for domain adaption, the diffusion model is able to implement zero-shot image-to-image translation advantageously. However, the sampling and inversion processes of DDPM are stochastic, and thus the inversion process often fail to reconstruct the input content. Specifically, the displacement effect will gradually accumulated during the diffusion and inversion processes, which led to the reconstructed results deviating from the source domain. To make reconstruction explicit, we propose a prompt redescription strategy to realize a mirror effect between the source and reconstructed image in the diffusion model (MirrorDiffusion). More specifically, a prompt redescription mechanism is investigated to align the text prompts with latent code at each time step of the Denoising Diffusion Implicit Models (DDIM) inversion to pursue a structure-preserving reconstruction. With the revised DDIM inversion, MirrorDiffusion is able to realize accurate zero-shot image translation by editing optimized text prompts and latent code. Extensive experiments demonstrate that MirrorDiffusion achieves superior performance over the state-of-the-art methods on zero-shot image translation benchmarks by clear margins and practical model stability.

Title: Image Inpainting via Tractable Steering of Diffusion Models. (arXiv:2401.03349v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03349
Code URL: null
Copy Paste: [[2401.03349]] Image Inpainting via Tractable Steering of Diffusion Models(http://arxiv.org/abs/2401.03349)
Summary:
Diffusion models are the current state of the art for generating photorealistic images. Controlling the sampling process for constrained image generation tasks such as inpainting, however, remains challenging since exact conditioning on such constraints is intractable. While existing methods use various techniques to approximate the constrained posterior, this paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior, and to leverage this signal to steer the denoising process of diffusion models. Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs). Building upon prior advances, we further scale up PCs and make them capable of guiding the image generation process of diffusion models. Empirical results suggest that our approach can consistently improve the overall quality and semantic coherence of inpainted images across three natural image datasets (i.e., CelebA-HQ, ImageNet, and LSUN) with only ~10% additional computational overhead brought by the TPM. Further, with the help of an image encoder and decoder, our method can readily accept semantic constraints on specific regions of the image, which opens up the potential for more controlled image generation tasks. In addition to proposing a new framework for constrained image generation, this paper highlights the benefit of more tractable models and motivates the development of expressive TPMs.

Title: The Rise of Diffusion Models in Time-Series Forecasting. (arXiv:2401.03006v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03006
Code URL: null
Copy Paste: [[2401.03006]] The Rise of Diffusion Models in Time-Series Forecasting(http://arxiv.org/abs/2401.03006)
Summary:
This survey delves into the application of diffusion models in time-series forecasting. Diffusion models are demonstrating state-of-the-art results in various fields of generative AI. The paper includes comprehensive background information on diffusion models, detailing their conditioning methods and reviewing their use in time-series forecasting. The analysis covers 11 specific time-series implementations, the intuition and theory behind them, the effectiveness on different datasets, and a comparison among each other. Key contributions of this work are the thorough exploration of diffusion models' applications in time-series forecasting and a chronologically ordered overview of these models. Additionally, the paper offers an insightful discussion on the current state-of-the-art in this domain and outlines potential future research directions. This serves as a valuable resource for researchers in AI and time-series analysis, offering a clear view of the latest advancements and future potential of diffusion models.

Title: Fair Sampling in Diffusion Models through Switching Mechanism. (arXiv:2401.03140v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03140
Code URL: https://github.com/uzn36/attributeswitching
Copy Paste: [[2401.03140]] Fair Sampling in Diffusion Models through Switching Mechanism(http://arxiv.org/abs/2401.03140)
Summary:
Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms of fairness. While the sampling process of diffusion models can be controlled by conditional guidance, previous works have attempted to find empirical guidance to achieve quantitative fairness. To address this limitation, we propose a fairness-aware sampling method called \textit{attribute switching} mechanism for diffusion models. Without additional training, the proposed sampling can obfuscate sensitive attributes in generated data without relying on classifiers. We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects: (i) the generation of fair data and (ii) the preservation of the utility of the generated data.

self-supervised

Title: Dress-Me-Up: A Dataset & Method for Self-Supervised 3D Garment Retargeting. (arXiv:2401.03108v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03108
Code URL: null
Copy Paste: [[2401.03108]] Dress-Me-Up: A Dataset & Method for Self-Supervised 3D Garment Retargeting(http://arxiv.org/abs/2401.03108)
Summary:
We propose a novel self-supervised framework for retargeting non-parameterized 3D garments onto 3D human avatars of arbitrary shapes and poses, enabling 3D virtual try-on (VTON). Existing self-supervised 3D retargeting methods only support parametric and canonical garments, which can only be draped over parametric body, e.g. SMPL. To facilitate the non-parametric garments and body, we propose a novel method that introduces Isomap Embedding based correspondences matching between the garment and the human body to get a coarse alignment between the two meshes. We perform neural refinement of the coarse alignment in a self-supervised setting. Further, we leverage a Laplacian detail integration method for preserving the inherent details of the input garment. For evaluating our 3D non-parametric garment retargeting framework, we propose a dataset of 255 real-world garments with realistic noise and topological deformations. The dataset contains $44$ unique garments worn by 15 different subjects in 5 distinctive poses, captured using a multi-view RGBD capture setup. We show superior retargeting quality on non-parametric garments and human avatars over existing state-of-the-art methods, acting as the first-ever baseline on the proposed dataset for non-parametric 3D garment retargeting.

Title: Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection. (arXiv:2401.03145v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03145
Code URL: null
Copy Paste: [[2401.03145]] Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection(http://arxiv.org/abs/2401.03145)
Summary:
Industrial anomaly detection is generally addressed as an unsupervised task that aims at locating defects with only normal training samples. Recently, numerous 2D anomaly detection methods have been proposed and have achieved promising results, however, using only the 2D RGB data as input is not sufficient to identify imperceptible geometric surface anomalies. Hence, in this work, we focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets, i.e., ImageNet, to construct feature databases. And we empirically find that directly using these pre-trained models is not optimal, it can either fail to detect subtle defects or mistake abnormal features as normal ones. This may be attributed to the domain gap between target industrial data and source data.Towards this problem, we propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.Both intra-modal adaptation and cross-modal alignment are optimized from a local-to-global perspective in LSFA to ensure the representation quality and consistency in the inference stage.Extensive experiments demonstrate that our method not only brings a significant performance boost to feature embedding based approaches, but also outperforms previous State-of-The-Art (SoTA) methods prominently on both MVTec-3D AD and Eyecandies datasets, e.g., LSFA achieves 97.1% I-AUROC on MVTec-3D, surpass previous SoTA by +3.4%.

Title: Preserving Silent Features for Domain Generalization. (arXiv:2401.03170v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03170
Code URL: null
Copy Paste: [[2401.03170]] Preserving Silent Features for Domain Generalization(http://arxiv.org/abs/2401.03170)
Summary:
Domain generalization (DG) aims to improve the generalization ability of the model trained on several known training domains over unseen test domains. Previous work has shown that self-supervised contrastive pre-training improves the robustness of the model on downstream tasks. However, in this paper, we find that self-supervised models do not exhibit better generalization performance than supervised models pre-trained on the same dataset in the DG setting. We argue that this is owing to the fact that the richer intra-class discriminative features extracted by self-supervised contrastive learning, which we term silent features, are suppressed during supervised fine-tuning. These silent features are likely to contain features that are more generalizable on the test domain. In this work, we model and analyze this feature suppression phenomenon and theoretically prove that preserving silent features can achieve lower expected test domain risk under certain conditions. In light of this, we propose a simple yet effective method termed STEP (Silent Feature Preservation) to improve the generalization performance of the self-supervised contrastive learning pre-trained model by alleviating the suppression of silent features during the supervised fine-tuning process. Experimental results show that STEP exhibits state-of-the-art performance on standard DG benchmarks with significant distribution shifts.

Title: Exploiting Data Hierarchy as a New Modality for Contrastive Learning. (arXiv:2401.03312v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03312
Code URL: null
Copy Paste: [[2401.03312]] Exploiting Data Hierarchy as a New Modality for Contrastive Learning(http://arxiv.org/abs/2401.03312)
Summary:
This work investigates how hierarchically structured data can help neural networks learn conceptual representations of cathedrals. The underlying WikiScenes dataset provides a spatially organized hierarchical structure of cathedral components. We propose a novel hierarchical contrastive training approach that leverages a triplet margin loss to represent the data's spatial hierarchy in the encoder's latent space. As such, the proposed approach investigates if the dataset structure provides valuable information for self-supervised learning. We apply t-SNE to visualize the resultant latent space and evaluate the proposed approach by comparing it with other dataset-specific contrastive learning methods using a common downstream classification task. The proposed method outperforms the comparable weakly-supervised and baseline methods. Our findings suggest that dataset structure is a valuable modality for weakly-supervised learning.

Title: TimeGraphs: Graph-based Temporal Reasoning. (arXiv:2401.03134v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03134
Code URL: null
Copy Paste: [[2401.03134]] TimeGraphs: Graph-based Temporal Reasoning(http://arxiv.org/abs/2401.03134)
Summary:
Many real-world systems exhibit temporal, dynamic behaviors, which are captured as time series of complex agent interactions. To perform temporal reasoning, current methods primarily encode temporal dynamics through simple sequence-based models. However, in general these models fail to efficiently capture the full spectrum of rich dynamics in the input, since the dynamics is not uniformly distributed. In particular, relevant information might be harder to extract and computing power is wasted for processing all individual timesteps, even if they contain no significant changes or no new information. Here we propose TimeGraphs, a novel approach that characterizes dynamic interactions as a hierarchical temporal graph, diverging from traditional sequential representations. Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales. Adopting a self-supervised method, TimeGraphs constructs a multi-level event hierarchy from a temporal input, which is then used to efficiently reason about the unevenly distributed dynamics. This construction process is scalable and incremental to accommodate streaming data. We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset. The results demonstrate both robustness and efficiency of TimeGraphs on a range of temporal reasoning tasks. Our approach obtains state-of-the-art performance and leads to a performance increase of up to 12.2% on event prediction and recognition tasks over current approaches. Our experiments further demonstrate a wide array of capabilities including zero-shot generalization, robustness in case of data sparsity, and adaptability to streaming data flow.

Title: Understanding Representation Learnability of Nonlinear Self-Supervised Learning. (arXiv:2401.03214v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03214
Code URL: null
Copy Paste: [[2401.03214]] Understanding Representation Learnability of Nonlinear Self-Supervised Learning(http://arxiv.org/abs/2401.03214)
Summary:
Self-supervised learning (SSL) has empirically shown its data representation learnability in many downstream tasks. There are only a few theoretical works on data representation learnability, and many of those focus on final data representation, treating the nonlinear neural network as a ``black box". However, the accurate learning results of neural networks are crucial for describing the data distribution features learned by SSL models. Our paper is the first to analyze the learning results of the nonlinear SSL model accurately. We consider a toy data distribution that contains two features: the label-related feature and the hidden feature. Unlike previous linear setting work that depends on closed-form solutions, we use the gradient descent algorithm to train a 1-layer nonlinear SSL model with a certain initialization region and prove that the model converges to a local minimum. Furthermore, different from the complex iterative analysis, we propose a new analysis process which uses the exact version of Inverse Function Theorem to accurately describe the features learned by the local minimum. With this local minimum, we prove that the nonlinear SSL model can capture the label-related feature and hidden feature at the same time. In contrast, the nonlinear supervised learning (SL) model can only learn the label-related feature. We also present the learning processes and results of the nonlinear SSL and SL model via simulation experiments.

foundation model

Title: AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident Analysis. (arXiv:2401.03040v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03040
Code URL: null
Copy Paste: [[2401.03040]] AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident Analysis(http://arxiv.org/abs/2401.03040)
Summary:
Traffic accident analysis is pivotal for enhancing public safety and developing road regulations. Traditional approaches, although widely used, are often constrained by manual analysis processes, subjective decisions, uni-modal outputs, as well as privacy issues related to sensitive data. This paper introduces the idea of AccidentGPT, a foundation model of traffic accident analysis, which incorporates multi-modal input data to automatically reconstruct the accident process video with dynamics details, and furthermore provide multi-task analysis with multi-modal outputs. The design of the AccidentGPT is empowered with a multi-modality prompt with feedback for task-oriented adaptability, a hybrid training schema to leverage labelled and unlabelled data, and a edge-cloud split configuration for data privacy. To fully realize the functionalities of this model, we proposes several research opportunities. This paper serves as the stepping stone to fill the gaps in traditional approaches of traffic accident analysis and attract the research community attention for automatic, objective, and privacy-preserving traffic accident analysis.

generative

Title: A Surrogate-Assisted Extended Generative Adversarial Network for Parameter Optimization in Free-Form Metasurface Design. (arXiv:2401.02961v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.02961
Code URL: null
Copy Paste: [[2401.02961]] A Surrogate-Assisted Extended Generative Adversarial Network for Parameter Optimization in Free-Form Metasurface Design(http://arxiv.org/abs/2401.02961)
Summary:
Metasurfaces have widespread applications in fifth-generation (5G) microwave communication. Among the metasurface family, free-form metasurfaces excel in achieving intricate spectral responses compared to regular-shape counterparts. However, conventional numerical methods for free-form metasurfaces are time-consuming and demand specialized expertise. Alternatively, recent studies demonstrate that deep learning has great potential to accelerate and refine metasurface designs. Here, we present XGAN, an extended generative adversarial network (GAN) with a surrogate for high-quality free-form metasurface designs. The proposed surrogate provides a physical constraint to XGAN so that XGAN can accurately generate metasurfaces monolithically from input spectral responses. In comparative experiments involving 20000 free-form metasurface designs, XGAN achieves 0.9734 average accuracy and is 500 times faster than the conventional methodology. This method facilitates the metasurface library building for specific spectral responses and can be extended to various inverse design problems, including optical metamaterials, nanophotonic devices, and drug discovery.

Title: A Physics-guided Generative AI Toolkit for Geophysical Monitoring. (arXiv:2401.03131v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03131
Code URL: null
Copy Paste: [[2401.03131]] A Physics-guided Generative AI Toolkit for Geophysical Monitoring(http://arxiv.org/abs/2401.03131)
Summary:
Full-waveform inversion (FWI) plays a vital role in geoscience to explore the subsurface. It utilizes the seismic wave to image the subsurface velocity map. As the machine learning (ML) technique evolves, the data-driven approaches using ML for FWI tasks have emerged, offering enhanced accuracy and reduced computational cost compared to traditional physics-based methods. However, a common challenge in geoscience, the unprivileged data, severely limits ML effectiveness. The issue becomes even worse during model pruning, a step essential in geoscience due to environmental complexities. To tackle this, we introduce the EdGeo toolkit, which employs a diffusion-based model guided by physics principles to generate high-fidelity velocity maps. The toolkit uses the acoustic wave equation to generate corresponding seismic waveform data, facilitating the fine-tuning of pruned ML models. Our results demonstrate significant improvements in SSIM scores and reduction in both MAE and MSE across various pruning ratios. Notably, the ML model fine-tuned using data generated by EdGeo yields superior quality of velocity maps, especially in representing unprivileged features, outperforming other existing methods.

Title: Learning from a Generative AI Predecessor -- The Many Motivations for Interacting with Conversational Agents. (arXiv:2401.02978v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02978
Code URL: null
Copy Paste: [[2401.02978]] Learning from a Generative AI Predecessor -- The Many Motivations for Interacting with Conversational Agents(http://arxiv.org/abs/2401.02978)
Summary:
For generative AI to succeed, how engaging a conversationalist must it be? For almost sixty years, some conversational agents have responded to any question or comment to keep a conversation going. In recent years, several utilized machine learning or sophisticated language processing, such as Tay, Xiaoice, Zo, Hugging Face, Kuki, and Replika. Unlike generative AI, they focused on engagement, not expertise. Millions of people were motivated to engage with them. What were the attractions? Will generative AI do better if it is equally engaging, or should it be less engaging? Prior to the emergence of generative AI, we conducted a large-scale quantitative and qualitative analysis to learn what motivated millions of people to engage with one such 'virtual companion,' Microsoft's Zo. We examined the complete chat logs of 2000 anonymized people. We identified over a dozen motivations that people had for interacting with this software. Designers learned different ways to increase engagement. Generative conversational AI does not yet have a clear revenue model to address its high cost. It might benefit from being more engaging, even as it supports productivity and creativity. Our study and analysis point to opportunities and challenges.

Title: Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education. (arXiv:2401.02985v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02985
Code URL: null
Copy Paste: [[2401.02985]] Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education(http://arxiv.org/abs/2401.02985)
Summary:
The rapid evolution of artificial intelligence (AI), especially in the domain of Large Language Models (LLMs) and generative AI, has opened new avenues for application across various fields, yet its role in business education remains underexplored. This study introduces the first benchmark to assess the performance of seven major LLMs, OpenAI's models (GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo), Google's models (PaLM 2, Gemini 1.0 Pro), and Anthropic's models (Claude 2 and Claude 2.1), on the GMAT, which is a key exam in the admission process for graduate business programs. Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools. Through a case study, this research examines GPT-4 Turbo's ability to explain answers, evaluate responses, identify errors, tailor instructions, and generate alternative scenarios. The latest LLM versions, GPT-4 Turbo, Claude 2.1, and Gemini 1.0 Pro, show marked improvements in reasoning tasks compared to their predecessors, underscoring their potential for complex problem-solving. While AI's promise in education, assessment, and tutoring is clear, challenges remain. Our study not only sheds light on LLMs' academic potential but also emphasizes the need for careful development and application of AI in education. As AI technology advances, it is imperative to establish frameworks and protocols for AI interaction, verify the accuracy of AI-generated content, ensure worldwide access for diverse learners, and create an educational environment where AI supports human expertise. This research sets the stage for further exploration into the responsible use of AI to enrich educational experiences and improve exam preparation and assessment methods.

Title: Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods. (arXiv:2401.02986v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02986
Code URL: null
Copy Paste: [[2401.02986]] Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods(http://arxiv.org/abs/2401.02986)
Summary:
Organizations face the challenge of ensuring compliance with an increasing amount of requirements from various regulatory documents. Which requirements are relevant depends on aspects such as the geographic location of the organization, its domain, size, and business processes. Considering these contextual factors, as a first step, relevant documents (e.g., laws, regulations, directives, policies) are identified, followed by a more detailed analysis of which parts of the identified documents are relevant for which step of a given business process. Nowadays the identification of regulatory requirements relevant to business processes is mostly done manually by domain and legal experts, posing a tremendous effort on them, especially for a large number of regulatory documents which might frequently change. Hence, this work examines how legal and domain experts can be assisted in the assessment of relevant requirements. For this, we compare an embedding-based NLP ranking method, a generative AI method using GPT-4, and a crowdsourced method with the purely manual method of creating relevancy labels by experts. The proposed methods are evaluated based on two case studies: an Australian insurance case created with domain experts and a global banking use case, adapted from SAP Signavio's workflow example of an international guideline. A gold standard is created for both BPMN2.0 processes and matched to real-world textual requirements from multiple regulatory documents. The evaluation and discussion provide insights into strengths and weaknesses of each method regarding applicability, automation, transparency, and reproducibility and provide guidelines on which method combinations will maximize benefits for given characteristics such as process usage, impact, and dynamics of an application scenario.

Title: PIXAR: Auto-Regressive Language Modeling in Pixel Space. (arXiv:2401.03321v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.03321
Code URL: null
Copy Paste: [[2401.03321]] PIXAR: Auto-Regressive Language Modeling in Pixel Space(http://arxiv.org/abs/2401.03321)
Summary:
Recent works showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations and are implemented as encoder-decoder models that reconstruct masked image patches of rendered text. However, these pixel-based LLMs are limited to autoencoding tasks and cannot generate new text as images. As such, they cannot be used for open-answer or generative language tasks. In this work, we overcome this limitation and introduce PIXAR, the first pixel-based autoregressive LLM that does not rely on a pre-defined vocabulary for both input and output text. Consisting of only a decoder, PIXAR can answer free-form generative tasks while keeping the text representation learning performance on par with previous encoder-decoder models. Furthermore, we highlight the challenges to autoregressively generate non-blurred text as images and link this to the usual maximum likelihood objective. We propose a simple adversarial pretraining that significantly improves the readability and performance of PIXAR making it comparable to GPT2 on short text generation tasks. This paves the way to building open-vocabulary LLMs that are usable for free-form generative tasks and questions the necessity of the usual symbolic input representation -- text as tokens -- for these challenging tasks.

anomaly

Title: Forensic Video Analytic Software. (arXiv:2401.02960v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2401.02960
Code URL: null
Copy Paste: [[2401.02960]] Forensic Video Analytic Software(http://arxiv.org/abs/2401.02960)
Summary:
Law enforcement officials heavily depend on Forensic Video Analytic (FVA) Software in their evidence extraction process. However present-day FVA software are complex, time consuming, equipment dependent and expensive. Developing countries struggle to gain access to this gateway to a secure haven. The term forensic pertains the application of scientific methods to the investigation of crime through post-processing, whereas surveillance is the close monitoring of real-time feeds.

The principle objective of this Final Year Project was to develop an efficient and effective FVA Software, addressing the shortcomings through a stringent and systematic review of scholarly research papers, online databases and legal documentation. The scope spans multiple object detection, multiple object tracking, anomaly detection, activity recognition, tampering detection, general and specific image enhancement and video synopsis.

Methods employed include many machine learning techniques, GPU acceleration and efficient, integrated architecture development both for real-time and postprocessing. For this CNN, GMM, multithreading and OpenCV C++ coding were used. The implications of the proposed methodology would rapidly speed up the FVA process especially through the novel video synopsis research arena. This project has resulted in three research outcomes Moving Object Based Collision Free Video Synopsis, Forensic and Surveillance Analytic Tool Architecture and Tampering Detection Inter-Frame Forgery.

The results include forensic and surveillance panel outcomes with emphasis on video synopsis and Sri Lankan context. Principal conclusions include the optimization and efficient algorithm integration to overcome limitations in processing power, memory and compromise between real-time performance and accuracy.

Title: Multi-View 3D Instance Segmentation of Structural Anomalies for Enhanced Structural Inspection of Concrete Bridges. (arXiv:2401.03298v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2401.03298
Code URL: null
Copy Paste: [[2401.03298]] Multi-View 3D Instance Segmentation of Structural Anomalies for Enhanced Structural Inspection of Concrete Bridges(http://arxiv.org/abs/2401.03298)
Summary:
For effective structural damage assessment, the instances of damages need to be localized in the world of a 3D model. Due to a lack of data, the detection of structural anomalies can currently not be directly learned and performed in 3D space. In this work, a three-stage approach is presented, which uses the good performance of detection models on image level to segment instances of anomalies in the 3D space. In the detection stage, semantic segmentation predictions are produced on image level. The mapping stage transfers the image-level prediction onto the respective point cloud. In the extraction stage, 3D anomaly instances are extracted from the segmented point cloud. Cloud contraction is used to transform cracks into their medial axis representation. For areal anomalies the bounding polygon is extracted by means of alpha shapes. The approach covers the classes crack, spalling, and corrosion and the three image-level segmentation models TopoCrack, nnU-Net, and DetectionHMA are compared. Granted a localization tolerance of 4cm, IoUs of over 90% can be achieved for crack and corrosion and 41% for spalling, which appears to be a specifically challenging class. Detection on instance-level measured in AP is about 45% for crack and spalling and 73% for corrosion.

Title: Deep Anomaly Detection in Text. (arXiv:2401.02971v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2401.02971
Code URL: null
Copy Paste: [[2401.02971]] Deep Anomaly Detection in Text(http://arxiv.org/abs/2401.02971)
Summary:
Deep anomaly detection methods have become increasingly popular in recent years, with methods like Stacked Autoencoders, Variational Autoencoders, and Generative Adversarial Networks greatly improving the state-of-the-art. Other methods rely on augmenting classical models (such as the One-Class Support Vector Machine), by learning an appropriate kernel function using Neural Networks. Recent developments in representation learning by self-supervision are proving to be very beneficial in the context of anomaly detection. Inspired by the advancements in anomaly detection using self-supervised learning in the field of computer vision, this thesis aims to develop a method for detecting anomalies by exploiting pretext tasks tailored for text corpora. This approach greatly improves the state-of-the-art on two datasets, 20Newsgroups, and AG News, for both semi-supervised and unsupervised anomaly detection, thus proving the potential for self-supervised anomaly detectors in the field of natural language processing.

Title: Attention and Autoencoder Hybrid Model for Unsupervised Online Anomaly Detection. (arXiv:2401.03322v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03322
Code URL: null
Copy Paste: [[2401.03322]] Attention and Autoencoder Hybrid Model for Unsupervised Online Anomaly Detection(http://arxiv.org/abs/2401.03322)
Summary:
This paper introduces a hybrid attention and autoencoder (AE) model for unsupervised online anomaly detection in time series. The autoencoder captures local structural patterns in short embeddings, while the attention model learns long-term features, facilitating parallel computing with positional encoding. Unique in its approach, our proposed hybrid model combines attention and autoencoder for the first time in time series anomaly detection. It employs an attention-based mechanism, akin to the deep transformer model, with key architectural modifications for predicting the next time step window in the autoencoder's latent space. The model utilizes a threshold from the validation dataset for anomaly detection and introduces an alternative method based on analyzing the first statistical moment of error, improving accuracy without dependence on a validation dataset. Evaluation on diverse real-world benchmark datasets and comparing with other well-established models, confirms the effectiveness of our proposed model in anomaly detection.

Title: Weakly Augmented Variational Autoencoder in Time Series Anomaly Detection. (arXiv:2401.03341v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2401.03341
Code URL: null
Copy Paste: [[2401.03341]] Weakly Augmented Variational Autoencoder in Time Series Anomaly Detection(http://arxiv.org/abs/2401.03341)
Summary:
Due to their unsupervised training and uncertainty estimation, deep Variational Autoencoders (VAEs) have become powerful tools for reconstruction-based Time Series Anomaly Detection (TSAD). Existing VAE-based TSAD methods, either statistical or deep, tune meta-priors to estimate the likelihood probability for effectively capturing spatiotemporal dependencies in the data. However, these methods confront the challenge of inherent data scarcity, which is often the case in anomaly detection tasks. Such scarcity easily leads to latent holes, discontinuous regions in latent space, resulting in non-robust reconstructions on these discontinuous spaces. We propose a novel generative framework that combines VAEs with self-supervised learning (SSL) to address this issue.

2024-01-09

diffusion

Title: Latte: Latent Diffusion Transformer for Video Generation. (arXiv:2401.03048v1 [cs.CV])

Title: SAR Despeckling via Regional Denoising Diffusion Probabilistic Model. (arXiv:2401.03122v1 [cs.CV])

Title: Controllable Image Synthesis of Industrial Data Using Stable Diffusion. (arXiv:2401.03152v1 [cs.CV])

Title: An Event-Oriented Diffusion-Refinement Method for Sparse Events Completion. (arXiv:2401.03153v1 [cs.CV])

Title: PosDiffNet: Positional Neural Diffusion for Point Cloud Registration in a Large Field of View with Perturbations. (arXiv:2401.03167v1 [cs.CV])

Title: MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond. (arXiv:2401.03221v1 [cs.CV])

Title: Image Inpainting via Tractable Steering of Diffusion Models. (arXiv:2401.03349v1 [cs.CV])

Title: The Rise of Diffusion Models in Time-Series Forecasting. (arXiv:2401.03006v1 [cs.LG])

Title: Fair Sampling in Diffusion Models through Switching Mechanism. (arXiv:2401.03140v1 [cs.LG])

self-supervised

Title: Dress-Me-Up: A Dataset & Method for Self-Supervised 3D Garment Retargeting. (arXiv:2401.03108v1 [cs.CV])

Title: Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection. (arXiv:2401.03145v1 [cs.CV])

Title: Preserving Silent Features for Domain Generalization. (arXiv:2401.03170v1 [cs.LG])

Title: Exploiting Data Hierarchy as a New Modality for Contrastive Learning. (arXiv:2401.03312v1 [cs.CV])

Title: TimeGraphs: Graph-based Temporal Reasoning. (arXiv:2401.03134v1 [cs.LG])

Title: Understanding Representation Learnability of Nonlinear Self-Supervised Learning. (arXiv:2401.03214v1 [cs.LG])

foundation model

Title: AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident Analysis. (arXiv:2401.03040v1 [cs.LG])

generative

Title: A Surrogate-Assisted Extended Generative Adversarial Network for Parameter Optimization in Free-Form Metasurface Design. (arXiv:2401.02961v1 [cs.LG])

Title: A Physics-guided Generative AI Toolkit for Geophysical Monitoring. (arXiv:2401.03131v1 [cs.LG])

Title: Learning from a Generative AI Predecessor -- The Many Motivations for Interacting with Conversational Agents. (arXiv:2401.02978v1 [cs.CL])

Title: Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education. (arXiv:2401.02985v1 [cs.CL])

Title: Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods. (arXiv:2401.02986v1 [cs.CL])

Title: PIXAR: Auto-Regressive Language Modeling in Pixel Space. (arXiv:2401.03321v1 [cs.CL])

anomaly

Title: Forensic Video Analytic Software. (arXiv:2401.02960v1 [cs.CR])

Title: Multi-View 3D Instance Segmentation of Structural Anomalies for Enhanced Structural Inspection of Concrete Bridges. (arXiv:2401.03298v1 [cs.CV])

Title: Deep Anomaly Detection in Text. (arXiv:2401.02971v1 [cs.CL])

Title: Attention and Autoencoder Hybrid Model for Unsupervised Online Anomaly Detection. (arXiv:2401.03322v1 [cs.LG])

Title: Weakly Augmented Variational Autoencoder in Time Series Anomaly Detection. (arXiv:2401.03341v1 [cs.LG])

in-context