2024-08-27

Title: Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

Authors: Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13335
Pdf URL: https://arxiv.org/pdf/2408.13335
Copy Paste: [[2408.13335]] Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing(https://arxiv.org/abs/2408.13335)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image(T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images, remain largely unexplored. Through our investigation of DiT's latent space, we have uncovered key findings that unlock the potential for zero-shot fine-grained semantic editing: (1) Both the text and image spaces in DiTs are inherently decomposable. (2) These spaces collectively form a disentangled semantic representation space, enabling precise and fine-grained semantic control. (3) Effective image editing requires the combined use of both text and image latent spaces. Leveraging these insights, we propose a simple and effective Extract-Manipulate-Sample (EMS) framework for zero-shot fine-grained image editing. Our approach first utilizes a multi-modal Large Language Model to convert input images and editing targets into text descriptions. We then linearly manipulate text embeddings based on the desired editing degree and employ constrained score distillation sampling to manipulate image embeddings. We quantify the disentanglement degree of the latent space of diffusion models by proposing a new metric. To evaluate fine-grained editing performance, we introduce a comprehensive benchmark incorporating both human annotations, manual evaluation, and automatic metrics. We have conducted extensive experimental results and in-depth analysis to thoroughly uncover the semantic disentanglement properties of the diffusion transformer, as well as the effectiveness of our proposed method. Our annotated benchmark dataset is publicly available at this https URL, facilitating reproducible research in this domain.

Title: Shape-Preserving Generation of Food Images for Automatic Dietary Assessment

Authors: Guangzong Chen, Zhi-Hong Mao, Mingui Sun, Kangni Liu, Wenyan Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13358
Pdf URL: https://arxiv.org/pdf/2408.13358
Copy Paste: [[2408.13358]] Shape-Preserving Generation of Food Images for Automatic Dietary Assessment(https://arxiv.org/abs/2408.13358)
Keywords: generative
Abstract: Traditional dietary assessment methods heavily rely on self-reporting, which is time-consuming and prone to bias. Recent advancements in Artificial Intelligence (AI) have revealed new possibilities for dietary assessment, particularly through analysis of food images. Recognizing foods and estimating food volumes from images are known as the key procedures for automatic dietary assessment. However, both procedures required large amounts of training images labeled with food names and volumes, which are currently unavailable. Alternatively, recent studies have indicated that training images can be artificially generated using Generative Adversarial Networks (GANs). Nonetheless, convenient generation of large amounts of food images with known volumes remain a challenge with the existing techniques. In this work, we present a simple GAN-based neural network architecture for conditional food image generation. The shapes of the food and container in the generated images closely resemble those in the reference input image. Our experiments demonstrate the realism of the generated images and shape-preserving capabilities of the proposed framework.

Title: Generative Blockchain: Transforming Blockchain from Transaction Recording to Transaction Generation through Proof-of-Merit

Authors: Haozhao Zhang, Zhe Zhang, Zhiqiang Zheng, Varghese Jacob
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2408.13367
Pdf URL: https://arxiv.org/pdf/2408.13367
Copy Paste: [[2408.13367]] Generative Blockchain: Transforming Blockchain from Transaction Recording to Transaction Generation through Proof-of-Merit(https://arxiv.org/abs/2408.13367)
Keywords: generative
Abstract: This paper proposes a new paradigm: generative blockchain, which aims to transform conventional blockchain technology by combining transaction generation and recording, rather than focusing solely on transaction recording. Central to our design is a novel consensus mechanism, Proof-of-Merit (PoM), specifically crafted for environments where businesses must solve complex problems before transactions can be recorded. PoM integrates the generation and recording of transactions within a unified blockchain system, fundamentally differing from prevailing consensus mechanisms that primarily record existing transactions. We demonstrate PoM on a ride service on-demand platform, where the task of solving complex transaction-generating problems is delegated to a pool of independent problem solvers. These solvers generate transactions, and their solutions are selected based on merit. The winning solvers then register these transactions onto the blockchain and are rewarded accordingly. We introduce a Decentralized Control Parameter (DCP) to balance two key performance metrics: efficiency and equity. The applicability of our generative blockchain is illustrated through a ridesharing context, where matchers (solvers) are tasked with matching riders to drivers. We demonstrate PoM's performance and nuanced properties using agent-based simulation, exploring how to find the optimal DCP value to achieve a desirable balance of efficiency and equity in a generative blockchain.

Title: Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

Authors: Yangyang Xu, Wenqi Shao, Yong Du, Haiming Zhu, Yang Zhou, Ping Luo, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13395
Pdf URL: https://arxiv.org/pdf/2408.13395
Copy Paste: [[2408.13395]] Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing(https://arxiv.org/abs/2408.13395)
Keywords: diffusion
Abstract: Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce \textbf{T}ask-\textbf{O}riented \textbf{D}iffusion \textbf{I}nversion (\textbf{TODInv}), a novel framework that inverts and edits real images tailored to specific editing tasks by optimizing prompt embeddings within the extended $\mathcal{P}^*$ space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. This hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, optimizing only those embeddings unaffected by the current editing task. Extensive experiments on benchmark dataset reveal TODInv's superior performance over existing methods, delivering both quantitative and qualitative enhancements while showcasing its versatility with few-step diffusion model.

Title: TVG: A Training-free Transition Video Generation Method with Diffusion Models

Authors: Rui Zhang, Yaosen Chen, Yuegen Liu, Wei Wang, Xuming Wen, Hongxia Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13413
Pdf URL: https://arxiv.org/pdf/2408.13413
Copy Paste: [[2408.13413]] TVG: A Training-free Transition Video Generation Method with Diffusion Models(https://arxiv.org/abs/2408.13413)
Keywords: diffusion
Abstract: Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives. Traditional methods like morphing often lack artistic appeal and require specialized skills, limiting their effectiveness. Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes. We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training. Our method leverages Gaussian Process Regression ($\mathcal{GPR}$) to model latent representations, ensuring smooth and dynamic transitions between frames. Additionally, we introduce interpolation-based conditional controls and a Frequency-aware Bidirectional Fusion (FBiF) architecture to enhance temporal control and transition reliability. Evaluations of benchmark datasets and custom image pairs demonstrate the effectiveness of our approach in generating high-quality smooth transition videos. The code are provided in this https URL.

Title: Training-free Long Video Generation with Chain of Diffusion Model Experts

Authors: Wenhao Li, Yichao Cao, Xie Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13423
Pdf URL: https://arxiv.org/pdf/2408.13423
Copy Paste: [[2408.13423]] Training-free Long Video Generation with Chain of Diffusion Model Experts(https://arxiv.org/abs/2408.13423)
Keywords: diffusion
Abstract: Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models need high computational costs and produce suboptimal results due to high complexity of video generation task. In this paper, we propose \textbf{ConFiner}, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure \textbf{con}trol and spatial-temporal re\textbf{fine}ment. It can generate high-quality videos with chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask. During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts' capabilities into a single sampling. Furthermore, we design ConFiner-Long framework, which can generate long coherent video with three constraint strategies on ConFiner. Experimental results indicate that with only 10\% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. And ConFiner-Long can generate high-quality and coherent videos with up to 600 frames.

Title: Integrating Multi-Head Convolutional Encoders with Cross-Attention for Improved SPARQL Query Translation

Authors: Yi-Hui Chen, Eric Jui-Lin Lu, Kwan-Ho Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.13432
Pdf URL: https://arxiv.org/pdf/2408.13432
Copy Paste: [[2408.13432]] Integrating Multi-Head Convolutional Encoders with Cross-Attention for Improved SPARQL Query Translation(https://arxiv.org/abs/2408.13432)
Keywords: generative
Abstract: The main task of the KGQA system (Knowledge Graph Question Answering) is to convert user input questions into query syntax (such as SPARQL). With the rise of modern popular encoders and decoders like Transformer and ConvS2S, many scholars have shifted the research direction of SPARQL generation to the Neural Machine Translation (NMT) architecture or the generative AI field of Text-to-SPARQL. In NMT-based QA systems, the system treats knowledge base query syntax as a language. It uses NMT-based translation models to translate natural language questions into query syntax. Scholars use popular architectures equipped with cross-attention, such as Transformer, ConvS2S, and BiLSTM, to train translation models for query syntax. To achieve better query results, this paper improved the ConvS2S encoder and added multi-head attention from the Transformer, proposing a Multi-Head Conv encoder (MHC encoder) based on the n-gram language model. The principle is to use convolutional layers to capture local hidden features in the input sequence with different receptive fields, using multi-head attention to calculate dependencies between them. Ultimately, we found that the translation model based on the Multi-Head Conv encoder achieved better performance than other encoders, obtaining 76.52\% and 83.37\% BLEU-1 (BiLingual Evaluation Understudy) on the QALD-9 and LC-QuAD-1.0 datasets, respectively. Additionally, in the end-to-end system experiments on the QALD-9 and LC-QuAD-1.0 datasets, we achieved leading results over other KGQA systems, with Macro F1-measures reaching 52\% and 66\%, respectively. Moreover, the experimental results show that with limited computational resources, if one possesses an excellent encoder-decoder architecture and cross-attention, experts and scholars can achieve outstanding performance equivalent to large pre-trained models using only general embeddings.

Title: Explainable Concept Generation through Vision-Language Preference Learning

Authors: Aditya Taparia, Som Sagar, Ransalu Senanayake
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.13438
Pdf URL: https://arxiv.org/pdf/2408.13438
Copy Paste: [[2408.13438]] Explainable Concept Generation through Vision-Language Preference Learning(https://arxiv.org/abs/2408.13438)
Keywords: generative
Abstract: Concept-based explanations have become a popular choice for explaining deep neural networks post-hoc because, unlike most other explainable AI techniques, they can be used to test high-level visual "concepts" that are not directly related to feature attributes. For instance, the concept of "stripes" is important to classify an image as a zebra. Concept-based explanation methods, however, require practitioners to guess and collect multiple candidate concept image sets, which can often be imprecise and labor-intensive. Addressing this limitation, in this paper, we frame concept image set creation as an image generation problem. However, since naively using a generative model does not result in meaningful concepts, we devise a reinforcement learning-based preference optimization algorithm that fine-tunes the vision-language generative model from approximate textual descriptions of concepts. Through a series of experiments, we demonstrate the capability of our method to articulate complex, abstract concepts that are otherwise challenging to craft manually. In addition to showing the efficacy and reliability of our method, we show how our method can be used as a diagnostic tool for analyzing neural networks.

Title: Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

Authors: Chen Rao, Guangyuan Li, Zehua Lan, Jiakai Sun, Junsheng Luan, Wei Xing, Lei Zhao, Huaizhong Lin, Jianfeng Dong, Dalong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13459
Pdf URL: https://arxiv.org/pdf/2408.13459
Copy Paste: [[2408.13459]] Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model(https://arxiv.org/abs/2408.13459)
Keywords: diffusion
Abstract: Current video deblurring methods have limitations in recovering high-frequency information since the regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to the video deblurring task has the following problems: (1) DMs require many iteration steps to generate videos from Gaussian noise, which consumes many computational resources. (2) DMs are easily misled by the blurry artifacts in the video, resulting in irrational content and distortion of the deblurred video. To address the above issues, we propose a novel video deblurring framework VD-Diff that integrates the diffusion model into the Wavelet-Aware Dynamic Transformer (WADT). Specifically, we perform the diffusion model in a highly compact latent space to generate prior features containing high-frequency information that conforms to the ground truth distribution. We design the WADT to preserve and recover the low-frequency information in the video while utilizing the high-frequency information generated by the diffusion model. Extensive experiments show that our proposed VD-Diff outperforms SOTA methods on GoPro, DVD, BSD, and Real-World Video datasets.

Title: DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction

Authors: Xinwei Zhang, Zhiqi Bu, Mingyi Hong, Meisam Razaviyayn
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2408.13460
Pdf URL: https://arxiv.org/pdf/2408.13460
Copy Paste: [[2408.13460]] DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction(https://arxiv.org/abs/2408.13460)
Keywords: foundation model
Abstract: Privacy is a growing concern in modern deep-learning systems and applications. Differentially private (DP) training prevents the leakage of sensitive information in the collected training data from the trained machine learning models. DP optimizers, including DP stochastic gradient descent (DPSGD) and its variants, privatize the training procedure by gradient clipping and DP noise injection. However, in practice, DP models trained using DPSGD and its variants often suffer from significant model performance degradation. Such degradation prevents the application of DP optimization in many key tasks, such as foundation model pretraining. In this paper, we provide a novel signal processing perspective to the design and analysis of DP optimizers. We show that a ``frequency domain'' operation called low-pass filtering can be used to effectively reduce the impact of DP noise. More specifically, by defining the ``frequency domain'' for both the gradient and differential privacy (DP) noise, we have developed a new component, called DOPPLER. This component is designed for DP algorithms and works by effectively amplifying the gradient while suppressing DP noise within this frequency domain. As a result, it maintains privacy guarantees and enhances the quality of the DP-protected model. Our experiments show that the proposed DP optimizers with a low-pass filter outperform their counterparts without the filter by 3%-10% in test accuracy on various models and datasets. Both theoretical and practical evidence suggest that the DOPPLER is effective in closing the gap between DP and non-DP training.

Title: Disentangled Generative Graph Representation Learning

Authors: Xinyue Hu, Zhibin Duan, Xinyang Liu, Yuxin Li, Bo Chen, Mingyuan Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.13471
Pdf URL: https://arxiv.org/pdf/2408.13471
Copy Paste: [[2408.13471]] Disentangled Generative Graph Representation Learning(https://arxiv.org/abs/2408.13471)
Keywords: self-supervised, generative
Abstract: Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.

Title: DualAnoDiff: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation

Authors: Ying Jin, Jinlong Peng, Qingdong He, Teng Hu, Hao Chen, Jiafu Wu, Wenbing Zhu, Mingmin Chi, Jun Liu, Yabiao Wang, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13509
Pdf URL: https://arxiv.org/pdf/2408.13509
Copy Paste: [[2408.13509]] DualAnoDiff: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation(https://arxiv.org/abs/2408.13509)
Keywords: diffusion, anomaly
Abstract: The performance of anomaly inspection in industrial manufacturing is constrained by the scarcity of anomaly data. To overcome this challenge, researchers have started employing anomaly generation approaches to augment the anomaly dataset. However, existing anomaly generation methods suffer from limited diversity in the generated anomalies and struggle to achieve a seamless blending of this anomaly with the original image. In this paper, we overcome these challenges from a new perspective, simultaneously generating a pair of the overall image and the corresponding anomaly part. We propose DualAnoDiff, a novel diffusion-based few-shot anomaly image generation model, which can generate diverse and realistic anomaly images by using a dual-interrelated diffusion model, where one of them is employed to generate the whole image while the other one generates the anomaly part. Moreover, we extract background and shape information to mitigate the distortion and blurriness phenomenon in few-shot image generation. Extensive experiments demonstrate the superiority of our proposed model over state-of-the-art methods in terms of both realism and diversity. Overall, our approach significantly improves the performance of downstream anomaly detection tasks, including anomaly detection, anomaly localization, and anomaly classification tasks.

Title: AnoPLe: Few-Shot Anomaly Detection via Bi-directional Prompt Learning with Only Normal Samples

Authors: Yujin Lee, Seoyoon Jang, Hyunsoo Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.13516
Pdf URL: https://arxiv.org/pdf/2408.13516
Copy Paste: [[2408.13516]] AnoPLe: Few-Shot Anomaly Detection via Bi-directional Prompt Learning with Only Normal Samples(https://arxiv.org/abs/2408.13516)
Keywords: anomaly
Abstract: Few-shot Anomaly Detection (FAD) poses significant challenges due to the limited availability of training samples and the frequent absence of abnormal samples. Previous approaches often rely on annotations or true abnormal samples to improve detection, but such textual or visual cues are not always accessible. To address this, we introduce AnoPLe, a multi-modal prompt learning method designed for anomaly detection without prior knowledge of anomalies. AnoPLe simulates anomalies and employs bidirectional coupling of textual and visual prompts to facilitate deep interaction between the two modalities. Additionally, we integrate a lightweight decoder with a learnable multi-view signal, trained on multi-scale images to enhance local semantic comprehension. To further improve performance, we align global and local semantics, enriching the image-level understanding of anomalies. The experimental results demonstrate that AnoPLe achieves strong FAD performance, recording 94.1% and 86.2% Image AUROC on MVTec-AD and VisA respectively, with only around a 1% gap compared to the SoTA, despite not being exposed to true anomalies. Code is available at this https URL.

Title: Variational Autoencoder for Anomaly Detection: A Comparative Study

Authors: Huy Hoang Nguyen, Cuong Nhat Nguyen, Xuan Tung Dao, Quoc Trung Duong, Dzung Pham Thi Kim, Minh-Tan Pham
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.13561
Pdf URL: https://arxiv.org/pdf/2408.13561
Copy Paste: [[2408.13561]] Variational Autoencoder for Anomaly Detection: A Comparative Study(https://arxiv.org/abs/2408.13561)
Keywords: anomaly
Abstract: This paper aims to conduct a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection, elucidating their performance and behavioral characteristics within this specific task. The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a vision transformer (ViT-VAE). The findings reveal that ViT-VAE exhibits exemplary performance across various scenarios, whereas VAE-GRF may necessitate more intricate hyperparameter tuning to attain its optimal performance state. Additionally, to mitigate the propensity for over-reliance on results derived from the widely used MVTec dataset, this paper leverages the recently-public MiAD dataset for benchmarking. This deliberate inclusion seeks to enhance result competitiveness by alleviating the impact of domain-specific models tailored exclusively for MVTec, thereby contributing to a more robust evaluation framework. Codes is available at this https URL.

Title: Can Visual Foundation Models Achieve Long-term Point Tracking?

Authors: Görkay Aydemir, Weidi Xie, Fatma Güney
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13575
Pdf URL: https://arxiv.org/pdf/2408.13575
Copy Paste: [[2408.13575]] Can Visual Foundation Models Achieve Long-term Point Tracking?(https://arxiv.org/abs/2408.13575)
Keywords: diffusion, foundation model
Abstract: Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities. While their proficiency in two-view correspondence has been explored, their effectiveness in long-term correspondence within complex environments remains unexplored. To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking: (i) in zero-shot settings, without any training; (ii) by probing with low-capacity layers; (iii) by fine-tuning with Low Rank Adaptation (LoRA). Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings. Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.

Title: Preliminary Investigations of a Multi-Faceted Robust and Synergistic Approach in Semiconductor Electron Micrograph Analysis: Integrating Vision Transformers with Large Language and Multimodal Models

Authors: Sakhinana Sagar Srinivas, Geethan Sannidhi, Sreeja Gangasani, Chidaksh Ravuru, Venkataramana Runkana
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.13621
Pdf URL: https://arxiv.org/pdf/2408.13621
Copy Paste: [[2408.13621]] Preliminary Investigations of a Multi-Faceted Robust and Synergistic Approach in Semiconductor Electron Micrograph Analysis: Integrating Vision Transformers with Large Language and Multimodal Models(https://arxiv.org/abs/2408.13621)
Keywords: generative, in-context
Abstract: Characterizing materials using electron micrographs is crucial in areas such as semiconductors and quantum materials. Traditional classification methods falter due to the intricatestructures of these micrographs. This study introduces an innovative architecture that leverages the generative capabilities of zero-shot prompting in Large Language Models (LLMs) such as GPT-4(language only), the predictive ability of few-shot (in-context) learning in Large Multimodal Models (LMMs) such as GPT-4(V)ision, and fuses knowledge across image based and linguistic insights for accurate nanomaterial category prediction. This comprehensive approach aims to provide a robust solution for the automated nanomaterial identification task in semiconductor manufacturing, blending performance, efficiency, and interpretability. Our method surpasses conventional approaches, offering precise nanomaterial identification and facilitating high-throughput screening.

Title: Prompt-Softbox-Prompt: A free-text Embedding Control for Image Editing

Authors: Yitong Yang, Yinglin Wang, Jing Wang, Tian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13623
Pdf URL: https://arxiv.org/pdf/2408.13623
Copy Paste: [[2408.13623]] Prompt-Softbox-Prompt: A free-text Embedding Control for Image Editing(https://arxiv.org/abs/2408.13623)
Keywords: diffusion
Abstract: Text-driven diffusion models have achieved remarkable success in image editing, but a crucial component in these models-text embeddings-has not been fully explored. The entanglement and opacity of text embeddings present significant challenges to achieving precise image editing. In this paper, we provide a comprehensive and in-depth analysis of text embeddings in Stable Diffusion XL, offering three key insights. First, while the 'aug_embedding' captures the full semantic content of the text, its contribution to the final image generation is relatively minor. Second, 'BOS' and 'Padding_embedding' do not contain any semantic information. Lastly, the 'EOS' holds the semantic information of all words and contains the most style features. Each word embedding plays a unique role without interfering with one another. Based on these insights, we propose a novel approach for controllable image editing using a free-text embedding control method called PSP (Prompt-Softbox-Prompt). PSP enables precise image editing by inserting or adding text embeddings within the cross-attention layers and using Softbox to define and control the specific area for semantic injection. This technique allows for obejct additions and replacements while preserving other areas of the image. Additionally, PSP can achieve style transfer by simply replacing text embeddings. Extensive experimental results show that PSP achieves significant results in tasks such as object replacement, object addition, and style transfer.

Title: Towards Case-based Interpretability for Medical Federated Learning

Authors: Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.13626
Pdf URL: https://arxiv.org/pdf/2408.13626
Copy Paste: [[2408.13626]] Towards Case-based Interpretability for Medical Federated Learning(https://arxiv.org/abs/2408.13626)
Keywords: generative
Abstract: We explore deep generative models to generate case-based explanations in a medical federated learning setting. Explaining AI model decisions through case-based interpretability is paramount to increasing trust and allowing widespread adoption of AI in clinical practice. However, medical AI training paradigms are shifting towards federated learning settings in order to comply with data protection regulations. In a federated scenario, past data is inaccessible to the current user. Thus, we use a deep generative model to generate synthetic examples that protect privacy and explain decisions. Our proof-of-concept focuses on pleural effusion diagnosis and uses publicly available Chest X-ray data.

Title: Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer

Authors: Nada Osman, Marwan Torki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13643
Pdf URL: https://arxiv.org/pdf/2408.13643
Copy Paste: [[2408.13643]] Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer(https://arxiv.org/abs/2408.13643)
Keywords: anomaly
Abstract: Anomaly action detection and localization play an essential role in security and advanced surveillance systems. However, due to the tremendous amount of surveillance videos, most of the available data for the task is unlabeled or semi-labeled with the video class known, but the location of the anomaly event is unknown. In this work, we target anomaly localization in semi-supervised videos. While the mainstream direction in addressing this task is focused on segment-level multi-instance learning and the generation of pseudo labels, we aim to explore a promising yet unfulfilled direction to solve the problem by learning the temporal relations within videos in order to locate anomaly events. To this end, we propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos with a divide-and-conquer strategy along the temporal axis. Our approach segments a parent video hierarchically into multiple temporal children instances and measures the influence of the children nodes in classifying the abnormality of the parent video. Evaluating our model on two well-known anomaly detection datasets, UCF-crime and ShanghaiTech, proves its ability to interpret the observed actions within videos and localize the anomalous ones. Our proposed approach outperforms previous works relying on segment-level multiple-instance learning approaches while reaching a promising performance compared to the more recent pseudo-labeling-based approaches.

Title: GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

Authors: Keqiang Sun, Amin Jourabloo, Riddhish Bhalodia, Moustafa Meshry, Yu Rong, Zhengyu Yang, Thu Nguyen-Phuoc, Christian Haene, Jiu Xu, Sam Johnson, Hongsheng Li, Sofien Bouaziz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13674
Pdf URL: https://arxiv.org/pdf/2408.13674
Copy Paste: [[2408.13674]] GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars(https://arxiv.org/abs/2408.13674)
Keywords: diffusion, generative
Abstract: Photo-realistic and controllable 3D avatars are crucial for various applications such as virtual and mixed reality (VR/MR), telepresence, gaming, and film production. Traditional methods for avatar creation often involve time-consuming scanning and reconstruction processes for each avatar, which limits their scalability. Furthermore, these methods do not offer the flexibility to sample new identities or modify existing ones. On the other hand, by learning a strong prior from data, generative models provide a promising alternative to traditional reconstruction methods, easing the time constraints for both data capture and processing. Additionally, generative methods enable downstream applications beyond reconstruction, such as editing and stylization. Nonetheless, the research on generative 3D avatars is still in its infancy, and therefore current methods still have limitations such as creating static avatars, lacking photo-realism, having incomplete facial details, or having limited drivability. To address this, we propose a text-conditioned generative model that can generate photo-realistic facial avatars of diverse identities, with more complete details like hair, eyes and mouth interior, and which can be driven through a powerful non-parametric latent expression space. Specifically, we integrate the generative and editing capabilities of latent diffusion models with a strong prior model for avatar expression driving. Our model can generate and control high-fidelity avatars, even those out-of-distribution. We also highlight its potential for downstream applications, including avatar editing and single-shot avatar reconstruction.

Title: A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models

Authors: Antón de la Fuente, Dan Jurafsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.13678
Pdf URL: https://arxiv.org/pdf/2408.13678
Copy Paste: [[2408.13678]] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models(https://arxiv.org/abs/2408.13678)
Keywords: self-supervised
Abstract: This study asks how self-supervised speech models represent suprasegmental categories like Mandarin lexical tone, English lexical stress, and English phrasal accents. Through a series of probing tasks, we make layer-wise comparisons of English and Mandarin 12 layer monolingual models. Our findings suggest that 1) English and Mandarin wav2vec 2.0 models learn contextual representations of abstract suprasegmental categories which are strongest in the middle third of the network. 2) Models are better at representing features that exist in the language of their training data, and this difference is driven by enriched context in transformer blocks, not local acoustic representation. 3) Fine-tuned wav2vec 2.0 improves performance in later layers compared to pre-trained models mainly for lexically contrastive features like tone and stress, 4) HuBERT and WavLM learn similar representations to wav2vec 2.0, differing mainly in later layer performance. Our results extend previous understanding of how models represent suprasegmentals and offer new insights into the language-specificity and contextual nature of these representations.

Title: Guided and Fused: Efficient Frozen CLIP-ViT with Feature Guidance and Multi-Stage Feature Fusion for Generalizable Deepfake Detection

Authors: Yingjian Chen, Lei Zhang, Yakun Niu, Pei Chen, Lei Tan, Jing Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13697
Pdf URL: https://arxiv.org/pdf/2408.13697
Copy Paste: [[2408.13697]] Guided and Fused: Efficient Frozen CLIP-ViT with Feature Guidance and Multi-Stage Feature Fusion for Generalizable Deepfake Detection(https://arxiv.org/abs/2408.13697)
Keywords: diffusion, generative
Abstract: The rise of generative models has sparked concerns about image authenticity online, highlighting the urgent need for an effective and general detector. Recent methods leveraging the frozen pre-trained CLIP-ViT model have made great progress in deepfake detection. However, these models often rely on visual-general features directly extracted by the frozen network, which contain excessive information irrelevant to the task, resulting in limited detection performance. To address this limitation, in this paper, we propose an efficient Guided and Fused Frozen CLIP-ViT (GFF), which integrates two simple yet effective modules. The Deepfake-Specific Feature Guidance Module (DFGM) guides the frozen pre-trained model in extracting features specifically for deepfake detection, reducing irrelevant information while preserving its generalization capabilities. The Multi-Stage Fusion Module (FuseFormer) captures low-level and high-level information by fusing features extracted from each stage of the ViT. This dual-module approach significantly improves deepfake detection by fully leveraging CLIP-ViT's inherent advantages. Extensive experiments demonstrate the effectiveness and generalization ability of GFF, which achieves state-of-the-art performance with optimal results in only 5 training epochs. Even when trained on only 4 classes of ProGAN, GFF achieves nearly 99% accuracy on unseen GANs and maintains an impressive 97% accuracy on unseen diffusion models.

Title: SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Authors: Wenrui Li, Yapeng Mi, Fucheng Cai, Zhe Yang, Wangmeng Zuo, Xingtao Wang, Xiaopeng Fan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2408.13711
Pdf URL: https://arxiv.org/pdf/2408.13711
Copy Paste: [[2408.13711]] SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting(https://arxiv.org/abs/2408.13711)
Keywords: generative
Abstract: Text-driven 3D scene generation has seen significant advancements recently. However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. This independent generation for each view often results in spatial inconsistency and implausibility in the 3D scenes. To address this challenge, we proposed a novel text-driven 3D-consistent scene generation model: SceneDreamer360. Our proposed method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images. Specifically, SceneDreamer360 enhances the fine-tuned Panfusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images. During the 3D scene construction, a novel point cloud fusion initialization method is used, producing higher quality and spatially consistent point clouds. Our extensive experiments demonstrate that compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher quality, spatially consistent, and visually appealing 3D scenes from any text prompt. Our codes are available at \url{this https URL}.

Title: PhysPart: Physically Plausible Part Completion for Interactable Objects

Authors: Rundong Luo, Haoran Geng, Congyue Deng, Puhao Li, Zan Wang, Baoxiong Jia, Leonidas Guibas, Siyuang Huang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2408.13724
Pdf URL: https://arxiv.org/pdf/2408.13724
Copy Paste: [[2408.13724]] PhysPart: Physically Plausible Part Completion for Interactable Objects(https://arxiv.org/abs/2408.13724)
Keywords: diffusion, generative
Abstract: Interactable objects are ubiquitous in our daily lives. Recent advances in 3D generative models make it possible to automate the modeling of these objects, benefiting a range of applications from 3D printing to the creation of robot simulation environments. However, while significant progress has been made in modeling 3D shapes and appearances, modeling object physics, particularly for interactable objects, remains challenging due to the physical constraints imposed by inter-part motions. In this paper, we tackle the problem of physically plausible part completion for interactable objects, aiming to generate 3D parts that not only fit precisely into the object but also allow smooth part motions. To this end, we propose a diffusion-based part generation model that utilizes geometric conditioning through classifier-free guidance and formulates physical constraints as a set of stability and mobility losses to guide the sampling process. Additionally, we demonstrate the generation of dependent parts, paving the way toward sequential part generation for objects with complex part-whole hierarchies. Experimentally, we introduce a new metric for measuring physical plausibility based on motion success rates. Our model outperforms existing baselines over shape and physical metrics, especially those that do not adequately model physical constraints. We also demonstrate our applications in 3D printing, robot manipulation, and sequential part generation, showing our strength in realistic tasks with the demand for high physical plausibility.

Title: Localization of Synthetic Manipulations in Western Blot Images

Authors: Anmol Manjunath, Viola Negroni, Sara Mandelli, Daniel Moreira, Paolo Bestagini
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2408.13786
Pdf URL: https://arxiv.org/pdf/2408.13786
Copy Paste: [[2408.13786]] Localization of Synthetic Manipulations in Western Blot Images(https://arxiv.org/abs/2408.13786)
Keywords: generative
Abstract: Recent breakthroughs in deep learning and generative systems have significantly fostered the creation of synthetic media, as well as the local alteration of real content via the insertion of highly realistic synthetic manipulations. Local image manipulation, in particular, poses serious challenges to the integrity of digital content and societal trust. This problem is not only confined to multimedia data, but also extends to biological images included in scientific publications, like images depicting Western blots. In this work, we address the task of localizing synthetic manipulations in Western blot images. To discriminate between pristine and synthetic pixels of an analyzed image, we propose a synthetic detector that operates on small patches extracted from the image. We aggregate patch contributions to estimate a tampering heatmap, highlighting synthetic pixels out of pristine ones. Our methodology proves effective when tested over two manipulated Western blot image datasets, one altered automatically and the other manually by exploiting advanced AI-based image manipulation tools that are unknown at our training stage. We also explore the robustness of our method over an external dataset of other scientific images depicting different semantics, manipulated through unseen generation techniques.

Title: 3D-VirtFusion: Synthetic 3D Data Augmentation through Generative Diffusion Models and Controllable Editing

Authors: Shichao Dong, Ze Yang, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13788
Pdf URL: https://arxiv.org/pdf/2408.13788
Copy Paste: [[2408.13788]] 3D-VirtFusion: Synthetic 3D Data Augmentation through Generative Diffusion Models and Controllable Editing(https://arxiv.org/abs/2408.13788)
Keywords: diffusion, foundation model, generative
Abstract: Data augmentation plays a crucial role in deep learning, enhancing the generalization and robustness of learning-based models. Standard approaches involve simple transformations like rotations and flips for generating extra data. However, these augmentations are limited by their initial dataset, lacking high-level diversity. Recently, large models such as language models and diffusion models have shown exceptional capabilities in perception and content generation. In this work, we propose a new paradigm to automatically generate 3D labeled training data by harnessing the power of pretrained large foundation models. For each target semantic class, we first generate 2D images of a single object in various structure and appearance via diffusion models and chatGPT generated text prompts. Beyond texture augmentation, we propose a method to automatically alter the shape of objects within 2D images. Subsequently, we transform these augmented images into 3D objects and construct virtual scenes by random composition. This method can automatically produce a substantial amount of 3D scene data without the need of real data, providing significant benefits in addressing few-shot learning challenges and mitigating long-tailed class imbalances. By providing a flexible augmentation approach, our work contributes to enhancing 3D data diversity and advancing model capabilities in scene understanding tasks.

Title: Bring the Power of Diffusion Model to Defect Detection

Authors: Xuyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13845
Pdf URL: https://arxiv.org/pdf/2408.13845
Copy Paste: [[2408.13845]] Bring the Power of Diffusion Model to Defect Detection(https://arxiv.org/abs/2408.13845)
Keywords: diffusion
Abstract: Due to the high complexity and technical requirements of industrial production processes, surface defects will inevitably appear, which seriously affects the quality of products. Although existing lightweight detection networks are highly efficient, they are susceptible to false or missed detection of non-salient defects due to the lack of semantic information. In contrast, the diffusion model can generate higher-order semantic representations in the denoising process. Therefore, the aim of this paper is to incorporate the higher-order modelling capability of the diffusion model into the detection model, so as to better assist in the classification and localization of difficult targets. First, the denoising diffusion probabilistic model (DDPM) is pre-trained to extract the features of denoising process to construct as a feature repository. In particular, to avoid the potential bottleneck of memory caused by the dataloader loading high-dimensional features, a residual convolutional variational auto-encoder (ResVAE) is designed to further compress the feature repository. The image is fed into both image backbone and feature repository for feature extraction and querying respectively. The queried latent features are reconstructed and filtered to obtain high-dimensional DDPM features. A dynamic cross-fusion method is proposed to fully refine the contextual features of DDPM to optimize the detection model. Finally, we employ knowledge distillation to migrate the higher-order modelling capabilities back into the lightweight baseline model without additional efficiency cost. Experiment results demonstrate that our method achieves competitive results on several industrial datasets.

Title: Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Authors: Minghao Liu, Le Zhang, Yingjie Tian, Xiaochao Qu, Luoqi Liu, Ting Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.13858
Pdf URL: https://arxiv.org/pdf/2408.13858
Copy Paste: [[2408.13858]] Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching(https://arxiv.org/abs/2408.13858)
Keywords: diffusion
Abstract: Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of `complex scene' itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artists painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.

Title: Particle-Filtering-based Latent Diffusion for Inverse Problems

Authors: Amir Nazemi, Mohammad Hadi Sepanj, Nicholas Pellegrino, Chris Czarnecki, Paul Fieguth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13868
Pdf URL: https://arxiv.org/pdf/2408.13868
Copy Paste: [[2408.13868]] Particle-Filtering-based Latent Diffusion for Inverse Problems(https://arxiv.org/abs/2408.13868)
Keywords: diffusion
Abstract: Current strategies for solving image-based inverse problems apply latent diffusion models to perform posterior sampling.However, almost all approaches make no explicit attempt to explore the solution space, instead drawing only a single sample from a Gaussian distribution from which to generate their solution. In this paper, we introduce a particle-filtering-based framework for a nonlinear exploration of the solution space in the initial stages of reverse SDE methods. Our proposed particle-filtering-based latent diffusion (PFLD) method and proposed problem formulation and framework can be applied to any diffusion-based solution for linear or nonlinear inverse problems. Our experimental results show that PFLD outperforms the SoTA solver PSLD on the FFHQ-1K and ImageNet-1K datasets on inverse problem tasks of super resolution, Gaussian debluring and inpainting.

Title: TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR Object Detection with Unsupervised Pre-Training

Authors: Li Li, Tanqiu Qiao, Hubert P. H. Shum, Toby P. Breckon
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2408.13902
Pdf URL: https://arxiv.org/pdf/2408.13902
Copy Paste: [[2408.13902]] TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR Object Detection with Unsupervised Pre-Training(https://arxiv.org/abs/2408.13902)
Keywords: self-supervised
Abstract: 3D point clouds are essential for perceiving outdoor scenes, especially within the realm of autonomous driving. Recent advances in 3D LiDAR Object Detection focus primarily on the spatial positioning and distribution of points to ensure accurate detection. However, despite their robust performance in variable conditions, these methods are hindered by their sole reliance on coordinates and point intensity, resulting in inadequate isometric invariance and suboptimal detection outcomes. To tackle this challenge, our work introduces Transformation-Invariant Local (TraIL) features and the associated TraIL-Det architecture. Our TraIL features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize the inherent isotropic radiation of LiDAR to enhance local representation, improve computational efficiency, and boost detection performance. To effectively process the geometric relations among points within each proposal, we propose a Multi-head self-Attention Encoder (MAE) with asymmetric geometric features to encode high-dimensional TraIL features into manageable representations. Our method outperforms contemporary self-supervised 3D object detection approaches in terms of mAP on KITTI (67.8, 20% label, moderate) and Waymo (68.9, 20% label, moderate) datasets under various label ratios (20%, 50%, and 100%).

Title: COMPOSE: Comprehensive Portrait Shadow Editing

Authors: Andrew Hou, Zhixin Shu, Xuaner Zhang, He Zhang, Yannick Hold-Geoffroy, Jae Shin Yoon, Xiaoming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.13922
Pdf URL: https://arxiv.org/pdf/2408.13922
Copy Paste: [[2408.13922]] COMPOSE: Comprehensive Portrait Shadow Editing(https://arxiv.org/abs/2408.13922)
Keywords: diffusion
Abstract: Existing portrait relighting methods struggle with precise control over facial shadows, particularly when faced with challenges such as handling hard shadows from directional light sources or adjusting shadows while remaining in harmony with existing lighting conditions. In many situations, completely altering input lighting is undesirable for portrait retouching applications: one may want to preserve some authenticity in the captured environment. Existing shadow editing methods typically restrict their application to just the facial region and often offer limited lighting control options, such as shadow softening or rotation. In this paper, we introduce COMPOSE: a novel shadow editing pipeline for human portraits, offering precise control over shadow attributes such as shape, intensity, and position, all while preserving the original environmental illumination of the portrait. This level of disentanglement and controllability is obtained thanks to a novel decomposition of the environment map representation into ambient light and an editable gaussian dominant light source. COMPOSE is a four-stage pipeline that consists of light estimation and editing, light diffusion, shadow synthesis, and finally shadow editing. We define facial shadows as the result of a dominant light source, encoded using our novel gaussian environment map representation. Utilizing an OLAT dataset, we have trained models to: (1) predict this light source representation from images, and (2) generate realistic shadows using this representation. We also demonstrate comprehensive and intuitive shadow editing with our pipeline. Through extensive quantitative and qualitative evaluations, we have demonstrated the robust capability of our system in shadow editing.

Title: Time Series Analysis for Education: Methods, Applications, and Future Directions

Authors: Shengzhong Mao, Chaoli Zhang, Yichi Song, Jindong Wang, Xiao-Jun Zeng, Zenglin Xu, Qingsong Wen
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2408.13960
Pdf URL: https://arxiv.org/pdf/2408.13960
Copy Paste: [[2408.13960]] Time Series Analysis for Education: Methods, Applications, and Future Directions(https://arxiv.org/abs/2408.13960)
Keywords: anomaly
Abstract: Recent advancements in the collection and analysis of sequential educational data have brought time series analysis to a pivotal position in educational research, highlighting its essential role in facilitating data-driven decision-making. However, there is a lack of comprehensive summaries that consolidate these advancements. To the best of our knowledge, this paper is the first to provide a comprehensive review of time series analysis techniques specifically within the educational context. We begin by exploring the landscape of educational data analytics, categorizing various data sources and types relevant to education. We then review four prominent time series methods-forecasting, classification, clustering, and anomaly detection-illustrating their specific application points in educational settings. Subsequently, we present a range of educational scenarios and applications, focusing on how these methods are employed to address diverse educational tasks, which highlights the practical integration of multiple time series methods to solve complex educational problems. Finally, we conclude with a discussion on future directions, including personalized learning analytics, multimodal data fusion, and the role of large language models (LLMs) in educational time series. The contributions of this paper include a detailed taxonomy of educational data, a synthesis of time series techniques with specific educational applications, and a forward-looking perspective on emerging trends and future research opportunities in educational analysis. The related papers and resources are available and regularly updated at the project page.

Title: Focused Large Language Models are Stable Many-Shot Learners

Authors: Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Heda Wang, Yao Hu, Kan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.13987
Pdf URL: https://arxiv.org/pdf/2408.13987
Copy Paste: [[2408.13987]] Focused Large Language Models are Stable Many-Shot Learners(https://arxiv.org/abs/2408.13987)
Keywords: in-context
Abstract: In-Context Learning (ICL) enables large language models (LLMs) to achieve rapid task adaptation by learning from demonstrations. With the increase in available context length of LLMs, recent experiments have shown that the performance of ICL does not necessarily scale well in many-shot (demonstration) settings. We theoretically and experimentally confirm that the reason lies in more demonstrations dispersing the model attention from the query, hindering its understanding of key content. Inspired by how humans learn from examples, we propose a training-free method FocusICL, which conducts triviality filtering to avoid attention being diverted by unimportant contents at token-level and operates hierarchical attention to further ensure sufficient attention towards current query at demonstration-level. We also design an efficient hyperparameter searching strategy for FocusICL based on model perplexity of demonstrations. Comprehensive experiments validate that FocusICL achieves an average performance improvement of 5.2% over vanilla ICL and scales well with many-shot demonstrations.

Title: Pixel-Aligned Multi-View Generation with Depth Guided Decoder

Authors: Zhenggang Tang, Peiye Zhuang, Chaoyang Wang, Aliaksandr Siarohin, Yash Kant, Alexander Schwing, Sergey Tulyakov, Hsin-Ying Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14016
Pdf URL: https://arxiv.org/pdf/2408.14016
Copy Paste: [[2408.14016]] Pixel-Aligned Multi-View Generation with Depth Guided Decoder(https://arxiv.org/abs/2408.14016)
Keywords: diffusion
Abstract: The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to multi-view version, which contains an VAE image encoder and a U-Net diffusion model. Specifically, these generation methods usually fix VAE and finetune the U-Net only. However, the significant downscaling of the latent vectors computed from the input images and independent decoding leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attn is challenging during inference as the ground-truth depth is usually difficult to obtain and pre-trained depth estimation models is hard to provide accurate depth. Thus, to enhance the generalization to inaccurate depth when ground truth depth is missing, we perturb depth inputs during training. During inference, we employ a rapid multi-view to 3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view to 3D reconstruction tasks.

Title: SurGen: Text-Guided Diffusion Model for Surgical Video Generation

Authors: Joseph Cho, Samuel Schmidgall, Cyril Zakka, Mrudang Mathur, Rohan Shad, William Hiesinger
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.14028
Pdf URL: https://arxiv.org/pdf/2408.14028
Copy Paste: [[2408.14028]] SurGen: Text-Guided Diffusion Model for Surgical Video Generation(https://arxiv.org/abs/2408.14028)
Keywords: diffusion
Abstract: Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.

Title: PAGE: Parametric Generative Explainer for Graph Neural Network

Authors: Yang Qiu, Wei Liu, Jun Wang, Ruixuan Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14042
Pdf URL: https://arxiv.org/pdf/2408.14042
Copy Paste: [[2408.14042]] PAGE: Parametric Generative Explainer for Graph Neural Network(https://arxiv.org/abs/2408.14042)
Keywords: generative
Abstract: This article introduces PAGE, a parameterized generative interpretive framework. PAGE is capable of providing faithful explanations for any graph neural network without necessitating prior knowledge or internal details. Specifically, we train the auto-encoder to generate explanatory substructures by designing appropriate training strategy. Due to the dimensionality reduction of features in the latent space of the auto-encoder, it becomes easier to extract causal features leading to the model's output, which can be easily employed to generate explanations. To accomplish this, we introduce an additional discriminator to capture the causality between latent causal features and the model's output. By designing appropriate optimization objectives, the well-trained discriminator can be employed to constrain the encoder in generating enhanced causal features. Finally, these features are mapped to substructures of the input graph through the decoder to serve as explanations. Compared to existing methods, PAGE operates at the sample scale rather than nodes or edges, eliminating the need for perturbation or encoding processes as seen in previous methods. Experimental results on both artificially synthesized and real-world datasets demonstrate that our approach not only exhibits the highest faithfulness and accuracy but also significantly outperforms baseline models in terms of efficiency.

Title: Beyond Detection: Leveraging Large Language Models for Cyber Attack Prediction in IoT Networks

Authors: Alaeddine Diaf, Abdelaziz Amara Korba, Nour Elislem Karabadji, Yacine Ghamri-Doudane
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14045
Pdf URL: https://arxiv.org/pdf/2408.14045
Copy Paste: [[2408.14045]] Beyond Detection: Leveraging Large Language Models for Cyber Attack Prediction in IoT Networks(https://arxiv.org/abs/2408.14045)
Keywords: generative
Abstract: In recent years, numerous large-scale cyberattacks have exploited Internet of Things (IoT) devices, a phenomenon that is expected to escalate with the continuing proliferation of IoT technology. Despite considerable efforts in attack detection, intrusion detection systems remain mostly reactive, responding to specific patterns or observed anomalies. This work proposes a proactive approach to anticipate and mitigate malicious activities before they cause damage. This paper proposes a novel network intrusion prediction framework that combines Large Language Models (LLMs) with Long Short Term Memory (LSTM) networks. The framework incorporates two LLMs in a feedback loop: a fine-tuned Generative Pre-trained Transformer (GPT) model for predicting network traffic and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) for evaluating the predicted traffic. The LSTM classifier model then identifies malicious packets among these predictions. Our framework, evaluated on the CICIoT2023 IoT attack dataset, demonstrates a significant improvement in predictive capabilities, achieving an overall accuracy of 98%, offering a robust solution to IoT cybersecurity challenges.

Title: ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation

Authors: Ruohua Shi, Qiufan Pang, Lei Ma, Lingyu Duan, Tiejun Huang, Tingting Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14114
Pdf URL: https://arxiv.org/pdf/2408.14114
Copy Paste: [[2408.14114]] ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation(https://arxiv.org/abs/2408.14114)
Keywords: foundation model
Abstract: Electron microscopy (EM) imaging offers unparalleled resolution for analyzing neural tissues, crucial for uncovering the intricacies of synaptic connections and neural processes fundamental to understanding behavioral mechanisms. Recently, the foundation models have demonstrated impressive performance across numerous natural and medical image segmentation tasks. However, applying these foundation models to EM segmentation faces significant challenges due to domain disparities. This paper presents ShapeMamba-EM, a specialized fine-tuning method for 3D EM segmentation, which employs adapters for long-range dependency modeling and an encoder for local shape description within the original foundation model. This approach effectively addresses the unique volumetric and morphological complexities of EM data. Tested over a wide range of EM images, covering five segmentation tasks and 10 datasets, ShapeMamba-EM outperforms existing methods, establishing a new standard in EM image segmentation and enhancing the understanding of neural tissue architecture.

Title: Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models

Authors: Chaohua Shi, Xuan Wang, Si Shi, Xule Wang, Mingrui Zhu, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14135
Pdf URL: https://arxiv.org/pdf/2408.14135
Copy Paste: [[2408.14135]] Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models(https://arxiv.org/abs/2408.14135)
Keywords: diffusion
Abstract: Food image composition requires the use of existing dish images and background images to synthesize a natural new image, while diffusion models have made significant advancements in image generation, enabling the construction of end-to-end architectures that yield promising results. However, existing diffusion models face challenges in processing and fusing information from multiple images and lack access to high-quality publicly available datasets, which prevents the application of diffusion models in food image composition. In this paper, we introduce a large-scale, high-quality food image composite dataset, FC22k, which comprises 22,000 foreground, background, and ground truth ternary image pairs. Additionally, we propose a novel food image composition method, Foodfusion, which leverages the capabilities of the pre-trained diffusion models and incorporates a Fusion Module for processing and integrating foreground and background information. This fused information aligns the foreground features with the background structure by merging the global structural information at the cross-attention layer of the denoising UNet. To further enhance the content and structure of the background, we also integrate a Content-Structure Control Module. Extensive experiments demonstrate the effectiveness and scalability of our proposed method.

Title: SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Authors: Trung Dao, Thuan Hoang Nguyen, Thanh Le, Duc Vu, Khoi Nguyen, Cuong Pham, Anh Tran
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14176
Pdf URL: https://arxiv.org/pdf/2408.14176
Copy Paste: [[2408.14176]] SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher(https://arxiv.org/abs/2408.14176)
Keywords: diffusion
Abstract: In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart. Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training. Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and multi-step Stable Diffusion models. The evaluation code is available at: this https URL.

Title: NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training

Authors: Albert Luginov, Muhammad Shahzad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14177
Pdf URL: https://arxiv.org/pdf/2408.14177
Copy Paste: [[2408.14177]] NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training(https://arxiv.org/abs/2408.14177)
Keywords: self-supervised
Abstract: We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low latency inference. The source code, model weights, and acknowledgments are available at this https URL .

Title: Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving

Authors: Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14197
Pdf URL: https://arxiv.org/pdf/2408.14197
Copy Paste: [[2408.14197]] Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving(https://arxiv.org/abs/2408.14197)
Keywords: generative
Abstract: World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.

Title: MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

Authors: Xu He, Xiaoyu Li, Di Kang, Jiangnan Ye, Chaopeng Zhang, Liyang Chen, Xiangjun Gao, Han Zhang, Zhiyong Wu, Haolin Zhuang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14211
Pdf URL: https://arxiv.org/pdf/2408.14211
Copy Paste: [[2408.14211]] MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement(https://arxiv.org/abs/2408.14211)
Keywords: diffusion, generative
Abstract: Existing works in single-image human reconstruction suffer from weak generalizability due to insufficient training data or 3D inconsistencies for a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image. As its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved 3D human reconstruction, we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views. Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Last but not least, to address ill-shaped issues arising from inaccurate SMPL-X estimation that conflicts with the reference image, we propose a novel iterative refinement strategy, which progressively optimizes SMPL-X accuracy while enhancing the quality and consistency of the generated multi-views. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent 3D human reconstruction tasks.

Title: TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation

Authors: Anh-Dzung Doan, Vu Minh Hieu Phan, Surabhi Gupta, Markus Wagner, Tat-Jun Chin, Ian Reid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14227
Pdf URL: https://arxiv.org/pdf/2408.14227
Copy Paste: [[2408.14227]] TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation(https://arxiv.org/abs/2408.14227)
Keywords: diffusion
Abstract: Infrared imaging offers resilience against changing lighting conditions by capturing object temperatures. Yet, in few scenarios, its lack of visual details compared to daytime visible images, poses a significant challenge for human and machine interpretation. This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-DPM), for infrared-to-visible video translation. Our method, extending the Patch Diffusion Model, consists of two key components. Firstly, we propose a semantic-guided denoising, leveraging the strong representations of foundational models. As such, our method faithfully preserves the semantic structure of generated visible images. Secondly, we propose a novel temporal blending module to guide the denoising trajectory, ensuring the temporal consistency between consecutive frames. Experiment shows that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection. Our code is publicly available at this https URL

Title: Text3DAug -- Prompted Instance Augmentation for LiDAR Perception

Authors: Laurenz Reichardt, Luca Uhr, Oliver Wasenmüller
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14253
Pdf URL: https://arxiv.org/pdf/2408.14253
Copy Paste: [[2408.14253]] Text3DAug -- Prompted Instance Augmentation for LiDAR Perception(https://arxiv.org/abs/2408.14253)
Keywords: generative
Abstract: LiDAR data of urban scenarios poses unique challenges, such as heterogeneous characteristics and inherent class imbalance. Therefore, large-scale datasets are necessary to apply deep learning methods. Instance augmentation has emerged as an efficient method to increase dataset diversity. However, current methods require the time-consuming curation of 3D models or costly manual data annotation. To overcome these limitations, we propose Text3DAug, a novel approach leveraging generative models for instance augmentation. Text3DAug does not depend on labeled data and is the first of its kind to generate instances and annotations from text. This allows for a fully automated pipeline, eliminating the need for manual effort in practical applications. Additionally, Text3DAug is sensor agnostic and can be applied regardless of the LiDAR sensor used. Comprehensive experimental analysis on LiDAR segmentation, detection and novel class discovery demonstrates that Text3DAug is effective in supplementing existing methods or as a standalone method, performing on par or better than established methods, however while overcoming their specific drawbacks. The code is publicly available.

Title: Self-supervised Speech Representations Still Struggle with African American Vernacular English

Authors: Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2408.14262
Pdf URL: https://arxiv.org/pdf/2408.14262
Copy Paste: [[2408.14262]] Self-supervised Speech Representations Still Struggle with African American Vernacular English(https://arxiv.org/abs/2408.14262)
Keywords: self-supervised
Abstract: Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. Additionally, the models have higher word error rates on utterances with more phonological and morphosyntactic features of AAVE. Despite the success of SSL speech models in improving ASR for low resource varieties, SSL pre-training alone may not bridge the gap between AAVE and MAE. Our code is publicly available at this https URL.

Title: Streamline tractography of the fetal brain in utero with machine learning

Authors: Weide Liu, Camilo Calixto, Simon K. Warfield, Davood Karimi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.14326
Pdf URL: https://arxiv.org/pdf/2408.14326
Copy Paste: [[2408.14326]] Streamline tractography of the fetal brain in utero with machine learning(https://arxiv.org/abs/2408.14326)
Keywords: diffusion
Abstract: Diffusion-weighted magnetic resonance imaging (dMRI) is the only non-invasive tool for studying white matter tracts and structural connectivity of the brain. These assessments rely heavily on tractography techniques, which reconstruct virtual streamlines representing white matter fibers. Much effort has been devoted to improving tractography methodology for adult brains, while tractography of the fetal brain has been largely neglected. Fetal tractography faces unique difficulties due to low dMRI signal quality, immature and rapidly developing brain structures, and paucity of reference data. This work presents the first machine learning model for fetal tractography. The model input consists of five sources of information: (1) Fiber orientation, inferred from a diffusion tensor fit to the dMRI signal; (2) Directions of recent propagation steps; (3) Global spatial information, encoded as distances to keypoints in the brain cortex; (4) Tissue segmentation information; and (5) Prior information about the expected local fiber orientations supplied with an atlas. In order to mitigate the local tensor estimation error, a large spatial context around the current point in the diffusion tensor image is encoded using convolutional and attention neural network modules. Moreover, the diffusion tensor information at a hypothetical next point is included in the model input. Filtering rules based on anatomically constrained tractography are applied to prune implausible streamlines. We trained the model on manually-refined whole-brain fetal tractograms and validated the trained model on an independent set of 11 test scans with gestational ages between 23 and 36 weeks. Results show that our proposed method achieves superior performance across all evaluated tracts. The new method can significantly advance the capabilities of dMRI for studying normal and abnormal brain development in utero.

Title: PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection Dataset

Authors: Ghazal Alinezhad Noghre, Shanle Yao, Armin Danesh Pazho, Babak Rahimi Ardabili, Vinit Katariya, Hamed Tabkhi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14329
Pdf URL: https://arxiv.org/pdf/2408.14329
Copy Paste: [[2408.14329]] PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection Dataset(https://arxiv.org/abs/2408.14329)
Keywords: anomaly
Abstract: PHEVA, a Privacy-preserving Human-centric Ethical Video Anomaly detection dataset. By removing pixel information and providing only de-identified human annotations, PHEVA safeguards personally identifiable information. The dataset includes seven indoor/outdoor scenes, featuring one novel, context-specific camera, and offers over 5x the pose-annotated frames compared to the largest previous dataset. This study benchmarks state-of-the-art methods on PHEVA using a comprehensive set of metrics, including the 10% Error Rate (10ER), a metric used for anomaly detection for the first time providing insights relevant to real-world deployment. As the first of its kind, PHEVA bridges the gap between conventional training and real-world deployment by introducing continual learning benchmarks, with models outperforming traditional methods in 82.14% of cases. The dataset is publicly available at this https URL.

Title: DuDoCROP: Dual-Domain CLIP-Assisted Residual Optimization Perception Model for CT Metal Artifact Reduction

Authors: Xinrui Zhang, Ailong Cai, Lei Li, Bin Yan
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2408.14342
Pdf URL: https://arxiv.org/pdf/2408.14342
Copy Paste: [[2408.14342]] DuDoCROP: Dual-Domain CLIP-Assisted Residual Optimization Perception Model for CT Metal Artifact Reduction(https://arxiv.org/abs/2408.14342)
Keywords: diffusion, generative
Abstract: Metal artifacts in computed tomography (CT) imaging pose significant challenges to accurate clinical diagnosis. The presence of high-density metallic implants results in artifacts that deteriorate image quality, manifesting in the forms of streaking, blurring, or beam hardening effects, etc. Nowadays, various deep learning-based approaches, particularly generative models, have been proposed for metal artifact reduction (MAR). However, these methods have limited perception ability in the diverse morphologies of different metal implants with artifacts, which may generate spurious anatomical structures and exhibit inferior generalization capability. To address the issues, we leverage visual-language model (VLM) to identify these morphological features and introduce them into a dual-domain CLIP-assisted residual optimization perception model (DuDoCROP) for MAR. Specifically, a dual-domain CLIP (DuDoCLIP) is fine-tuned on the image domain and sinogram domain using contrastive learning to extract semantic descriptions from anatomical structures and metal artifacts. Subsequently, a diffusion model is guided by the embeddings of DuDoCLIP, thereby enabling the dual-domain prior generation. Additionally, we design prompt engineering for more precise image-text descriptions that can enhance the model's perception capability. Then, a downstream task is devised for the one-step residual optimization and integration of dual-domain priors, while incorporating raw data fidelity. Ultimately, a new perceptual indicator is proposed to validate the model's perception and generation performance. With the assistance of DuDoCLIP, our DuDoCROP exhibits at least 63.7% higher generalization capability compared to the baseline model. Numerical experiments demonstrate that the proposed method can generate more realistic image structures and outperform other SOTA approaches both qualitatively and quantitatively.

Title: An Embedding is Worth a Thousand Noisy Labels

Authors: Francesco Di Salvo, Sebastian Doerrich, Ines Rieger, Christian Ledig
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2408.14358
Pdf URL: https://arxiv.org/pdf/2408.14358
Copy Paste: [[2408.14358]] An Embedding is Worth a Thousand Noisy Labels(https://arxiv.org/abs/2408.14358)
Keywords: self-supervised, foundation model
Abstract: The performance of deep neural networks scales with dataset size and label quality, rendering the efficient mitigation of low-quality data annotations crucial for building robust and cost-effective systems. Existing strategies to address label noise exhibit severe limitations due to computational complexity and application dependency. In this work, we propose WANN, a Weighted Adaptive Nearest Neighbor approach that builds on self-supervised feature representations obtained from foundation models. To guide the weighted voting scheme, we introduce a reliability score, which measures the likelihood of a data label being correct. WANN outperforms reference methods, including a linear layer trained with robust loss functions, on diverse datasets of varying size and under various noise types and severities. WANN also exhibits superior generalization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed k-NNs. Furthermore, the proposed weighting scheme enhances supervised dimensionality reduction under noisy labels. This yields a significant boost in classification performance with 10x and 100x smaller image embeddings, minimizing latency and storage requirements. Our approach, emphasizing efficiency and explainability, emerges as a simple, robust solution to overcome the inherent limitations of deep neural network training. The code is available at this https URL .

Title: Probing Causality Manipulation of Large Language Models

Authors: Chenyang Zhang, Haibo Tong, Bin Zhang, Dongyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14380
Pdf URL: https://arxiv.org/pdf/2408.14380
Copy Paste: [[2408.14380]] Probing Causality Manipulation of Large Language Models(https://arxiv.org/abs/2408.14380)
Keywords: in-context
Abstract: Large language models (LLMs) have shown various ability on natural language processing, including problems about causality. It is not intuitive for LLMs to command causality, since pretrained models usually work on statistical associations, and do not focus on causes and effects in sentences. So that probing internal manipulation of causality is necessary for LLMs. This paper proposes a novel approach to probe causality manipulation hierarchically, by providing different shortcuts to models and observe behaviors. We exploit retrieval augmented generation (RAG) and in-context learning (ICL) for models on a designed causality classification task. We conduct experiments on mainstream LLMs, including GPT-4 and some smaller and domain-specific models. Our results suggest that LLMs can detect entities related to causality and recognize direct causal relationships. However, LLMs lack specialized cognition for causality, merely treating them as part of the global semantic of the sentence.

Title: Hyperdimensional Computing Empowered Federated Foundation Model over Wireless Networks for Metaverse

Authors: Yahao Ding, Wen Shang, Minrui Xu, Zhaohui Yang, Ye Hu, Dusit Niyato, Mohammad Shikh-Bahaei
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2408.14416
Pdf URL: https://arxiv.org/pdf/2408.14416
Copy Paste: [[2408.14416]] Hyperdimensional Computing Empowered Federated Foundation Model over Wireless Networks for Metaverse(https://arxiv.org/abs/2408.14416)
Keywords: foundation model
Abstract: The Metaverse, a burgeoning collective virtual space merging augmented reality and persistent virtual worlds, necessitates advanced artificial intelligence (AI) and communication technologies to support immersive and interactive experiences. Federated learning (FL) has emerged as a promising technique for collaboratively training AI models while preserving data privacy. However, FL faces challenges such as high communication overhead and substantial computational demands, particularly for neural network (NN) models. To address these issues, we propose an integrated federated split learning and hyperdimensional computing (FSL-HDC) framework for emerging foundation models. This novel approach reduces communication costs, computation load, and privacy risks, making it particularly suitable for resource-constrained edge devices in the Metaverse, ensuring real-time responsive interactions. Additionally, we introduce an optimization algorithm that concurrently optimizes transmission power and bandwidth to minimize the maximum transmission time among all users to the server. The simulation results based on the MNIST dataset indicate that FSL-HDC achieves an accuracy rate of approximately 87.5%, which is slightly lower than that of FL-HDC. However, FSL-HDC exhibits a significantly faster convergence speed, approximately 3.733x that of FSL-NN, and demonstrates robustness to non-IID data distributions. Moreover, our proposed optimization algorithm can reduce the maximum transmission time by up to 64% compared with the baseline.

Title: MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

Authors: Kuluhan Binici, Abhinav Ramesh Kashyap, Viktor Schlegel, Andy T. Liu, Vijay Prakash Dwivedi, Thanh-Tung Nguyen, Xiaoxue Gao, Nancy F. Chen, Stefan Winkler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14418
Pdf URL: https://arxiv.org/pdf/2408.14418
Copy Paste: [[2408.14418]] MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues(https://arxiv.org/abs/2408.14418)
Keywords: in-context
Abstract: Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.

Title: Evaluating saliency scores in point clouds of natural environments by learning surface anomalies

Authors: Reuma Arav, Dennis Wittich, Franz Rottensteiner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.14421
Pdf URL: https://arxiv.org/pdf/2408.14421
Copy Paste: [[2408.14421]] Evaluating saliency scores in point clouds of natural environments by learning surface anomalies(https://arxiv.org/abs/2408.14421)
Keywords: anomaly
Abstract: In recent years, three-dimensional point clouds are used increasingly to document natural environments. Each dataset contains a diverse set of objects, at varying shapes and sizes, distributed throughout the data and intricately intertwined with the topography. Therefore, regions of interest are difficult to find and consequent analyses become a challenge. Inspired from visual perception principles, we propose to differentiate objects of interest from the cluttered environment by evaluating how much they stand out from their surroundings, i.e., their geometric salience. Previous saliency detection approaches suggested mostly handcrafted attributes for the task. However, such methods fail when the data are too noisy or have high levels of texture. Here we propose a learning-based mechanism that accommodates noise and textured surfaces. We assume that within the natural environment any change from the prevalent surface would suggest a salient object. Thus, we first learn the underlying surface and then search for anomalies within it. Initially, a deep neural network is trained to reconstruct the surface. Regions where the reconstructed part deviates significantly from the original point cloud yield a substantial reconstruction error, signifying an anomaly, i.e., saliency. We demonstrate the effectiveness of the proposed approach by searching for salient features in various natural scenarios, which were acquired by different acquisition platforms. We show the strong correlation between the reconstruction error and salient objects.

Title: Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models

Authors: Aradhye Agarwal, Suhas K Ramesh, Ayan Sengupta, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.14470
Pdf URL: https://arxiv.org/pdf/2408.14470
Copy Paste: [[2408.14470]] Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models(https://arxiv.org/abs/2408.14470)
Keywords: generative
Abstract: Fine-tuning large language models (LLMs) on downstream tasks requires substantial computational resources. A class of parameter-efficient fine-tuning (PEFT) aims to mitigate these computational challenges by selectively fine-tuning only a small fraction of the model parameters. Although computationally efficient, these techniques often fail to match the performance of fully fine-tuned models, primarily due to inherent biases introduced during parameter selection. Traditional selective PEFT techniques use a fixed set of parameters based on a predefined budget (a process also known as unmasking), failing to capture parameter importance dynamically and often ending up exceeding the budget. We introduce $\text{ID}^3$, a novel selective PEFT method that calculates parameter importance continually and dynamically unmasks parameters by balancing exploration and exploitation in parameter selection. Our empirical study on 15 tasks spanning natural language understanding and generative tasks demonstrates the effectiveness of our method compared to fixed-masking-based PEFT techniques. We analytically show that $\text{ID}^3$ reduces the number of gradient updates by a factor of two, enhancing computational efficiency. $\text{ID}^3$ is robust to random initialization of neurons and, therefore, can be seamlessly integrated into existing additive and reparametrization-based PEFT modules such as adapters and LoRA for dynamic sparsification.

Title: A Practitioner's Guide to Continual Multimodal Pretraining

Authors: Karsten Roth, Vishaal Udandarao, Sebastian Dziadzio, Ameya Prabhu, Mehdi Cherti, Oriol Vinyals, Olivier Hénaff, Samuel Albanie, Matthias Bethge, Zeynep Akata
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.14471
Pdf URL: https://arxiv.org/pdf/2408.14471
Copy Paste: [[2408.14471]] A Practitioner's Guide to Continual Multimodal Pretraining(https://arxiv.org/abs/2408.14471)
Keywords: foundation model
Abstract: Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts -- spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner's guide to continual multimodal pretraining for real-world deployment. Our benchmark and code is here: this https URL.