2024-08-28

Title: A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models

Authors: Dibaloke Chanda, Milan Aryal, Nasim Yahya Soltani, Masoud Ganji
Subjects: cs.LG, cs.AI, cs.CL, eess.IV
Abstract URL: https://arxiv.org/abs/2408.14496
Pdf URL: https://arxiv.org/pdf/2408.14496
Copy Paste: [[2408.14496]] A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models(https://arxiv.org/abs/2408.14496)
Keywords: foundation model
Abstract: Recent advances in deep learning have completely transformed the domain of computational pathology (CPath), which in turn altered the diagnostic workflow of pathologists by integrating foundation models (FMs) and vision-language models (VLMs) in their assessment and decision-making process. FMs overcome the limitations of existing deep learning approaches in CPath by learning a representation space that can be adapted to a wide variety of downstream tasks without explicit supervision. VLMs allow pathology reports written in natural language to be used as a rich semantic information source to improve existing models as well as generate predictions in natural language form. In this survey, a holistic and systematic overview of recent innovations in FMs and VLMs in CPath is presented. Furthermore, the tools, datasets and training schemes for these models are summarized in addition to categorizing them into distinct groups. This extensive survey highlights the current trends in CPath and the way it is going to be transformed through FMs and VLMs in the future.

Title: SHEDAD: SNN-Enhanced District Heating Anomaly Detection for Urban Substations

Authors: Jonne van Dreven, Abbas Cheddad, Sadi Alawadi, Ahmad Nauman Ghazi, Jad Al Koussa, Dirk Vanhoudt
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14499
Pdf URL: https://arxiv.org/pdf/2408.14499
Copy Paste: [[2408.14499]] SHEDAD: SNN-Enhanced District Heating Anomaly Detection for Urban Substations(https://arxiv.org/abs/2408.14499)
Keywords: anomaly
Abstract: District Heating (DH) systems are essential for energy-efficient urban heating. However, despite the advancements in automated fault detection and diagnosis (FDD), DH still faces challenges in operational faults that impact efficiency. This study introduces the Shared Nearest Neighbor Enhanced District Heating Anomaly Detection (SHEDAD) approach, designed to approximate the DH network topology and allow for local anomaly detection without disclosing sensitive information, such as substation locations. The approach leverages a multi-adaptive k-Nearest Neighbor (k-NN) graph to improve the initial neighborhood creation. Moreover, it introduces a merging technique that reduces noise and eliminates trivial edges. We use the Median Absolute Deviation (MAD) and modified z-scores to flag anomalous substations. The results reveal that SHEDAD outperforms traditional clustering methods, achieving significantly lower intra-cluster variance and distance. Additionally, SHEDAD effectively isolates and identifies two distinct categories of anomalies: supply temperatures and substation performance. We identified 30 anomalous substations and reached a sensitivity of approximately 65\% and specificity of approximately 97\%. By focusing on this subset of poor-performing substations in the network, SHEDAD enables more targeted and effective maintenance interventions, which can reduce energy usage while optimizing network performance.

Title: Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming

Authors: Hao Wang, Jindong Han, Wei Fan, Hao Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2408.14505
Pdf URL: https://arxiv.org/pdf/2408.14505
Copy Paste: [[2408.14505]] Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming(https://arxiv.org/abs/2408.14505)
Keywords: diffusion
Abstract: Spatio-temporal time series forecasting plays a critical role in various real-world applications, such as transportation optimization, energy management, and climate analysis. The recent advancements in Pre-trained Language Models (PLMs) have inspired efforts to reprogram these models for time series forecasting tasks, by leveraging their superior reasoning and generalization capabilities. However, existing approaches fall short in handling complex spatial inter-series dependencies and intrinsic intra-series frequency components, limiting their spatio-temporal forecasting performance. Moreover, the linear mapping of continuous time series to a compressed subset vocabulary in reprogramming constrains the spatio-temporal semantic expressivity of PLMs and may lead to potential information bottleneck. To overcome the above limitations, we propose \textsc{RePST}, a tailored PLM reprogramming framework for spatio-temporal forecasting. The key insight of \textsc{RePST} is to decouple the spatio-temporal dynamics in the frequency domain, allowing better alignment with the PLM text space. Specifically, we first decouple spatio-temporal data in Fourier space and devise a structural diffusion operator to obtain temporal intrinsic and spatial diffusion signals, making the dynamics more comprehensible and predictable for PLMs. To avoid information bottleneck from a limited vocabulary, we further propose a discrete reprogramming strategy that selects relevant discrete textual information from an expanded vocabulary space in a differentiable manner. Extensive experiments on four real-world datasets show that our proposed approach significantly outperforms state-of-the-art spatio-temporal forecasting models, particularly in data-scarce scenarios.

Title: LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings

Authors: Duo Wang, Yuan Zuo, Fengzhi Li, Junjie Wu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2408.14512
Pdf URL: https://arxiv.org/pdf/2408.14512
Copy Paste: [[2408.14512]] LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings(https://arxiv.org/abs/2408.14512)
Keywords: self-supervised
Abstract: Zero-shot graph machine learning, especially with graph neural networks (GNNs), has garnered significant interest due to the challenge of scarce labeled data. While methods like self-supervised learning and graph prompt learning have been extensively explored, they often rely on fine-tuning with task-specific labels, limiting their effectiveness in zero-shot scenarios. Inspired by the zero-shot capabilities of instruction-fine-tuned large language models (LLMs), we introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and cross-task zero-shot learners for graph machine learning. Concretely, we pretrain a GNN, aligning its representations with token embeddings of an LLM. We then train a linear projector that transforms the GNN's representations into a fixed number of graph token embeddings without tuning the LLM. A unified instruction is designed for various graph tasks at different levels, such as node classification (node-level) and link prediction (edge-level). These design choices collectively enhance our method's effectiveness in zero-shot learning, setting it apart from existing methods. Experiments show that our graph token embeddings help the LLM predictor achieve state-of-the-art performance on unseen datasets and tasks compared to other methods using LLMs as predictors.

Title: Variational autoencoder-based neural network model compression

Authors: Liang Cheng, Peiyuan Guan, Amir Taherkordi, Lei Liu, Dapeng Lan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14513
Pdf URL: https://arxiv.org/pdf/2408.14513
Copy Paste: [[2408.14513]] Variational autoencoder-based neural network model compression(https://arxiv.org/abs/2408.14513)
Keywords: generative, anomaly
Abstract: Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years, and shown great great peformance in a number of different domains, including image generation and anomaly detection, etc.. This paper aims to explore neural network model compression method based on VAE. The experiment uses different neural network models for MNIST recognition as compression targets, including Feedforward Neural Network (FNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). These models are the most basic models in deep learning, and other more complex and advanced models are based on them or inherit their features and evolve. In the experiment, the first step is to train the models mentioned above, each trained model will have different accuracy and number of total parameters. And then the variants of parameters for each model are processed as training data in VAEs separately, and the trained VAEs are tested by the true model parameters. The experimental results show that using the latent space as a representation of the model compression can improve the compression rate compared to some traditional methods such as pruning and quantization, meanwhile the accuracy is not greatly affected using the model parameters reconstructed based on the latent space. In the future, a variety of different large-scale deep learning models will be used more widely, so exploring different ways to save time and space on saving or transferring models will become necessary, and the use of VAE in this paper can provide a basis for these further explorations.

Title: Retrieval Augmented Generation for Dynamic Graph Modeling

Authors: Yuxia Wu, Yuan Fang, Lizi Liao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14523
Pdf URL: https://arxiv.org/pdf/2408.14523
Copy Paste: [[2408.14523]] Retrieval Augmented Generation for Dynamic Graph Modeling(https://arxiv.org/abs/2408.14523)
Keywords: generative
Abstract: Dynamic graph modeling is crucial for analyzing evolving patterns in various applications. Existing approaches often integrate graph neural networks with temporal modules or redefine dynamic graph modeling as a generative sequence task. However, these methods typically rely on isolated historical contexts of the target nodes from a narrow perspective, neglecting occurrences of similar patterns or relevant cases associated with other nodes. In this work, we introduce the Retrieval-Augmented Generation for Dynamic Graph Modeling (RAG4DyG) framework, which leverages guidance from contextually and temporally analogous examples to broaden the perspective of each node. This approach presents two critical challenges: (1) How to identify and retrieve high-quality demonstrations that are contextually and temporally analogous to dynamic graph samples? (2) How can these demonstrations be effectively integrated to improve dynamic graph modeling? To address these challenges, we propose RAG4DyG, which enriches the understanding of historical contexts by retrieving and learning from contextually and temporally pertinent demonstrations. Specifically, we employ a time- and context-aware contrastive learning module to identify and retrieve relevant cases for each query sequence. Moreover, we design a graph fusion strategy to integrate the retrieved cases, thereby augmenting the inherent historical contexts for improved prediction. Extensive experiments on real-world datasets across different domains demonstrate the effectiveness of RAG4DyG for dynamic graph modeling.

Title: DIAGen: Diverse Image Augmentation with Generative Models

Authors: Tobias Lingenberg, Markus Reuter, Gopika Sudhakaran, Dominik Gojny, Stefan Roth, Simone Schaub-Meyer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14584
Pdf URL: https://arxiv.org/pdf/2408.14584
Copy Paste: [[2408.14584]] DIAGen: Diverse Image Augmentation with Generative Models(https://arxiv.org/abs/2408.14584)
Keywords: diffusion, generative
Abstract: Simple data augmentation techniques, such as rotations and flips, are widely used to enhance the generalization power of computer vision models. However, these techniques often fail to modify high-level semantic attributes of a class. To address this limitation, researchers have explored generative augmentation methods like the recently proposed DA-Fusion. Despite some progress, the variations are still largely limited to textural changes, thus falling short on aspects like varied viewpoints, environment, weather conditions, or even class-level semantic attributes (eg, variations in a dog's breed). To overcome this challenge, we propose DIAGen, building upon DA-Fusion. First, we apply Gaussian noise to the embeddings of an object learned with Textual Inversion to diversify generations using a pre-trained diffusion model's knowledge. Second, we exploit the general knowledge of a text-to-text generative model to guide the image generation of the diffusion model with varied class-specific prompts. Finally, we introduce a weighting mechanism to mitigate the impact of poorly generated samples. Experimental results across various datasets show that DIAGen not only enhances semantic diversity but also improves the performance of subsequent classifiers. The advantages of DIAGen over standard augmentations and the DA-Fusion baseline are particularly pronounced with out-of-distribution samples.

Title: Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities

Authors: Yidi Li, Yihan Li, Yixin Guo, Bin Ren, Zhenhuan Xu, Hao Guo, Hong Liu, Nicu Sebe
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2408.14585
Pdf URL: https://arxiv.org/pdf/2408.14585
Copy Paste: [[2408.14585]] Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities(https://arxiv.org/abs/2408.14585)
Keywords: generative
Abstract: In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. Especially when there is missing data in multiple modalities, the performance of existing multi-modal fusion methods tends to decrease. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio channels. By transferring knowledge from teacher to student, the student network can better adapt to complex dynamic scenes with incomplete observations. In the student network, a global feature reconstruction module based on the generative adversarial network is constructed to reconstruct global features from feature embedding with missing local information. Furthermore, a multi-modal multi-level fusion attention is introduced to integrate the incomplete feature and the reconstructed feature, leveraging the complementarity and consistency of audio-visual and global-local features. Experimental results on the AV16.3 dataset demonstrate that the proposed GLDTracker outperforms existing state-of-the-art audio-visual trackers and achieves leading performance on both standard and incomplete modalities datasets, highlighting its superiority and robustness in complex conditions. The code and models will be available.

Title: Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models

Authors: Ian Stewart, Sameera Horawalavithana, Brendan Kennedy, Sai Munikoti, Karl Pazdernik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.14595
Pdf URL: https://arxiv.org/pdf/2408.14595
Copy Paste: [[2408.14595]] Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models(https://arxiv.org/abs/2408.14595)
Keywords: foundation model
Abstract: Multimodal foundation models (MFMs) such as OFASys show the potential to unlock analysis of complex data such as images, videos, and audio data via text prompts alone. However, their performance may suffer in the face of text input that differs even slightly from their training distribution, which is surprising considering the use of modality-specific data to "ground" the text input. This study demonstrates that prompt instability is a major concern for MFMs, leading to a consistent drop in performance across all modalities, but that instability can be mitigated with additional training with augmented data. We evaluate several methods for grounded prompt perturbation, where we generate perturbations and filter based on similarity to text and/or modality data. After re-training the models on the augmented data, we find improved accuracy and more stable performance on the perturbed test data regardless of perturbation condition, suggesting that the data augmentation strategy helps the models handle domain shifts more effectively. In error analysis, we find consistent patterns of performance improvement across domains, suggesting that retraining on prompt perturbations tends to help general reasoning capabilities in MFMs.

Title: What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

Authors: Dingyi Yang, Qin Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.14622
Pdf URL: https://arxiv.org/pdf/2408.14622
Copy Paste: [[2408.14622]] What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation(https://arxiv.org/abs/2408.14622)
Keywords: generative
Abstract: With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatic-generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing the aspects of fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, interestingness, etc. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their evaluation challenges, identify various human criteria to measure stories, and present existing benchmark datasets. Then, we propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation. We also provide descriptions of these metrics, along with the discussion of their merits and limitations. Later, we discuss the human-AI collaboration for story evaluation and generation. Finally, we suggest potential future research directions, extending from story evaluation to general evaluations.

Title: OctFusion: Octree-based Diffusion Models for 3D Shape Generation

Authors: Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, Peng-Shuai Wang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2408.14732
Pdf URL: https://arxiv.org/pdf/2408.14732
Copy Paste: [[2408.14732]] OctFusion: Octree-based Diffusion Models for 3D Shape Generation(https://arxiv.org/abs/2408.14732)
Keywords: diffusion
Abstract: Diffusion models have emerged as a popular method for 3D generation. However, it is still challenging for diffusion models to efficiently generate diverse and high-quality 3D shapes. In this paper, we introduce OctFusion, which can generate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia 4090 GPU, and the extracted meshes are guaranteed to be continuous and manifold. The key components of OctFusion are the octree-based latent representation and the accompanying diffusion models. The representation combines the benefits of both implicit neural representations and explicit spatial octrees and is learned with an octree-based variational autoencoder. The proposed diffusion model is a unified multi-scale U-Net that enables weights and computation sharing across different octree levels and avoids the complexity of widely used cascaded diffusion schemes. We verify the effectiveness of OctFusion on the ShapeNet and Objaverse datasets and achieve state-of-the-art performances on shape generation tasks. We demonstrate that OctFusion is extendable and flexible by generating high-quality color fields for textured mesh generation and high-quality 3D shapes conditioned on text prompts, sketches, or category labels. Our code and pre-trained models are available at \url{this https URL}.

Title: Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation

Authors: Bochao Liu, Pengju Wang, Shiming Ge
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2408.14738
Pdf URL: https://arxiv.org/pdf/2408.14738
Copy Paste: [[2408.14738]] Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation(https://arxiv.org/abs/2408.14738)
Keywords: diffusion, generative
Abstract: While the success of deep learning relies on large amounts of training datasets, data is often limited in privacy-sensitive domains. To address this challenge, generative model learning with differential privacy has emerged as a solution to train private generative models for desensitized data generation. However, the quality of the images generated by existing methods is limited due to the complexity of modeling data distribution. We build on the success of diffusion models and introduce DP-SAD, which trains a private diffusion model by a stochastic adversarial distillation method. Specifically, we first train a diffusion model as a teacher and then train a student by distillation, in which we achieve differential privacy by adding noise to the gradients from other models to the student. For better generation quality, we introduce a discriminator to distinguish whether an image is from the teacher or the student, which forms the adversarial training. Extensive experiments and analysis clearly demonstrate the effectiveness of our proposed method.

Title: Personalized Video Summarization using Text-Based Queries and Conditional Modeling

Authors: Jia-Hong Huang
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2408.14743
Pdf URL: https://arxiv.org/pdf/2408.14743
Copy Paste: [[2408.14743]] Personalized Video Summarization using Text-Based Queries and Conditional Modeling(https://arxiv.org/abs/2408.14743)
Keywords: self-supervised
Abstract: The proliferation of video content on platforms like YouTube and Vimeo presents significant challenges in efficiently locating relevant information. Automatic video summarization aims to address this by extracting and presenting key content in a condensed form. This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling to tailor summaries to user needs. Traditional methods often produce fixed summaries that may not align with individual requirements. To overcome this, we propose a multi-modal deep learning approach that incorporates both textual queries and visual information, fusing them at different levels of the model architecture. Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries. The thesis also investigates improving text-based query representations using contextualized word embeddings and specialized attention networks. This enhances the semantic understanding of queries, leading to better video summaries. To emulate human-like summarization, which accounts for both visual coherence and abstract factors like storyline consistency, we introduce a conditional modeling approach. This method uses multiple random variables and joint distributions to capture key summarization components, resulting in more human-like and explainable summaries. Addressing data scarcity in fully supervised learning, the thesis proposes a segment-level pseudo-labeling approach. This self-supervised method generates additional data, improving model performance even with limited human-labeled datasets. In summary, this research aims to enhance automatic video summarization by incorporating text-based queries, improving query representations, introducing conditional modeling, and addressing data scarcity, thereby creating more effective and personalized video summaries.

Title: Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation Models

Authors: Nobuo Namura, Yuma Ichikawa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.14756
Pdf URL: https://arxiv.org/pdf/2408.14756
Copy Paste: [[2408.14756]] Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation Models(https://arxiv.org/abs/2408.14756)
Keywords: foundation model, anomaly
Abstract: Recent advancements in time-series anomaly detection have relied on deep learning models to handle the diverse behaviors of time-series data. However, these models often suffer from unstable training and require extensive hyperparameter tuning, leading to practical limitations. Although foundation models present a potential solution, their use in time series is limited. To overcome these issues, we propose an innovative image-based, training-free time-series anomaly detection (ITF-TAD) approach. ITF-TAD converts time-series data into images using wavelet transform and compresses them into a single representation, leveraging image foundation models for anomaly detection. This approach achieves high-performance anomaly detection without unstable neural network training or hyperparameter tuning. Furthermore, ITF-TAD identifies anomalies across different frequencies, providing users with a detailed visualization of anomalies and their corresponding frequencies. Comprehensive experiments on five benchmark datasets, including univariate and multivariate time series, demonstrate that ITF-TAD offers a practical and effective solution with performance exceeding or comparable to that of deep models.

Title: Channel-wise Influence: Estimating Data Influence for Multivariate Time Series

Authors: Muyao Wang, Zeke Xie, Bo Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.14763
Pdf URL: https://arxiv.org/pdf/2408.14763
Copy Paste: [[2408.14763]] Channel-wise Influence: Estimating Data Influence for Multivariate Time Series(https://arxiv.org/abs/2408.14763)
Keywords: anomaly
Abstract: The influence function, a technique from robust statistics, measures the impact on model parameters or related functions when training data is removed or modified. This effective and valuable post-hoc method allows for studying the interpretability of machine learning models without requiring costly model retraining. It would provide extensions like increasing model performance, improving model generalization, and offering interpretability. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. However, there is no preceding research on the influence functions of MTS to shed light on the effects of modifying the channel of training MTS. Given that each channel in an MTS plays a crucial role in its analysis, it is essential to characterize the influence of different channels. To fill this gap, we propose a channel-wise influence function, which is the first method that can estimate the influence of different channels in MTS, utilizing a first-order gradient approximation that leverages the more informative average gradient of the data set. Additionally, we demonstrate how this influence function can be used to estimate the impact of a channel in MTS. Finally, we validated the accuracy and effectiveness of our influence estimation function in critical MTS analysis tasks, such as MTS anomaly detection and MTS forecasting. According to abundant experiments on real-world dataset, the original influence function performs worse than our method and even fail for the channel pruning problem, which demonstrate the superiority and necessity of channel-wise influence function in MTS analysis tasks.

Title: CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Authors: Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14765
Pdf URL: https://arxiv.org/pdf/2408.14765
Copy Paste: [[2408.14765]] CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis(https://arxiv.org/abs/2408.14765)
Keywords: diffusion
Abstract: Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibit remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates the above controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms current state-of-the-art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at this https URL.

Title: Text-guided Foundation Model Adaptation for Long-Tailed Medical Image Classification

Authors: Sirui Li, Li Lin, Yijin Huang, Pujin Cheng, Xiaoying Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14770
Pdf URL: https://arxiv.org/pdf/2408.14770
Copy Paste: [[2408.14770]] Text-guided Foundation Model Adaptation for Long-Tailed Medical Image Classification(https://arxiv.org/abs/2408.14770)
Keywords: foundation model
Abstract: In medical contexts, the imbalanced data distribution in long-tailed datasets, due to scarce labels for rare diseases, greatly impairs the diagnostic accuracy of deep learning models. Recent multimodal text-image supervised foundation models offer new solutions to data scarcity through effective representation learning. However, their limited medical-specific pretraining hinders their performance in medical image classification relative to natural images. To address this issue, we propose a novel Text-guided Foundation model Adaptation for Long-Tailed medical image classification (TFA-LT). We adopt a two-stage training strategy, integrating representations from the foundation model using just two linear adapters and a single ensembler for balanced outcomes. Experimental results on two long-tailed medical image datasets validate the simplicity, lightweight and efficiency of our approach: requiring only 6.1% GPU memory usage of the current best-performing algorithm, our method achieves an accuracy improvement of up to 27.1%, highlighting the substantial potential of foundation model adaptation in this area.

Title: Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View

Authors: Mingyu Sheng, Jianan Fan, Dongnan Liu, Ron Kikinis, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14789
Pdf URL: https://arxiv.org/pdf/2408.14789
Copy Paste: [[2408.14789]] Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View(https://arxiv.org/abs/2408.14789)
Keywords: self-supervised
Abstract: Surgical instrument segmentation (SIS) on endoscopic images stands as a long-standing and essential task in the context of computer-assisted interventions for boosting minimally invasive surgery. Given the recent surge of deep learning methodologies and their data-hungry nature, training a neural predictive model based on massive expert-curated annotations has been dominating and served as an off-the-shelf approach in the field, which could, however, impose prohibitive burden to clinicians for preparing fine-grained pixel-wise labels corresponding to the collected surgical video frames. In this work, we propose an unsupervised method by reframing the video frame segmentation as a graph partitioning problem and regarding image pixels as graph nodes, which is significantly different from the previous efforts. A self-supervised pre-trained model is firstly leveraged as a feature extractor to capture high-level semantic features. Then, Laplacian matrixs are computed from the features and are eigendecomposed for graph partitioning. On the "deep" eigenvectors, a surgical video frame is meaningfully segmented into different modules such as tools and tissues, providing distinguishable semantic information like locations, classes, and relations. The segmentation problem can then be naturally tackled by applying clustering or threshold on the eigenvectors. Extensive experiments are conducted on various datasets (e.g., EndoVis2017, EndoVis2018, UCL, etc.) for different clinical endpoints. Across all the challenging scenarios, our method demonstrates outstanding performance and robustness higher than unsupervised state-of-the-art (SOTA) methods. The code is released at this https URL.

Title: GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network for Multimodal Sentiment Analysis

Authors: Yijie Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.14809
Pdf URL: https://arxiv.org/pdf/2408.14809
Copy Paste: [[2408.14809]] GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer Based Fusion Network for Multimodal Sentiment Analysis(https://arxiv.org/abs/2408.14809)
Keywords: self-supervised
Abstract: Multimodal Sentiment Analysis (MSA) leverages multiple modals to analyze sentiments. Typically, advanced fusion methods and representation learning-based methods are designed to tackle it. Our proposed GSIFN solves two key problems to be solved in MSA: (i) In multimodal fusion, the decoupling of modal combinations and tremendous parameter redundancy in existing fusion methods, which lead to poor fusion performance and efficiency. (ii) The trade-off between representation capability and computation overhead of the unimodal feature extractors and enhancers. GSIFN incorporates two main components to solve these problems: (i) Graph-Structured and Interlaced-Masked Multimodal Transformer. It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computation overhead. (ii) A self-supervised learning framework with low computation overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal feature for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computation overhead compared with state-of-the-art methods.

Title: HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling

Authors: Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14812
Pdf URL: https://arxiv.org/pdf/2408.14812
Copy Paste: [[2408.14812]] HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling(https://arxiv.org/abs/2408.14812)
Keywords: foundation model
Abstract: Prompt learning has become a prevalent strategy for adapting vision-language foundation models (VLMs) such as CLIP to downstream tasks. With the emergence of large language models (LLMs), recent studies have explored the potential of using category-related descriptions to enhance prompt effectiveness. However, conventional descriptions lack explicit structured information necessary to represent the interconnections among key elements like entities or attributes with relation to a particular category. Since existing prompt tuning methods give little consideration to managing structured knowledge, this paper advocates leveraging LLMs to construct a graph for each description to prioritize such structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Finally, by enhancing multi-granularity knowledge generation, redesigning the relationship-driven attention re-weighting module, and incorporating consistent constraints on the hierarchical text encoder, we propose HPT++, which further improves the performance of HPT. Our experiments are conducted across a wide range of evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization. Extensive results and ablation studies demonstrate the effectiveness of our methods, which consistently outperform existing SOTA methods.

Title: Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Authors: Abdelrahman Eldesokey, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14819
Pdf URL: https://arxiv.org/pdf/2408.14819
Copy Paste: [[2408.14819]] Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation(https://arxiv.org/abs/2408.14819)
Keywords: diffusion
Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms other methods in comparison in preserving objects under layout changes. Project Page: \url{this https URL}

Title: Data-driven Effective Modeling of Multiscale Stochastic Dynamical Systems

Authors: Yuan Chen, Dongbin Xiu
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2408.14821
Pdf URL: https://arxiv.org/pdf/2408.14821
Copy Paste: [[2408.14821]] Data-driven Effective Modeling of Multiscale Stochastic Dynamical Systems(https://arxiv.org/abs/2408.14821)
Keywords: generative
Abstract: We present a numerical method for learning the dynamics of slow components of unknown multiscale stochastic dynamical systems. While the governing equations of the systems are unknown, bursts of observation data of the slow variables are available. By utilizing the observation data, our proposed method is capable of constructing a generative stochastic model that can accurately capture the effective dynamics of the slow variables in distribution. We present a comprehensive set of numerical examples to demonstrate the performance of the proposed method.

Title: Alfie: Democratising RGBA Image Generation With No $$$

Authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2408.14826
Pdf URL: https://arxiv.org/pdf/2408.14826
Copy Paste: [[2408.14826]] Alfie: Democratising RGBA Image Generation With No $$$(https://arxiv.org/abs/2408.14826)
Keywords: diffusion
Abstract: Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at this https URL.

Title: DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing

Authors: Xueying Gu, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Khaled B. Letaief
Subjects: cs.LG, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2408.14831
Pdf URL: https://arxiv.org/pdf/2408.14831
Copy Paste: [[2408.14831]] DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing(https://arxiv.org/abs/2408.14831)
Keywords: self-supervised
Abstract: Intelligent Transportation Systems (ITS) leverage Integrated Sensing and Communications (ISAC) to enhance data exchange between vehicles and infrastructure in the Internet of Vehicles (IoV). This integration inevitably increases computing demands, risking real-time system stability. Vehicle Edge Computing (VEC) addresses this by offloading tasks to Road Side Unit (RSU), ensuring timely services. Our previous work FLSimCo algorithm, which uses local resources for Federated Self-Supervised Learning (SSL), though vehicles often can't complete all iterations task. Our improved algorithm offloads partial task to RSU and optimizes energy consumption by adjusting transmission power, CPU frequency, and task assignment ratios, balancing local and RSU-based training. Meanwhile, setting an offloading threshold further prevents inefficiencies. Simulation results show that the enhanced algorithm reduces energy consumption, improves offloading efficiency and the accuracy of Federated SSL.

Title: Diffusion Models Are Real-Time Game Engines

Authors: Dani Valevski, Yaniv Leviathan, Moab Arar, Shlomi Fruchter
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2408.14837
Pdf URL: https://arxiv.org/pdf/2408.14837
Copy Paste: [[2408.14837]] Diffusion Models Are Real-Time Game Engines(https://arxiv.org/abs/2408.14837)
Keywords: diffusion
Abstract: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.

Title: Diffusion based Semantic Outlier Generation via Nuisance Awareness for Out-of-Distribution Detection

Authors: Suhee Yoon, Sanghyu Yoon, Hankook Lee, Ye Seul Sim, Sungik Choi, Kyungeun Lee, Hye-Seung Cho, Woohyung Lim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.14841
Pdf URL: https://arxiv.org/pdf/2408.14841
Copy Paste: [[2408.14841]] Diffusion based Semantic Outlier Generation via Nuisance Awareness for Out-of-Distribution Detection(https://arxiv.org/abs/2408.14841)
Keywords: diffusion
Abstract: Out-of-distribution (OOD) detection, which determines whether a given sample is part of the in-distribution (ID), has recently shown promising results through training with synthetic OOD datasets. Nonetheless, existing methods often produce outliers that are considerably distant from the ID, showing limited efficacy for capturing subtle distinctions between ID and OOD. To address these issues, we propose a novel framework, Semantic Outlier generation via Nuisance Awareness (SONA), which notably produces challenging outliers by directly leveraging pixel-space ID samples through diffusion models. Our approach incorporates SONA guidance, providing separate control over semantic and nuisance regions of ID samples. Thereby, the generated outliers achieve two crucial properties: (i) they present explicit semantic-discrepant information, while (ii) maintaining various levels of nuisance resemblance with ID. Furthermore, the improved OOD detector training with SONA outliers facilitates learning with a focus on semantic distinctions. Extensive experiments demonstrate the effectiveness of our framework, achieving an impressive AUROC of 88% on near-OOD datasets, which surpasses the performance of baseline methods by a significant margin of approximately 6%.

Title: From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models

Authors: Kaylee Chhua, Zhoujinyi Wen, Vedant Hathalia, Kevin Zhu, Sean O'Brien
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.14842
Pdf URL: https://arxiv.org/pdf/2408.14842
Copy Paste: [[2408.14842]] From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models(https://arxiv.org/abs/2408.14842)
Keywords: foundation model
Abstract: This study addresses the racial biases in facial expression recognition (FER) systems within Large Multimodal Foundation Models (LMFMs). Despite advances in deep learning and the availability of diverse datasets, FER systems often exhibit higher error rates for individuals with darker skin tones. Existing research predominantly focuses on traditional FER models (CNNs, RNNs, ViTs), leaving a gap in understanding racial biases in LMFMs. We benchmark four leading LMFMs: GPT-4o, PaliGemma, Gemini, and CLIP to assess their performance in facial emotion detection across different racial demographics. A linear classifier trained on CLIP embeddings obtains accuracies of 95.9\% for RADIATE, 90.3\% for Tarr, and 99.5\% for Chicago Face. Furthermore, we identify that Anger is misclassified as Disgust 2.1 times more often in Black Females than White Females. This study highlights the need for fairer FER systems and establishes a foundation for developing unbiased, accurate FER technologies. Visit this https URL for further information regarding the biases within facial expression recognition.

Title: Diffusion-Occ: 3D Point Cloud Completion via Occupancy Diffusion

Authors: Guoqing Zhang, Jian Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14846
Pdf URL: https://arxiv.org/pdf/2408.14846
Copy Paste: [[2408.14846]] Diffusion-Occ: 3D Point Cloud Completion via Occupancy Diffusion(https://arxiv.org/abs/2408.14846)
Keywords: diffusion, generative
Abstract: Point clouds are crucial for capturing three-dimensional data but often suffer from incompleteness due to limitations such as resolution and occlusion. Traditional methods typically rely on point-based approaches within discriminative frameworks for point cloud completion. In this paper, we introduce \textbf{Diffusion-Occ}, a novel framework for Diffusion Point Cloud Completion. Diffusion-Occ utilizes a two-stage coarse-to-fine approach. In the first stage, the Coarse Density Voxel Prediction Network (CDNet) processes partial points to predict coarse density voxels, streamlining global feature extraction through voxel classification, as opposed to previous regression-based methods. In the second stage, we introduce the Occupancy Generation Network (OccGen), a conditional occupancy diffusion model based on a transformer architecture and enhanced by our Point-Voxel Fuse (PVF) block. This block integrates coarse density voxels with partial points to leverage both global and local features for comprehensive completion. By thresholding the occupancy field, we convert it into a complete point cloud. Additionally, our method employs diverse training mixtures and efficient diffusion parameterization to enable effective one-step sampling during both training and inference. Experimental results demonstrate that Diffusion-Occ outperforms existing discriminative and generative methods.

Title: DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose

Authors: Yusuke Yoshiyasu, Leyuan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14860
Pdf URL: https://arxiv.org/pdf/2408.14860
Copy Paste: [[2408.14860]] DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose(https://arxiv.org/abs/2408.14860)
Keywords: diffusion, generative
Abstract: This paper presents DiffSurf, a transformer-based denoising diffusion model for generating and reconstructing 3D surfaces. Specifically, we design a diffusion transformer architecture that predicts noise from noisy 3D surface vertices and normals. With this architecture, DiffSurf is able to generate 3D surfaces in various poses and shapes, such as human bodies, hands, animals and man-made objects. Further, DiffSurf is versatile in that it can address various 3D downstream tasks including morphing, body shape variation and 3D human mesh fitting to 2D keypoints. Experimental results on 3D human model benchmarks demonstrate that DiffSurf can generate shapes with greater diversity and higher quality than previous generative models. Furthermore, when applied to the task of single-image 3D human mesh recovery, DiffSurf achieves accuracy comparable to prior techniques at a near real-time rate.

Title: User-level Social Multimedia Traffic Anomaly Detection with Meta-Learning

Authors: Tongtong Feng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.14884
Pdf URL: https://arxiv.org/pdf/2408.14884
Copy Paste: [[2408.14884]] User-level Social Multimedia Traffic Anomaly Detection with Meta-Learning(https://arxiv.org/abs/2408.14884)
Keywords: generative, anomaly
Abstract: Accuracy anomaly detection in user-level social multimedia traffic is crucial for privacy security. Compared with existing models that passively detect specific anomaly classes with large labeled training samples, user-level social multimedia traffic contains sizeable new anomaly classes with few labeled samples and has an imbalance, self-similar, and data-hungry nature. Recent advances, such as Generative Adversarial Networks (GAN), solve it by learning a sample generator only from seen class samples to synthesize new samples. However, if we detect many new classes, the number of synthesizing samples would be unfeasibly estimated, and this operation will drastically increase computational complexity and energy consumption. Motivation on these limitations, in this paper, we propose \textit{Meta-UAD}, a Meta-learning scheme for User-level social multimedia traffic Anomaly Detection. This scheme relies on the episodic training paradigm and learns from the collection of K-way-M-shot classification tasks, which can use the pre-trained model to adapt any new class with few samples by going through few iteration steps. Since user-level social multimedia traffic emerges from a complex interaction process of users and social applications, we further develop a feature extractor to improve scheme performance. It extracts statistical features using cumulative importance ranking and time-series features using an LSTM-based AutoEncoder. We evaluate our scheme on two public datasets and the results further demonstrate the superiority of Meta-UAD.

Title: MeshUp: Multi-Target Mesh Deformation via Blended Score Distillation

Authors: Hyunwoo Kim, Itai Lang, Noam Aigerman, Thibault Groueix, Vladimir G. Kim, Rana Hanocka
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2408.14899
Pdf URL: https://arxiv.org/pdf/2408.14899
Copy Paste: [[2408.14899]] MeshUp: Multi-Target Mesh Deformation via Blended Score Distillation(https://arxiv.org/abs/2408.14899)
Keywords: diffusion
Abstract: We propose MeshUp, a technique that deforms a 3D mesh towards multiple target concepts, and intuitively controls the region where each concept is expressed. Conveniently, the concepts can be defined as either text queries, e.g., "a dog" and "a turtle," or inspirational images, and the local regions can be selected as any number of vertices on the mesh. We can effectively control the influence of the concepts and mix them together using a novel score distillation approach, referred to as the Blended Score Distillation (BSD). BSD operates on each attention layer of the denoising U-Net of a diffusion model as it extracts and injects the per-objective activations into a unified denoising pipeline from which the deformation gradients are calculated. To localize the expression of these activations, we create a probabilistic Region of Interest (ROI) map on the surface of the mesh, and turn it into 3D-consistent masks that we use to control the expression of these activations. We demonstrate the effectiveness of BSD empirically and show that it can deform various meshes towards multiple objectives.

Title: Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning

Authors: Sakhinana Sagar Srinivas, Venkataramana Runkana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.14964
Pdf URL: https://arxiv.org/pdf/2408.14964
Copy Paste: [[2408.14964]] Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning(https://arxiv.org/abs/2408.14964)
Keywords: generative
Abstract: In the field of chemistry, the objective is to create novel molecules with desired properties, facilitating accurate property predictions for applications such as material design and drug screening. However, existing graph deep learning methods face limitations that curb their expressive power. To address this, we explore the integration of vast molecular domain knowledge from Large Language Models (LLMs) with the complementary strengths of Graph Neural Networks (GNNs) to enhance performance in property prediction tasks. We introduce a Multi-Modal Fusion (MMF) framework that synergistically harnesses the analytical prowess of GNNs and the linguistic generative and predictive abilities of LLMs, thereby improving accuracy and robustness in predicting molecular properties. Our framework combines the effectiveness of GNNs in modeling graph-structured data with the zero-shot and few-shot learning capabilities of LLMs, enabling improved predictions while reducing the risk of overfitting. Furthermore, our approach effectively addresses distributional shifts, a common challenge in real-world applications, and showcases the efficacy of learning cross-modal representations, surpassing state-of-the-art baselines on benchmark datasets for property prediction tasks.

Title: MegActor-$\Sigma$: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

Authors: Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan, Jin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.14975
Pdf URL: https://arxiv.org/pdf/2408.14975
Copy Paste: [[2408.14975]] MegActor-$\Sigma$: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer(https://arxiv.org/abs/2408.14975)
Keywords: diffusion
Abstract: Diffusion models have demonstrated superior performance in the field of portrait animation. However, current approaches relied on either visual or audio modality to control character movements, failing to exploit the potential of mixed-modal control. This challenge arises from the difficulty in balancing the weak control strength of audio modality and the strong control strength of visual modality. To address this issue, we introduce MegActor-$\Sigma$: a mixed-modal conditional diffusion transformer (DiT), which can flexibly inject audio and visual modality control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by leveraging the promising model structure of DiT and integrating audio and visual conditions through advanced modules within the DiT framework. To further achieve flexible combinations of mixed-modal control signals, we propose a ``Modality Decoupling Control" training strategy to balance the control strength between visual and audio modalities, along with the ``Amplitude Adjustment" inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter out public datasets and solely use this filtered dataset to train MegActor-$\Sigma$. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations, outperforming previous methods trained on private dataset.

Title: Pre-training Everywhere: Parameter-Efficient Fine-Tuning for Medical Image Analysis via Target Parameter Pre-training

Authors: Xingliang Lei, Yiwen Ye, Ziyang Chen, Minglei Shu, Yong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.15011
Pdf URL: https://arxiv.org/pdf/2408.15011
Copy Paste: [[2408.15011]] Pre-training Everywhere: Parameter-Efficient Fine-Tuning for Medical Image Analysis via Target Parameter Pre-training(https://arxiv.org/abs/2408.15011)
Keywords: self-supervised
Abstract: Parameter-efficient fine-tuning (PEFT) techniques have emerged to address issues of overfitting and high computational costs associated with fully fine-tuning in the paradigm of self-supervised learning. Mainstream methods based on PEFT involve adding a few trainable parameters while keeping the pre-trained parameters of the backbone fixed. These methods achieve comparative, and often superior, performance to fully fine-tuning, demonstrating the powerful representation ability of the pre-trained backbone. Despite its success, these methods typically ignore the initialization of the new parameters, often relying solely on random initialization. We argue that if pre-training is significantly beneficial, it should be applied to all parameters requiring representational capacity. Motivated by this insight, we propose a simple yet effective fine-tuning framework based on Target Parameter Pre-training (TPP). The target parameters refer to the new parameters introduced during fine-tuning. TPP includes an additional stage before PEFT to pre-train these target parameters. During this stage, the pre-trained backbone parameters are frozen, and only the target parameters are trainable. A defined pre-text task is used to encourage the target parameters to learn specific representations of downstream data. When PEFT is subsequently employed, the pre-trained target parameters are loaded to enhance fine-tuning efficiency. The proposed TPP framework is versatile, allowing for the integration of various pretext tasks for pre-training and supporting different PEFT methods as backbones. We evaluated the fine-tining performance of our method using five public datasets, including three modalities and two task types. The results demonstrate that the proposed TPP can be easily integrated into existing PEFT methods, significantly improving performance.

Title: Sequence-aware Pre-training for Echocardiography Probe Guidance

Authors: Haojun Jiang, Zhenguo Sun, Yu Sun, Ning Jia, Meng Li, Shaqi Luo, Shiji Song, Gao Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.15026
Pdf URL: https://arxiv.org/pdf/2408.15026
Copy Paste: [[2408.15026]] Sequence-aware Pre-training for Echocardiography Probe Guidance(https://arxiv.org/abs/2408.15026)
Keywords: self-supervised
Abstract: Cardiac ultrasound probe guidance aims to help novices adjust the 6-DOF probe pose to obtain high-quality sectional images. Cardiac ultrasound faces two major challenges: (1) the inherently complex structure of the heart, and (2) significant individual variations. Previous works have only learned the population-averaged 2D and 3D structures of the heart rather than personalized cardiac structural features, leading to a performance bottleneck. Clinically, we observed that sonographers adjust their understanding of a patient's cardiac structure based on prior scanning sequences, thereby modifying their scanning strategies. Inspired by this, we propose a sequence-aware self-supervised pre-training method. Specifically, our approach learns personalized 2D and 3D cardiac structural features by predicting the masked-out images and actions in a scanning sequence. We hypothesize that if the model can predict the missing content it has acquired a good understanding of the personalized cardiac structure. In the downstream probe guidance task, we also introduced a sequence modeling approach that models individual cardiac structural information based on the images and actions from historical scan data, enabling more accurate navigation decisions. Experiments on a large-scale dataset with 1.36 million samples demonstrated that our proposed sequence-aware paradigm can significantly reduce navigation errors, with translation errors decreasing by 15.90% to 36.87% and rotation errors decreasing by 11.13% to 20.77%, compared to state-of-the-art methods.

Title: Evidence-Enhanced Triplet Generation Framework for Hallucination Alleviation in Generative Question Answering

Authors: Haowei Du, Huishuai Zhang, Dongyan Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.15037
Pdf URL: https://arxiv.org/pdf/2408.15037
Copy Paste: [[2408.15037]] Evidence-Enhanced Triplet Generation Framework for Hallucination Alleviation in Generative Question Answering(https://arxiv.org/abs/2408.15037)
Keywords: generative
Abstract: To address the hallucination in generative question answering (GQA) where the answer can not be derived from the document, we propose a novel evidence-enhanced triplet generation framework, EATQA, encouraging the model to predict all the combinations of (Question, Evidence, Answer) triplet by flipping the source pair and the target label to understand their logical relationships, i.e., predict Answer(A), Question(Q), and Evidence(E) given a QE, EA, and QA pairs, respectively. Furthermore, we bridge the distribution gap to distill the knowledge from evidence in inference stage. Our framework ensures the model to learn the logical relation between query, evidence and answer, which simultaneously improves the evidence generation and query answering. In this paper, we apply EATQA to LLama and it outperforms other LLMs-based methods and hallucination mitigation approaches on two challenging GQA benchmarks. Further analysis shows that our method not only keeps prior knowledge within LLM, but also mitigates hallucination and generates faithful answers.

Title: Self-supervised Topic Taxonomy Discovery in the Box Embedding Space

Authors: Yuyin Lu, Hegang Chen, Pengbo Mao, Yanghui Rao, Haoran Xie, Fu Lee Wang, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.15050
Pdf URL: https://arxiv.org/pdf/2408.15050
Copy Paste: [[2408.15050]] Self-supervised Topic Taxonomy Discovery in the Box Embedding Space(https://arxiv.org/abs/2408.15050)
Keywords: self-supervised
Abstract: Topic taxonomy discovery aims at uncovering topics of different abstraction levels and constructing hierarchical relations between them. Unfortunately, most of prior work can hardly model semantic scopes of words and topics by holding the Euclidean embedding space assumption. What's worse, they infer asymmetric hierarchical relations by symmetric distances between topic embeddings. As a result, existing methods suffer from problems of low-quality topics at high abstraction levels and inaccurate hierarchical relations. To alleviate these problems, this paper develops a Box embedding-based Topic Model (BoxTM) that maps words and topics into the box embedding space, where the asymmetric metric is defined to properly infer hierarchical relations among topics. Additionally, our BoxTM explicitly infers upper-level topics based on correlation between specific topics through recursive clustering on topic boxes. Finally, extensive experiments validate high-quality of the topic taxonomy learned by BoxTM.

Title: Constrained Diffusion Models via Dual Training

Authors: Shervin Khalafi, Dongsheng Ding, Alejandro Ribeiro
Subjects: cs.LG, cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2408.15094
Pdf URL: https://arxiv.org/pdf/2408.15094
Copy Paste: [[2408.15094]] Constrained Diffusion Models via Dual Training(https://arxiv.org/abs/2408.15094)
Keywords: diffusion
Abstract: Diffusion models have attained prominence for their ability to synthesize a probability distribution for a given dataset via a diffusion process, enabling the generation of new data points with high fidelity. However, diffusion processes are prone to generating biased data based on the training dataset. To address this issue, we develop constrained diffusion models by imposing diffusion constraints based on desired distributions that are informed by requirements. Specifically, we cast the training of diffusion models under requirements as a constrained distribution optimization problem that aims to reduce the distribution difference between original and generated data while obeying constraints on the distribution of generated data. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints. To train constrained diffusion models, we develop a dual training algorithm and characterize the optimality of the trained constrained diffusion model. We empirically demonstrate the effectiveness of our constrained models in two constrained generation tasks: (i) we consider a dataset with one or more underrepresented classes where we train the model with constraints to ensure fairly sampling from all classes during inference; (ii) we fine-tune a pre-trained diffusion model to sample from a new dataset while avoiding overfitting.

Title: CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP

Authors: Zhenchen Tang, Zichuan Wang, Bo Peng, Jing Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.15098
Pdf URL: https://arxiv.org/pdf/2408.15098
Copy Paste: [[2408.15098]] CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP(https://arxiv.org/abs/2408.15098)
Keywords: generative
Abstract: With the rapid development of generative technologies, AI-Generated Images (AIGIs) have been widely applied in various aspects of daily life. However, due to the immaturity of the technology, the quality of the generated images varies, so it is important to develop quality assessment techniques for the generated images. Although some models have been proposed to assess the quality of generated images, they are inadequate when faced with the ever-increasing and diverse categories of generated images. Consequently, the development of more advanced and effective models for evaluating the quality of generated images is urgently needed. Recent research has explored the significant potential of the visual language model CLIP in image quality assessment, finding that it performs well in evaluating the quality of natural images. However, its application to generated images has not been thoroughly investigated. In this paper, we build on this idea and further explore the potential of CLIP in evaluating the quality of generated images. We design CLIP-AGIQA, a CLIP-based regression model for quality assessment of generated images, leveraging rich visual and textual knowledge encapsulated in CLIP. Particularly, we implement multi-category learnable prompts to fully utilize the textual knowledge in CLIP for quality assessment. Extensive experiments on several generated image quality assessment benchmarks, including AGIQA-3K and AIGCIQA2023, demonstrate that CLIP-AGIQA outperforms existing IQA models, achieving excellent results in evaluating the quality of generated images.

Title: AnomalousPatchCore: Exploring the Use of Anomalous Samples in Industrial Anomaly Detection

Authors: Mykhailo Koshil, Tilman Wegener, Detlef Mentrup, Simone Frintrop, Christian Wilms
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.15113
Pdf URL: https://arxiv.org/pdf/2408.15113
Copy Paste: [[2408.15113]] AnomalousPatchCore: Exploring the Use of Anomalous Samples in Industrial Anomaly Detection(https://arxiv.org/abs/2408.15113)
Keywords: anomaly
Abstract: Visual inspection, or industrial anomaly detection, is one of the most common quality control types in manufacturing. The task is to identify the presence of an anomaly given an image, e.g., a missing component on an image of a circuit board, for subsequent manual inspection. While industrial anomaly detection has seen a surge in recent years, most anomaly detection methods still utilize knowledge only from normal samples, failing to leverage the information from the frequently available anomalous samples. Additionally, they heavily rely on very general feature extractors pre-trained on common image classification datasets. In this paper, we address these shortcomings and propose the new anomaly detection system AnomalousPatchCore~(APC) based on a feature extractor fine-tuned with normal and anomalous in-domain samples and a subsequent memory bank for identifying unusual features. To fine-tune the feature extractor in APC, we propose three auxiliary tasks that address the different aspects of anomaly detection~(classification vs. localization) and mitigate the effect of the imbalance between normal and anomalous samples. Our extensive evaluation on the MVTec dataset shows that APC outperforms state-of-the-art systems in detecting anomalies, which is especially important in industrial anomaly detection given the subsequent manual inspection. In detailed ablation studies, we further investigate the properties of our APC.

Title: How transformers learn structured data: insights from hierarchical filtering

Authors: Jerome Garnier-Brun, Marc Mézard, Emanuele Moscato, Luca Saglietti
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, cs.CL
Abstract URL: https://arxiv.org/abs/2408.15138
Pdf URL: https://arxiv.org/pdf/2408.15138
Copy Paste: [[2408.15138]] How transformers learn structured data: insights from hierarchical filtering(https://arxiv.org/abs/2408.15138)
Keywords: generative
Abstract: We introduce a hierarchical filtering procedure for generative models of sequences on trees, enabling control over the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformer architectures can implement the optimal Belief Propagation algorithm on both root classification and masked language modeling tasks. Correlations at larger distances corresponding to increasing layers of the hierarchy are sequentially included as the network is trained. We analyze how the transformer layers succeed by focusing on attention maps from models trained with varying degrees of filtering. These attention maps show clear evidence for iterative hierarchical reconstruction of correlations, and we can relate these observations to a plausible implementation of the exact inference algorithm for the network sizes considered.

Title: PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization

Authors: Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.15185
Pdf URL: https://arxiv.org/pdf/2408.15185
Copy Paste: [[2408.15185]] PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization(https://arxiv.org/abs/2408.15185)
Keywords: anomaly
Abstract: Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce PoseWatch, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. PoseWatch features an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that enhances the representation of human motion over time, which is also beneficial for broader human behavior analysis tasks. The architecture's core, a Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that PoseWatch consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD. This work not only demonstrates the efficacy of PoseWatch but also highlights the potential of integrating Natural Language Processing techniques with computer vision to advance human behavior analysis.

Title: Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations

Authors: Yucheng Jiang, Yijia Shao, Dekun Ma, Sina J. Semnani, Monica S. Lam
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2408.15232
Pdf URL: https://arxiv.org/pdf/2408.15232
Copy Paste: [[2408.15232]] Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations(https://arxiv.org/abs/2408.15232)
Keywords: generative
Abstract: While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn by listening to and participating in conversations of their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA systems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents. The agents ask questions on the user's behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction, Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation, 70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.

Title: Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

Authors: Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, Steven M. Seitz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.15239
Pdf URL: https://arxiv.org/pdf/2408.15239
Copy Paste: [[2408.15239]] Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation(https://arxiv.org/abs/2408.15239)
Keywords: diffusion, generative
Abstract: We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.

Title: Generative Verifiers: Reward Modeling as Next-Token Prediction

Authors: Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.15240
Pdf URL: https://arxiv.org/pdf/2408.15240
Copy Paste: [[2408.15240]] Generative Verifiers: Reward Modeling as Next-Token Prediction(https://arxiv.org/abs/2408.15240)
Keywords: generative
Abstract: Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional inference-time compute via majority voting for better verification. We demonstrate that when using Gemma-based verifiers on algorithmic and grade-school math reasoning tasks, GenRM outperforms discriminative verifiers and LLM-as-a-Judge, showing a 16-64% improvement in the percentage of problems solved with Best-of-N. Furthermore, we show that GenRM scales favorably across dataset size, model capacity, and inference-time compute.

Title: GenRec: Unifying Video Generation and Recognition with Diffusion Models

Authors: Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.15241
Pdf URL: https://arxiv.org/pdf/2408.15241
Copy Paste: [[2408.15241]] GenRec: Unifying Video Generation and Recognition with Diffusion Models(https://arxiv.org/abs/2408.15241)
Keywords: diffusion, generative
Abstract: Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we aim to investigate whether such priors derived from a generative process are suitable for video recognition, and eventually joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework can naturally supports generation and recognition, and more importantly is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs the best class-conditioned image-to-video generation results, achieving 46.5 and 49.3 FVD scores on SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios that only limited frames can be observed.