2024-08-13

Title: Large Model Strategic Thinking, Small Model Efficiency: Transferring Theory of Mind in Large Language Models

Authors: Nunzio Lore, Alireza (Sepehr) Ilami, Babak Heydari
Subjects: cs.CL, cs.AI, cs.CY, cs.ET, cs.GT
Abstract URL: https://arxiv.org/abs/2408.05241
Pdf URL: https://arxiv.org/pdf/2408.05241
Copy Paste: [[2408.05241]] Large Model Strategic Thinking, Small Model Efficiency: Transferring Theory of Mind in Large Language Models(https://arxiv.org/abs/2408.05241)
Keywords: in-context
Abstract: As the performance of larger, newer Large Language Models continues to improve for strategic Theory of Mind (ToM) tasks, the demand for these state-of-the-art models increases commensurately. However, their deployment is costly both in terms of processing power and time. In this paper, we investigate the feasibility of creating smaller, simulation-ready agents by way of fine-tuning. To do this, we present a large pre-trained model with 20 unique scenarios that combine a social context with a social dilemma, record its answers, and use them for Q&A fine-tuning on a smaller model of the same family. Our focus is on in-context game-theoretic decision-making, the same domain within which human interaction occurs and that requires both a theory of mind (or a semblance thereof) and an understanding of social dynamics. We find that the fine-tuned smaller language model exhibited performance significantly closer to that of its larger relative, and that its improvements extended to areas and contexts beyond the ones provided in the training examples. On average for all games, through fine-tuning, the smaller model showed a 46% improvement in aligning with the behavior of the larger model, with 100% representing complete alignment. This suggests that our pipeline represents an efficient method to transmit some form of theory of mind to smaller models, creating improved and cheaply deployable algorithms in the process. Despite their simplicity and their associated shortcomings and limitations, our findings represent a stepping stone in the pursuit and training of specialized models for strategic and social decision making.

Title: The Role and Applications of Airport Digital Twin in Cyberattack Protection during the Generative AI Era

Authors: Abraham Itzhak Weinberg
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05248
Pdf URL: https://arxiv.org/pdf/2408.05248
Copy Paste: [[2408.05248]] The Role and Applications of Airport Digital Twin in Cyberattack Protection during the Generative AI Era(https://arxiv.org/abs/2408.05248)
Keywords: generative, anomaly
Abstract: In recent years, the threat facing airports from growing and increasingly sophisticated cyberattacks has become evident. Airports are considered a strategic national asset, so protecting them from attacks, specifically cyberattacks, is a crucial mission. One way to increase airports' security is by using Digital Twins (DTs). This paper shows and demonstrates how DTs can enhance the security mission. The integration of DTs with Generative AI (GenAI) algorithms can lead to synergy and new frontiers in fighting cyberattacks. The paper exemplifies ways to model cyberattack scenarios using simulations and generate synthetic data for testing defenses. It also discusses how DTs can be used as a crucial tool for vulnerability assessment by identifying weaknesses, prioritizing, and accelerating remediations in case of cyberattacks. Moreover, the paper demonstrates approaches for anomaly detection and threat hunting using Machine Learning (ML) and GenAI algorithms. Additionally, the paper provides impact prediction and recovery coordination methods that can be used by DT operators and stakeholders. It also introduces ways to harness the human factor by integrating training and simulation algorithms with Explainable AI (XAI) into the DT platforms. Lastly, the paper offers future applications and technologies that can be utilized in DT environments.

Title: The impact of internal variability on benchmarking deep learning climate emulators

Authors: Björn Lütjens, Raffaele Ferrari, Duncan Watson-Parris, Noelle Selin
Subjects: cs.LG, cs.AI, cs.CE, cs.CV
Abstract URL: https://arxiv.org/abs/2408.05288
Pdf URL: https://arxiv.org/pdf/2408.05288
Copy Paste: [[2408.05288]] The impact of internal variability on benchmarking deep learning climate emulators(https://arxiv.org/abs/2408.05288)
Keywords: foundation model
Abstract: Full-complexity Earth system models (ESMs) are computationally very expensive, limiting their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can directly map emissions onto climate outcomes, and benchmarks are being used to evaluate their accuracy on standardized tasks and datasets. We investigate a popular benchmark in data-driven climate emulation, ClimateBench, on which deep learning-based emulators are currently achieving the best performance. We implement a linear regression-based emulator, akin to pattern scaling, and find that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally-resolved surface-level climate variables. While emulating surface temperature is expected to be predominantly linear, this result is surprising for emulating precipitation. We identify that this outcome is a result of high levels of internal variability in the benchmark targets. To address internal variability, we update the benchmark targets with ensemble averages from the MPI-ESM1.2-LR model that contain 50 instead of 3 climate simulations per emission pathway. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based model for emulating precipitation. We publish our code, data, and an interactive tutorial at this http URL.
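Sketch: The linear pattern-scaling baseline mentioned in the abstract can be illustrated as a per-grid-cell least-squares fit. This is a minimal sketch, not the authors' code: it assumes the predictor is a global-mean temperature series and the target is a flattened field of grid cells, and the function names `fit_pattern_scaling` and `predict_pattern_scaling` are hypothetical.

```python
import numpy as np

def fit_pattern_scaling(global_mean_temp, local_field):
    """Fit local_field[t, cell] ~ slope[cell] * global_mean_temp[t] + intercept[cell].

    global_mean_temp: shape (T,), assumed predictor (one value per time step).
    local_field: shape (T, n_cells), target variable on a flattened grid.
    Returns (slope, intercept), each of shape (n_cells,).
    """
    # Design matrix with a column of ones so the intercept is fitted too.
    design = np.column_stack([global_mean_temp, np.ones_like(global_mean_temp)])
    # One least-squares solve handles every grid cell at once.
    coef, *_ = np.linalg.lstsq(design, local_field, rcond=None)
    return coef[0], coef[1]

def predict_pattern_scaling(global_mean_temp, slope, intercept):
    """Emulate the local field for a new global-mean temperature series."""
    return np.outer(global_mean_temp, slope) + intercept
```

Because the fit is a single linear solve, it has no hyperparameters to tune, which is part of what makes it a useful reference point when benchmark targets are noisy.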

Title: Hybrid Efficient Unsupervised Anomaly Detection for Early Pandemic Case Identification

Authors: Ghazal Ghajari, Mithun Kumar PK, Fathi Amsaad
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2408.05347
Pdf URL: https://arxiv.org/pdf/2408.05347
Copy Paste: [[2408.05347]] Hybrid Efficient Unsupervised Anomaly Detection for Early Pandemic Case Identification(https://arxiv.org/abs/2408.05347)
Keywords: anomaly
Abstract: Unsupervised anomaly detection is a promising technique for identifying unusual patterns in data without the need for labeled training examples. This approach is particularly valuable for early case detection in epidemic management, especially when early-stage data are scarce. This research introduces a novel hybrid method for anomaly detection that combines distance and density measures, enhancing its applicability across various infectious diseases. Our method is especially relevant in pandemic situations, as demonstrated during the COVID-19 crisis, where traditional supervised classification methods fall short due to limited data. The efficacy of our method is evaluated using COVID-19 chest X-ray data, where it significantly outperforms established unsupervised techniques. It achieves an average AUC of 77.43%, surpassing the AUC of Isolation Forest at 73.66% and KNN at 52.93%. These results highlight the potential of our hybrid anomaly detection method to improve early detection capabilities in diverse epidemic scenarios, thereby facilitating more effective and timely responses.
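Sketch: The abstract does not say how the distance and density measures are combined, so the following is only one simple possibility, not the paper's method. It assumes a k-nearest-neighbour distance as the distance term, a LOF-like relative density as the density term, and an equal-weight sum after min-max scaling; the name `hybrid_anomaly_scores` and the parameter `k` are hypothetical.

```python
import numpy as np

def hybrid_anomaly_scores(points, k=5):
    """Return one anomaly score per row of `points` (higher = more anomalous)."""
    # Pairwise Euclidean distances; the diagonal is excluded from neighbour search.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices and distances of the k nearest neighbours of every point.
    knn_idx = np.argsort(dist, axis=1)[:, :k]
    knn_dist = np.take_along_axis(dist, knn_idx, axis=1)

    # Distance term: how far away the k-th neighbour is.
    distance_score = knn_dist[:, -1]

    # Density term: neighbours' mean density relative to the point's own density.
    density = 1.0 / (knn_dist.mean(axis=1) + 1e-12)
    density_score = density[knn_idx].mean(axis=1) / density

    def minmax(values):
        return (values - values.min()) / (values.max() - values.min() + 1e-12)

    # Assumed combination rule: equal-weight sum of the two scaled terms.
    return 0.5 * minmax(distance_score) + 0.5 * minmax(density_score)
```

The pairwise distance matrix makes this quadratic in the number of samples, so it is only suitable for small feature sets.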

Title: PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification

Authors: Bin Hu, Xinggang Wang, Wenyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05398
Pdf URL: https://arxiv.org/pdf/2408.05398
Copy Paste: [[2408.05398]] PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification(https://arxiv.org/abs/2408.05398)
Keywords: self-supervised
Abstract: Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of Vision Transformer (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has been greatly improved. Person ReID requires extracting highly discriminative local fine-grained features of the human body, while traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID, effectively extracts high-quality global and local features through large-scale unsupervised pre-training that combines masked image modeling with discriminative contrastive learning, and then conducts supervised fine-tuning on the person ReID task. This person feature extraction method based on ViT with masked image modeling (PersonViT) is unsupervised and scalable and has strong generalization capabilities, overcoming the problem of difficult annotation in supervised person ReID, and achieves state-of-the-art results on publicly available benchmark datasets, including MSMT17, Market1501, DukeMTMC-reID, and Occluded-Duke. The code and pre-trained models of the PersonViT method are released at this https URL to promote further research in the person ReID field.

Title: LaiDA: Linguistics-aware In-context Learning with Data Augmentation for Metaphor Components Identification

Authors: Hongde Liu, Chenyuan He, Feiyang Meng, Changyong Niu, Yuxiang Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05404
Pdf URL: https://arxiv.org/pdf/2408.05404
Copy Paste: [[2408.05404]] LaiDA: Linguistics-aware In-context Learning with Data Augmentation for Metaphor Components Identification(https://arxiv.org/abs/2408.05404)
Keywords: in-context
Abstract: Metaphor Components Identification (MCI) contributes to enhancing machine understanding of metaphors, thereby advancing downstream natural language processing tasks. However, the complexity, diversity, and dependency on context and background knowledge pose significant challenges for MCI. Large language models (LLMs) offer new avenues for accurate comprehension of complex natural language texts due to their strong semantic analysis and extensive commonsense knowledge. In this research, a new LLM-based framework is proposed, named Linguistics-aware In-context Learning with Data Augmentation (LaiDA). Specifically, ChatGPT and supervised fine-tuning are utilized to tailor a high-quality dataset. LaiDA incorporates a simile dataset for pre-training. A graph attention network encoder generates linguistically rich feature representations to retrieve similar examples. Subsequently, LLM is fine-tuned with prompts that integrate linguistically similar examples. LaiDA ranked 2nd in Subtask 2 of NLPCC2024 Shared Task 9, demonstrating its effectiveness. Code and data are available at this https URL.

Title: Style-Preserving Lip Sync via Audio-Aware Style Reference

Authors: Weizhi Zhong, Jichang Li, Yinqi Cai, Liang Lin, Guanbin Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2408.05412
Pdf URL: https://arxiv.org/pdf/2408.05412
Copy Paste: [[2408.05412]] Style-Preserving Lip Sync via Audio-Aware Style Reference(https://arxiv.org/abs/2408.05412)
Keywords: diffusion
Abstract: Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to the unique speaking styles of individuals, posing a notable challenge for audio-driven lip sync. Earlier methods for such tasks often bypassed the modeling of personalized speaking styles, resulting in sub-optimal lip sync conforming to the general styles. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve the speaking styles well due to their inaccuracy in style aggregation. This work proposes an innovative audio-aware style reference scheme that effectively leverages the relationships between input audio and reference audio from a style reference video to address style-preserving audio-driven lip sync. Specifically, we first develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from the style reference video. Afterwards, to better render the lip motion into realistic talking face video, we devise a conditional latent diffusion model, integrating lip motion through modulated convolutional layers and fusing reference facial images via spatial cross-attention layers. Extensive experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.

Title: High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Authors: Weizhi Zhong, Junfan Lin, Peixin Chen, Liang Lin, Guanbin Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2408.05416
Pdf URL: https://arxiv.org/pdf/2408.05416
Copy Paste: [[2408.05416]] High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model(https://arxiv.org/abs/2408.05416)
Keywords: diffusion, generative
Abstract: Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling end-to-end optimization. Specifically, we first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks via differentiable cross-attention, which enables end-to-end optimization for improved lip synchronization. Besides, TalkFormer employs implicit feature warping to align the reference image features with the target motion for preserving more appearance details. Extensive experiments demonstrate that our approach can synthesize high-fidelity and lip-synced talking face videos, preserving more subject appearance details from the reference image.

Title: Multimodal generative semantic communication based on latent diffusion model

Authors: Weiqi Fu, Lianming Xu, Xin Wu, Haoyang Wei, Li Wang
Subjects: cs.CV, cs.NI
Abstract URL: https://arxiv.org/abs/2408.05455
Pdf URL: https://arxiv.org/pdf/2408.05455
Copy Paste: [[2408.05455]] Multimodal generative semantic communication based on latent diffusion model(https://arxiv.org/abs/2408.05455)
Keywords: diffusion, generative
Abstract: In emergencies, the ability to quickly and accurately gather environmental data and command information, and to make timely decisions, is particularly critical. Traditional semantic communication frameworks, primarily based on a single modality, are susceptible to complex environments and lighting conditions, thereby limiting decision accuracy. To this end, this paper introduces a multimodal generative semantic communication framework named mm-GESCO. The framework ingests streams of visible and infrared modal image data, generates fused semantic segmentation maps, and transmits them using a combination of one-hot encoding and zlib compression techniques to enhance data transmission efficiency. At the receiving end, the framework can reconstruct the original multimodal images based on the semantic maps. Additionally, a latent diffusion model based on contrastive learning is designed to align different modal data within the latent space, allowing mm-GESCO to reconstruct latent features of any modality presented at the input. Experimental results demonstrate that mm-GESCO achieves a compression ratio of up to 200 times, surpassing the performance of existing semantic communication frameworks and exhibiting excellent performance in downstream tasks such as object classification and detection.
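Sketch: The transmission step described in the abstract (one-hot encoding followed by zlib compression) can be illustrated with a small round trip. This is a minimal sketch under stated assumptions, not mm-GESCO's implementation: the bit-packing step, the function names `encode_semantic_map` and `decode_semantic_map`, and the compression level are all hypothetical choices.

```python
import zlib
import numpy as np

def encode_semantic_map(seg_map, num_classes):
    """One-hot encode an (H, W) integer label map, bit-pack it, and zlib-compress it."""
    one_hot = np.eye(num_classes, dtype=np.uint8)[seg_map]  # shape (H, W, C)
    packed = np.packbits(one_hot, axis=None)                # 8 one-hot bits per byte
    return zlib.compress(packed.tobytes(), 9)

def decode_semantic_map(payload, shape, num_classes):
    """Invert encode_semantic_map; `shape` is the (H, W) of the original map."""
    height, width = shape
    packed = np.frombuffer(zlib.decompress(payload), dtype=np.uint8)
    bits = np.unpackbits(packed, count=height * width * num_classes)
    # The position of the single 1 along the class axis is the label.
    return bits.reshape(height, width, num_classes).argmax(axis=-1)
```

Segmentation maps tend to contain large uniform regions, so the one-hot bit planes are highly repetitive and compress well with a general-purpose codec such as zlib.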

Title: Path-LLM: A Shortest-Path-based LLM Learning for Unified Graph Representation

Authors: Wenbo Shang, Xuliang Zhu, Xin Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05456
Pdf URL: https://arxiv.org/pdf/2408.05456
Copy Paste: [[2408.05456]] Path-LLM: A Shortest-Path-based LLM Learning for Unified Graph Representation(https://arxiv.org/abs/2408.05456)
Keywords: self-supervised
Abstract: Unified graph representation learning aims to produce node embeddings, which can be applied to multiple downstream applications. However, existing studies based on graph neural networks and language models either suffer from the limitations of numerous training needed toward specific downstream predictions or have shallow semantic features. In this work, we propose a novel Path-LLM model to learn unified graph representation, which leverages a powerful large language model (LLM) to incorporate our proposed path features. Our Path-LLM framework consists of several well-designed techniques. First, we develop a new mechanism of long-to-short shortest path (L2SP) selection, which covers essential connections between different dense groups. An in-depth comparison of different path selection plans is offered to illustrate the strength of our designed L2SP. Then, we design path textualization to obtain L2SP-based training texts. Next, we feed the texts into a self-supervised LLM training process to learn embeddings. Extensive experiments on benchmarks validate the superiority of Path-LLM against the state-of-the-art WalkLM method on two classical graph learning tasks (node classification and link prediction) and one NP-hard graph query processing task (keyword search), meanwhile saving more than 90% of training paths.

Title: ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

Authors: Ziyi Gao, Kai Chen, Zhipeng Wei, Tingshu Mou, Jingjing Chen, Zhiyu Tan, Hao Li, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05479
Pdf URL: https://arxiv.org/pdf/2408.05479
Copy Paste: [[2408.05479]] ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack(https://arxiv.org/abs/2408.05479)
Keywords: diffusion
Abstract: Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in diffusion models' latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames. TALO can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe concurrently facilitates inter-frame interactions into the attack process, inducing more diverse and robust gradients, thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.

Title: ZePo: Zero-Shot Portrait Stylization with Faster Sampling

Authors: Jin Liu, Huaibo Huang, Jie Cao, Ran He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05492
Pdf URL: https://arxiv.org/pdf/2408.05492
Copy Paste: [[2408.05492]] ZePo: Zero-Shot Portrait Stylization with Faster Sampling(https://arxiv.org/abs/2408.05492)
Keywords: diffusion
Abstract: Diffusion-based text-to-image generation models have significantly advanced the field of art content synthesis. However, current portrait stylization methods generally require either model fine-tuning based on examples or the employment of DDIM Inversion to revert images to noise space, both of which substantially decelerate the image generation process. To overcome these limitations, this paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps. We observed that Latent Consistency Models employing consistency distillation can effectively extract representative Consistency Features from noisy images. To blend the Consistency Features extracted from both content and style images, we introduce a Style Enhancement Attention Control technique that meticulously merges content and style features within the attention space of the target image. Moreover, we propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control. Extensive experiments have validated the effectiveness of our proposed framework in enhancing stylization efficiency and fidelity. The code is available at this https URL.

Title: SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning

Authors: Yuze Zhao, Jintao Huang, Jinghan Hu, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, Yingda Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05517
Pdf URL: https://arxiv.org/pdf/2408.05517
Copy Paste: [[2408.05517]] SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning(https://arxiv.org/abs/2408.05517)
Keywords: foundation model
Abstract: Recent developments in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leveraged Attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text-classification and sequence-labeling, as well as multi-modal tasks like Visual Question Answering (VQA) and Optical Character Recognition (OCR), which were previously addressed using different models, can now be tackled based on one foundation model. Consequently, the training and lightweight fine-tuning of LLMs and MLLMs, especially those based on Transformer architecture, has become particularly important. In recognition of these overwhelming needs, we develop SWIFT, a customizable one-stop infrastructure for large models. With support for 300+ LLMs and 50+ MLLMs, SWIFT stands as the open-source framework that provides the most comprehensive support for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. In addition to the core functionalities of fine-tuning, SWIFT also integrates post-training processes such as inference, evaluation, and model quantization, to facilitate fast adoption of large models in various application scenarios. With a systematic integration of various training techniques, SWIFT offers helpful utilities such as benchmark comparisons among different training techniques for large models. For fine-tuning models specialized in agent frameworks, we show that notable improvements on the ToolBench leaderboard can be achieved by training with customized datasets on SWIFT, with an increase of 5.2%-21.8% in the Act.EM metric over various baseline models, a reduction in hallucination by 1.6%-14.1%, and an average performance improvement of 8%-17%.

Title: Large Language Model-based Role-Playing for Personalized Medical Jargon Extraction

Authors: Jung Hoon Lim, Sunjae Kwon, Zonghai Yao, John P. Lalor, Hong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.05555
Pdf URL: https://arxiv.org/pdf/2408.05555
Copy Paste: [[2408.05555]] Large Language Model-based Role-Playing for Personalized Medical Jargon Extraction(https://arxiv.org/abs/2408.05555)
Keywords: generative, in-context
Abstract: Previous studies reveal that Electronic Health Records (EHR), which have been widely adopted in the U.S. to allow patients to access their personal medical information, do not have high readability to patients due to the prevalence of medical jargon. Tailoring medical notes to individual comprehension by identifying jargon that is difficult for each person will enhance the utility of generative models. We present the first quantitative analysis to measure the impact of role-playing in LLM in medical term extraction. By comparing the results of Mechanical Turk workers over 20 sentences, our study demonstrates that LLM role-playing improves F1 scores in 95% of cases across 14 different socio-demographic backgrounds. Furthermore, applying role-playing with in-context learning outperformed the previous state-of-the-art models. Our research showed that ChatGPT can improve traditional medical term extraction systems by utilizing role-play to deliver personalized patient education, a potential that previous models had not achieved.

Title: What Matters in Autonomous Driving Anomaly Detection: A Weakly Supervised Horizon

Authors: Utkarsh Tiwari, Snehashis Majhi, Michal Balazia, François Brémond
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05562
Pdf URL: https://arxiv.org/pdf/2408.05562
Copy Paste: [[2408.05562]] What Matters in Autonomous Driving Anomaly Detection: A Weakly Supervised Horizon(https://arxiv.org/abs/2408.05562)
Keywords: anomaly
Abstract: Video anomaly detection (VAD) in autonomous driving scenarios is an important task; however, it involves several challenges due to the ego-centric views and moving camera. Because of this, it remains largely under-explored. While recent developments in weakly-supervised VAD methods have shown remarkable progress in detecting critical real-world anomalies in static camera scenarios, the development and validation of such methods are yet to be explored for moving camera VAD. This is mainly due to existing datasets like DoTA not following the training pre-conditions of weakly-supervised learning. In this paper, we aim to promote weakly-supervised method development for autonomous driving VAD. We reorganize the DoTA dataset and aim to validate recent powerful weakly-supervised VAD methods on moving camera scenarios. Further, we provide a detailed analysis of what modifications to state-of-the-art methods can significantly improve the detection performance. Towards this, we propose a "feature transformation block" and through experimentation we show that our propositions can significantly empower existing weakly-supervised VAD methods in improving VAD in autonomous driving. Our codes/dataset/demo will be released at this http URL.

Title: Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

Authors: Mathieu Cyrille Simon, Pascal Frossard, Christophe De Vleeschouwer
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2408.05599
Pdf URL: https://arxiv.org/pdf/2408.05599
Copy Paste: [[2408.05599]] Sequential Representation Learning via Static-Dynamic Conditional Disentanglement(https://arxiv.org/abs/2408.05599)
Keywords: self-supervised
Abstract: This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static/dynamic variables and that improves the model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions for the ground truth factors to be identifiable, and to the introduction of a novel theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our new framework. The experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.

Title: UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling

Authors: Kai Yu, Yang Zhou, Yang Bai, Zhi Da Soh, Xinxing Xu, Rick Siow Mong Goh, Ching-Yu Cheng, Yong Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05618
Pdf URL: https://arxiv.org/pdf/2408.05618
Copy Paste: [[2408.05618]] UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling(https://arxiv.org/abs/2408.05618)
Keywords: foundation model
Abstract: Retinal foundation models aim to learn generalizable representations from diverse retinal images, facilitating label-efficient model adaptation across various ophthalmic tasks. Despite their success, current retinal foundation models are generally restricted to a single imaging modality, such as Color Fundus Photography (CFP) or Optical Coherence Tomography (OCT), limiting their versatility. Moreover, these models may struggle to fully leverage expert annotations and overlook the valuable domain knowledge essential for domain-specific representation learning. To overcome these limitations, we introduce UrFound, a retinal foundation model designed to learn universal representations from both multimodal retinal images and domain knowledge. UrFound is equipped with a modality-agnostic image encoder and accepts either CFP or OCT images as inputs. To integrate domain knowledge into representation learning, we encode expert annotation in text supervision and propose a knowledge-guided masked modeling strategy for model pre-training. It involves reconstructing randomly masked patches of retinal images while predicting masked text tokens conditioned on the corresponding retinal image. This approach aligns multimodal images and textual expert annotations within a unified latent space, facilitating generalizable and domain-specific representation learning. Experimental results demonstrate that UrFound exhibits strong generalization ability and data efficiency when adapting to various tasks in retinal image analysis. By training on ~180k retinal images, UrFound significantly outperforms the state-of-the-art retinal foundation model trained on up to 1.6 million unlabelled images across 8 public retinal datasets. Our code and data are available at this https URL.
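Sketch: The image half of the masked modeling objective (reconstructing randomly masked patches) starts from a random split of patches into visible and masked sets. The following is a generic masked-image-modeling illustration, not UrFound's code; the patch size, the mask ratio, and the function name `random_patch_mask` are hypothetical.

```python
import numpy as np

def random_patch_mask(image, patch_size, mask_ratio, rng):
    """Split an (H, W, C) image into flattened patches and mask a random subset.

    Returns (visible_patches, masked_patches, mask), where mask[i] is True
    when patch i is hidden from the encoder and must be reconstructed.
    """
    height, width, channels = image.shape
    grid_h, grid_w = height // patch_size, width // patch_size

    # Crop to a whole number of patches, then reshape to (num_patches, patch_dim).
    cropped = image[: grid_h * patch_size, : grid_w * patch_size]
    patches = (
        cropped.reshape(grid_h, patch_size, grid_w, patch_size, channels)
        .transpose(0, 2, 1, 3, 4)
        .reshape(grid_h * grid_w, patch_size * patch_size * channels)
    )

    # Choose which patches to hide, uniformly at random without replacement.
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return patches[~mask], patches[mask], mask
```

A reconstruction loss would then be computed only on the masked patches; the text-token masking described in the abstract would follow the same pattern on token indices.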

Title: Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Authors: Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.05636
Pdf URL: https://arxiv.org/pdf/2408.05636
Copy Paste: [[2408.05636]] Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion(https://arxiv.org/abs/2408.05636)
Keywords: diffusion
Abstract: Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling parallel sequence verification, its efficiency remains inherently limited by the reliance on incremental token generation in existing draft models. To overcome this limitation, this paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences. This allows parallelization of both the drafting and verification steps, providing significant speed-ups to the inference process. Our proposed approach, Speculative Diffusion Decoding (SpecDiff), is validated on standard language generation benchmarks and empirically demonstrated to provide up to an 8.7x speed-up over standard generation processes and up to a 2.5x speed-up over existing speculative decoding approaches.
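To make the draft-then-verify idea concrete, here is a minimal sketch of the verification half of speculative decoding under greedy decoding. It is not the SpecDiff implementation: the diffusion drafter is not modeled, `verify_draft` and `toy_target` are hypothetical names, and the loop checks tokens one by one for readability, whereas a real system would score all draft positions in a single parallel pass of the target model and typically uses a probabilistic acceptance rule.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Keep draft tokens for as long as they match the target model's greedy choice.

    Returns the tokens to append to `prefix`. The first mismatching draft token
    is replaced by the target's own token, so at least one token is always produced.
    """
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # Reject the rest of the draft and fall back to the target's token.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for a target model: next token is (last + 1) mod 10.

    Assumes a non-empty context.
    """
    return (context[-1] + 1) % 10


# The draft agrees on the first two tokens and diverges on the third.
tokens = verify_draft([1], [2, 3, 9, 5], toy_target)
```

In this toy run the two matching draft tokens are kept and the third is replaced by the target's choice, so three tokens are produced from a single verification round.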

Title: StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model

Authors: Ziyin Zhou, Ke Sun, Zhongxi Chen, Huafeng Kuang, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05669
Pdf URL: https://arxiv.org/pdf/2408.05669
Copy Paste: [[2408.05669]] StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model(https://arxiv.org/abs/2408.05669)
Keywords: diffusion, generative
Abstract: The rapid progress in generative models has given rise to the critical task of AI-Generated Content Stealth (AIGC-S), which aims to create AI-generated images that can evade both forensic detectors and human inspection. This task is crucial for understanding the vulnerabilities of existing detection methods and developing more robust techniques. However, current adversarial attacks often introduce visible noise, have poor transferability, and fail to address spectral differences between AI-generated and genuine images. To address this, we propose StealthDiffusion, a framework based on stable diffusion that modifies AI-generated images into high-quality, imperceptible adversarial examples capable of evading state-of-the-art forensic detectors. StealthDiffusion comprises two main components: Latent Adversarial Optimization, which generates adversarial perturbations in the latent space of stable diffusion, and Control-VAE, a module that reduces spectral differences between the generated adversarial images and genuine images without affecting the original diffusion model's generation process. Extensive experiments show that StealthDiffusion is effective in both white-box and black-box settings, transforming AI-generated images into high-quality adversarial forgeries with frequency spectra similar to genuine images. These forgeries are classified as genuine by advanced forensic classifiers and are difficult for humans to distinguish.

Title: SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction

Authors: Bohao Xu, Yingzhou Lu, Chenhao Li, Ling Yue, Xiao Wang, Nan Hao, Tianfan Fu, Jim Chen
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2408.05696
Pdf URL: https://arxiv.org/pdf/2408.05696
Copy Paste: [[2408.05696]] SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction(https://arxiv.org/abs/2408.05696)
Keywords: self-supervised, foundation model
Abstract: In drug discovery, predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small-molecule drugs is critical for ensuring safety and efficacy. However, the process of accurately predicting these properties is often resource-intensive and requires extensive experimental data. To address this challenge, we propose SMILES-Mamba, a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies. The model first pre-trains on a large corpus of unlabeled SMILES strings to capture the underlying chemical structure and relationships, before being fine-tuned on smaller, labeled datasets specific to ADMET tasks. Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks, highlighting the potential of self-supervised learning in improving molecular property prediction. This approach not only enhances prediction accuracy but also reduces the dependence on large, labeled datasets, offering a promising direction for future research in drug discovery.

Title: Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Authors: Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05710
Pdf URL: https://arxiv.org/pdf/2408.05710
Copy Paste: [[2408.05710]] Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators(https://arxiv.org/abs/2408.05710)
Keywords: diffusion
Abstract: This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens during the denoising generation phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, integrating mediator tokens simplifies the attention module's complexity to a linear scale, enhancing the efficiency of global attention processes. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further decreases the required computational FLOPs for generation, simultaneously facilitating the generation of high-quality images within the constraints of varied inference budgets. Extensive experiments demonstrate that the proposed method can improve the generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at this https URL.

Title: Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval

Authors: Rukai Wei, Heng Cui, Yu Liu, Yufeng Hou, Yanzhao Xie, Ke Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05711
Pdf URL: https://arxiv.org/pdf/2408.05711
Copy Paste: [[2408.05711]] Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval(https://arxiv.org/abs/2408.05711)
Keywords: self-supervised
Abstract: Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal auto-encoders to enhance the model's understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized clues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods.
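As a rough illustration of what retrieval in a joint Hamming space means, the sketch below binarizes real-valued features into hash codes and ranks database items by Hamming distance to a query. It does not reproduce CMAH: the feature vectors are made up, the function names are hypothetical, and the learned encoders and contrastive training that would produce the features are assumed to exist elsewhere.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> List[int]:
    """Map real-valued features to a {-1, +1} hash code by taking the sign."""
    return [1 if x >= 0 else -1 for x in features]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length hash codes differ."""
    if len(a) != len(b):
        raise ValueError("hash codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def retrieve(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted from nearest to farthest in Hamming distance."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))


# Made-up features: one image query and two point-cloud entries.
image_code = binarize([0.7, -0.2, 0.1, -0.9])
point_cloud_codes = [
    binarize([-0.4, 0.6, -0.2, 0.9]),   # differs from the query in all 4 bits
    binarize([0.5, -0.1, -0.3, -0.8]),  # differs from the query in 1 bit
]
ranking = retrieve(image_code, point_cloud_codes)
```

Because both modalities are mapped to codes of the same length, a 2D query can be compared against 3D entries with nothing more than a bit count.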

Title: SSL: A Self-similarity Loss for Improving Generative Image Super-resolution

Authors: Du Chen, Zhengqiang Zhang, Jie Liang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05713
Pdf URL: https://arxiv.org/pdf/2408.05713
Copy Paste: [[2408.05713]] SSL: A Self-similarity Loss for Improving Generative Image Super-resolution(https://arxiv.org/abs/2408.05713)
Keywords: diffusion, generative
Abstract: Generative adversarial networks (GAN) and generative diffusion models (DM) have been widely used in real-world image super-resolution (Real-ISR) to enhance the image perceptual quality. However, these generative models are prone to generating visual artifacts and false image structures, resulting in unnatural Real-ISR results. Based on the fact that natural images exhibit high self-similarities, i.e., a local patch can have many similar patches to it in the whole image, in this work we propose a simple yet effective self-similarity loss (SSL) to improve the performance of generative Real-ISR models, enhancing the hallucination of structural and textural details while reducing the unpleasant visual artifacts. Specifically, we compute a self-similarity graph (SSG) of the ground-truth image, and enforce the SSG of the Real-ISR output to be close to it. To reduce the training cost and focus on edge areas, we generate an edge mask from the ground-truth image, and compute the SSG only on the masked pixels. The proposed SSL serves as a general plug-and-play penalty, which could be easily applied to off-the-shelf Real-ISR models. Our experiments demonstrate that, by coupling with SSL, the performance of many state-of-the-art Real-ISR models, including GAN- and DM-based ones, can be largely improved, reproducing more perceptually realistic image details and eliminating many false reconstructions and visual artifacts. Code and supplementary material can be found at this https URL

Title: MTSCI: A Conditional Diffusion Model for Multivariate Time Series Consistent Imputation

Authors: Jianping Zhou, Junhao Li, Guanjie Zheng, Xinbing Wang, Chenghu Zhou
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2408.05740
Pdf URL: https://arxiv.org/pdf/2408.05740
Copy Paste: [[2408.05740]] MTSCI: A Conditional Diffusion Model for Multivariate Time Series Consistent Imputation(https://arxiv.org/abs/2408.05740)
Keywords: diffusion, generative
Abstract: Missing values are prevalent in multivariate time series, compromising the integrity of analyses and degrading the performance of downstream tasks. Consequently, research has focused on multivariate time series imputation, aiming to accurately impute the missing values based on available observations. A key research question is how to ensure imputation consistency, i.e., intra-consistency between observed and imputed values, and inter-consistency between adjacent windows after imputation. However, previous methods rely solely on the inductive bias of the imputation targets to guide the learning process, ignoring imputation consistency and ultimately resulting in poor performance. Diffusion models, known for their powerful generative abilities, tend to generate consistent results based on available observations. Therefore, we propose a conditional diffusion model for Multivariate Time Series Consistent Imputation (MTSCI). Specifically, MTSCI employs a contrastive complementary mask to generate dual views during the forward noising process. Then, the intra contrastive loss is calculated to ensure intra-consistency between the imputed and observed values. Meanwhile, MTSCI utilizes a mixup mechanism to incorporate conditional information from adjacent windows during the denoising process, facilitating the inter-consistency between imputed samples. Extensive experiments on multiple real-world datasets demonstrate that our method achieves state-of-the-art performance on the multivariate time series imputation task under different missing scenarios. Code is available at this https URL.

Title: An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set

Authors: Chaoyi Ai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05772
Pdf URL: https://arxiv.org/pdf/2408.05772
Copy Paste: [[2408.05772]] An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set(https://arxiv.org/abs/2408.05772)
Keywords: foundation model
Abstract: Human-Object Interaction (HOI) aims to identify the pairs of humans and objects in images and to recognize their relationships, ultimately forming $\langle human, object, verb \rangle$ triplets. Under default settings, HOI performance is nearly saturated, with many studies focusing on long-tail distribution and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if there is only a test dataset without a training dataset, using a multimodal visual foundation model in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We reach some interesting conclusions and find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.

Title: Efficient Test-Time Prompt Tuning for Vision-Language Models

Authors: Yuhan Zhu, Guozhen Zhang, Chen Xu, Haocheng Shen, Xiaoxin Chen, Gangshan Wu, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05775
Pdf URL: https://arxiv.org/pdf/2408.05775
Copy Paste: [[2408.05775]] Efficient Test-Time Prompt Tuning for Vision-Language Models(https://arxiv.org/abs/2408.05775)
Keywords: self-supervised
Abstract: Vision-language models have showcased impressive zero-shot classification capabilities when equipped with suitable text prompts. Previous studies have shown the effectiveness of test-time prompt tuning; however, these methods typically require per-image prompt adaptation during inference, which incurs high computational budgets and limits scalability and practical deployment. To overcome this issue, we introduce Self-TPT, a novel framework leveraging Self-supervised learning for efficient Test-time Prompt Tuning. The key aspect of Self-TPT is that it turns to efficient predefined class adaptation via self-supervised learning, thus avoiding computation-heavy per-image adaptation at inference. Self-TPT begins by co-training the self-supervised and the classification task using source data, then applies the self-supervised task exclusively for test-time new class adaptation. Specifically, we propose Contrastive Prompt Learning (CPT) as the key task for self-supervision. CPT is designed to minimize the intra-class distances while enhancing inter-class distinguishability via contrastive learning. Furthermore, empirical evidence suggests that CPT could closely mimic back-propagated gradients of the classification task, offering a plausible explanation for its effectiveness. Motivated by this finding, we further introduce a gradient matching loss to explicitly enhance the gradient similarity. We evaluated Self-TPT across three challenging zero-shot benchmarks. The results consistently demonstrate that Self-TPT not only significantly reduces inference costs but also achieves state-of-the-art performance, effectively balancing the efficiency-efficacy trade-off.

Title: Egocentric Vision Language Planning

Authors: Zhirui Fang, Ming Yang, Weishuai Zeng, Boyu Li, Junpeng Yue, Ziluo Ding, Xiu Li, Zongqing Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05802
Pdf URL: https://arxiv.org/pdf/2408.05802
Copy Paste: [[2408.05802]] Egocentric Vision Language Planning(https://arxiv.org/abs/2408.05802)
Keywords: diffusion
Abstract: We explore leveraging large multi-modal models (LMMs) and text2image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This model leverages a diffusion model to simulate the fundamental dynamics between states and actions, integrating techniques like style transfer and optical flow to enhance generalization across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.

Title: HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training

Authors: Fenghe Tang, Ronghao Xu, Qingsong Yao, Xueming Fu, Quan Quan, Heqin Zhu, Zaiyi Liu, S. Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05815
Pdf URL: https://arxiv.org/pdf/2408.05815
Copy Paste: [[2408.05815]] HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training(https://arxiv.org/abs/2408.05815)
Keywords: self-supervised, generative
Abstract: The generative self-supervised learning strategy exhibits remarkable representation learning capabilities. However, there is limited attention to end-to-end pre-training methods based on a hybrid architecture of CNN and Transformer, which can learn strong local and global representations simultaneously. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse masKing (HySparK) based on masked image modeling and apply it to large-scale pre-training on medical images. First, we perform a bottom-up 3D hybrid masking strategy on the encoder to keep the masking consistent. Then we utilize sparse convolution for the top CNNs and encode unmasked patches for the bottom vision Transformers. Second, we employ a simple hierarchical decoder with skip-connections to achieve dense multi-scale feature reconstruction. Third, we implement our pre-training method on a collection of multiple large-scale 3D medical imaging datasets. Extensive experiments indicate that our proposed pre-training strategy demonstrates robust transferability in supervised downstream tasks and sheds light on HySparK's promising prospects. The code is available at this https URL

Title: LaWa: Using Latent Space for In-Generation Image Watermarking

Authors: Ahmad Rezaei, Mohammad Akbari, Saeed Ranjbar Alvar, Arezou Fatemi, Yong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05868
Pdf URL: https://arxiv.org/pdf/2408.05868
Copy Paste: [[2408.05868]] LaWa: Using Latent Space for In-Generation Image Watermarking(https://arxiv.org/abs/2408.05868)
Keywords: diffusion, generative
Abstract: With generative models producing high quality images that are indistinguishable from real ones, there is growing concern regarding the malicious usage of AI-generated images. Imperceptible image watermarking is one viable solution towards such concerns. Prior watermarking methods map the image to a latent space for adding the watermark. Moreover, Latent Diffusion Models (LDM) generate the image in the latent space of a pre-trained autoencoder. We argue that this latent space can be used to integrate watermarking into the generation process. To this end, we present LaWa, an in-generation image watermarking method designed for LDMs. By using coarse-to-fine watermark embedding modules, LaWa modifies the latent space of pre-trained autoencoders and achieves high robustness against a wide range of image transformations while preserving perceptual quality of the image. We show that LaWa can also be used as a general image watermarking method. Through extensive experiments, we demonstrate that LaWa outperforms previous works in perceptual quality, robustness against attacks, and computational complexity, while having very low false positive rate. Code is available here.

Title: LLM-Based Robust Product Classification in Commerce and Compliance

Authors: Sina Gholamian, Gianfranco Romani, Bartosz Rudnikowicz, Laura Skylaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.05874
Pdf URL: https://arxiv.org/pdf/2408.05874
Copy Paste: [[2408.05874]] LLM-Based Robust Product Classification in Commerce and Compliance(https://arxiv.org/abs/2408.05874)
Keywords: generative, in-context
Abstract: Product classification is a crucial task in international trade, as compliance regulations are verified and taxes and duties are applied based on product categories. Manual classification of products is time-consuming and error-prone, and the sheer volume of products imported and exported renders the manual process infeasible. Consequently, e-commerce platforms and enterprises involved in international trade have turned to automatic product classification using machine learning. However, current approaches do not consider the real-world challenges associated with product classification, such as very abbreviated and incomplete product descriptions. In addition, recent advancements in generative Large Language Models (LLMs) and their reasoning capabilities are mainly untapped in product classification and e-commerce. In this research, we explore the real-life challenges of industrial classification and we propose data perturbations that allow for realistic data simulation. Furthermore, we employ LLM-based product classification to improve the robustness of the prediction in the presence of incomplete data. Our research shows that LLMs with in-context learning outperform the supervised approaches in the clean-data scenario. Additionally, we illustrate that LLMs are significantly more robust than the supervised approaches when data attacks are present.

Title: GFlowNet Training by Policy Gradients

Authors: Puhua Niu, Shili Wu, Mingzhou Fan, Xiaoning Qian
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2408.05885
Pdf URL: https://arxiv.org/pdf/2408.05885
Copy Paste: [[2408.05885]] GFlowNet Training by Policy Gradients(https://arxiv.org/abs/2408.05885)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) have been shown to be effective at generating combinatorial objects with desired properties. We here propose a new GFlowNet training framework, with policy-dependent rewards, that bridges keeping the flow balance of GFlowNets and optimizing the expected accumulated reward in traditional Reinforcement Learning (RL). This enables the derivation of new policy-based GFlowNet training methods, in contrast to existing ones resembling value-based RL. It is known that the design of backward policies in GFlowNet training affects efficiency. We further develop a coupled training strategy that jointly solves GFlowNet forward policy training and backward policy design. Performance analysis is provided with a theoretical guarantee of our policy-based GFlowNet training. Experiments on both simulated and real-world datasets verify that our policy-based strategies provide advanced RL perspectives for robust gradient estimation to improve GFlowNet performance.

Title: Classifier Guidance Enhances Diffusion-based Adversarial Purification by Preserving Predictive Information

Authors: Mingkun Zhang, Jianing Li, Wei Chen, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05900
Pdf URL: https://arxiv.org/pdf/2408.05900
Copy Paste: [[2408.05900]] Classifier Guidance Enhances Diffusion-based Adversarial Purification by Preserving Predictive Information(https://arxiv.org/abs/2408.05900)
Keywords: diffusion
Abstract: Adversarial purification is one of the promising approaches to defend neural networks against adversarial attacks. Recently, methods utilizing diffusion probabilistic models have achieved great success for adversarial purification in image classification tasks. However, such methods fall into the dilemma of balancing the needs for noise removal and information preservation. This paper points out that existing adversarial purification methods based on diffusion models gradually lose sample information during the core denoising process, causing occasional label shift in subsequent classification tasks. As a remedy, we suggest suppressing such information loss by introducing guidance from the classifier confidence. Specifically, we propose the Classifier-cOnfidence gUided Purification (COUP) algorithm, which purifies adversarial examples while keeping away from the classifier decision boundary. Experimental results show that COUP can achieve better adversarial robustness under strong attack methods.

Title: HcNet: Image Modeling with Heat Conduction Equation

Authors: Zhemin Zhang, Xun Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05901
Pdf URL: https://arxiv.org/pdf/2408.05901
Copy Paste: [[2408.05901]] HcNet: Image Modeling with Heat Conduction Equation(https://arxiv.org/abs/2408.05901)
Keywords: diffusion, foundation model
Abstract: Foundation models, such as CNNs and ViTs, have powered the development of image modeling. However, general guidance to model architecture design is still missing. The design of many modern model architectures, such as residual structures, multiplicative gating signal, and feed-forward networks, can be interpreted in terms of the heat conduction equation. This finding inspired us to model images by the heat conduction equation, where the essential idea is to conceptualize image features as temperatures and model their information interaction as the diffusion of thermal energy. We can take advantage of the rich knowledge in the heat conduction equation to guide us in designing new and more interpretable models. As an example, we propose Heat Conduction Layer and Refine Approximation Layer inspired by solving the heat conduction equation using Finite Difference Method and Fourier series, respectively. This paper does not aim to present a state-of-the-art model; instead, it seeks to integrate the overall architectural design of the model into the heat conduction theory framework. Nevertheless, our Heat Conduction Network (HcNet) still shows competitive performance. Code available at this https URL.
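For readers unfamiliar with the Finite Difference Method mentioned above, the following sketch shows one explicit time step of the one-dimensional heat equation u_t = alpha * u_xx, with feature values standing in for temperatures. It is the textbook scheme, not HcNet's Heat Conduction Layer; the function name `heat_step`, the fixed boundary values, and the parameter r = alpha * dt / dx^2 are assumptions made for illustration.

```python
from typing import List


def heat_step(u: List[float], r: float) -> List[float]:
    """Advance a 1D temperature profile by one explicit finite-difference step.

    Interior update: u_i <- u_i + r * (u_{i+1} - 2 * u_i + u_{i-1}),
    where r = alpha * dt / dx**2. The two end points are held fixed.
    The explicit scheme is only stable for r <= 0.5.
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("r must lie in (0, 0.5] for the explicit scheme to be stable")
    nxt = list(u)
    for i in range(1, len(u) - 1):
        nxt[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
    return nxt


# A single hot "feature" in the middle spreads to its neighbours.
profile = heat_step([0.0, 0.0, 1.0, 0.0, 0.0], r=0.25)
```

Each step mixes a value with its immediate neighbours, which is the sense in which information interaction between features can be read as diffusion of thermal energy.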

Title: Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts

Authors: Peng Wu, Xuerong Zhou, Guansong Pang, Zhiwei Yang, Qingsen Yan, Peng Wang, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.05905
Pdf URL: https://arxiv.org/pdf/2408.05905
Copy Paste: [[2408.05905]] Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts(https://arxiv.org/abs/2408.05905)
Keywords: anomaly
Abstract: The current weakly supervised video anomaly detection (WSVAD) task aims to achieve frame-level anomalous event detection with only coarse video-level annotations available. Existing works typically involve extracting global features from full-resolution video frames and training frame-level classifiers to detect anomalies in the temporal dimension. However, most anomalous events tend to occur in localized spatial regions rather than the entire video frames, which implies existing frame-level feature based works may be misled by the dominant background information and lack the interpretation of the detected anomalies. To address this dilemma, this paper introduces a novel method called STPrompt that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs). Our proposed method employs a two-stream network structure, with one stream focusing on the temporal dimension and the other primarily on the spatial dimension. By leveraging the learned knowledge from pre-trained VLMs and incorporating natural motion priors from raw videos, our model learns prompt embeddings that are aligned with spatio-temporal regions of videos (e.g., patches of individual frames) to identify specific local regions of anomalies, enabling accurate video anomaly detection while mitigating the influence of background information. Without relying on detailed spatio-temporal annotations or auxiliary object detection/tracking, our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.

Title: A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models

Authors: Taehong Moon, Moonseok Choi, EungGu Yun, Jongmin Yoon, Gayoung Lee, Jaewoong Cho, Juho Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05927
Pdf URL: https://arxiv.org/pdf/2408.05927
Copy Paste: [[2408.05927]] A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models(https://arxiv.org/abs/2408.05927)
Keywords: diffusion
Abstract: Diffusion models have shown remarkable performance in generation problems over various domains including images, videos, text, and audio. A practical bottleneck of diffusion models is their sampling speed, due to the repeated evaluation of score estimation networks during inference. In this work, we propose a novel framework capable of adaptively allocating compute required for the score estimation, thereby reducing the overall sampling time of diffusion models. We observe that the amount of computation required for the score estimation may vary along the time step for which the score is estimated. Based on this observation, we propose an early-exiting scheme, where we skip a subset of parameters in the score estimation network during inference, based on a time-dependent exit schedule. Using the diffusion models for image synthesis, we show that our method could significantly improve the sampling throughput of the diffusion models without compromising image quality. Furthermore, we also demonstrate that our method seamlessly integrates with various types of solvers for faster sampling, capitalizing on their compatibility to enhance overall efficiency. The source code and our experiments are available at this https URL

Title: Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation

Authors: Utkarsh Nath, Rajeev Goel, Eun Som Jeon, Changhoon Kim, Kyle Min, Yezhou Yang, Yingzhen Yang, Pavan Turaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05938
Pdf URL: https://arxiv.org/pdf/2408.05938
Copy Paste: [[2408.05938]] Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation(https://arxiv.org/abs/2408.05938)
Keywords: diffusion, generative
Abstract: To address the data scarcity associated with 3D assets, 2D-lifting techniques such as Score Distillation Sampling (SDS) have become a widely adopted practice in text-to-3D generation pipelines. However, the diffusion models used in these techniques are prone to viewpoint bias and thus lead to geometric inconsistencies such as the Janus problem. To counter this, we introduce MT3D, a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias and explicitly infuse geometric understanding into the generation pipeline. Firstly, we employ depth maps derived from a high-quality 3D model as control signals to guarantee that the generated 2D images preserve the fundamental shape and structure, thereby reducing the inherent viewpoint bias. Next, we utilize deep geometric moments to ensure geometric consistency in the 3D representation explicitly. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects, thereby improving the quality and usability of our 3D representations.

Title: UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization

Authors: Junjie He, Yifeng Geng, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05939
Pdf URL: https://arxiv.org/pdf/2408.05939
Copy Paste: [[2408.05939]] UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization(https://arxiv.org/abs/2408.05939)
Keywords: diffusion, generative
Abstract: This paper presents UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-form input description, and diverse layout generation. UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module. The ID embedding module extracts versatile editable facial features with a decoupling strategy for each ID and embeds them into the context space of diffusion models. The ID routing module then combines and distributes these embeddings adaptively to their respective regions within the synthesized image, achieving the customization of single and multiple IDs. With a carefully designed two-stage training scheme, UniPortrait achieves superior performance in both single- and multi-ID customization. Quantitative and qualitative experiments demonstrate the advantages of our method over existing approaches as well as its good scalability, e.g., the universal compatibility with existing generative control tools. The project page is at this https URL.

Title: Freehand Sketch Generation from Mechanical Components

Authors: Zhichao Liao, Di Huang, Heming Fang, Yue Ma, Fengyuan Piao, Xinghui Li, Long Zeng, Pingfa Feng
Subjects: cs.CV, cs.AI, cs.GR, cs.MM
Abstract URL: https://arxiv.org/abs/2408.05966
Pdf URL: https://arxiv.org/pdf/2408.05966
Copy Paste: [[2408.05966]] Freehand Sketch Generation from Mechanical Components(https://arxiv.org/abs/2408.05966)
Keywords: generative
Abstract: Drawing freehand sketches of mechanical components on multimedia devices for AI-based engineering modeling has become a new trend. However, its development is being impeded because existing works cannot produce suitable sketches for data-driven research. These works either generate sketches lacking a freehand style or utilize generative models not originally designed for this task, resulting in poor effectiveness. To address this issue, we design a two-stage generative framework mimicking the human sketching behavior pattern, called MSFormer, which is the first to produce human-like freehand sketches tailored for mechanical components. The first stage employs Open CASCADE technology to obtain multi-view contour sketches from mechanical components, filtering perturbing signals for the ensuing generation process. Meanwhile, we design a view selector to simulate viewpoint selection tasks during human sketching for picking out information-rich sketches. The second stage translates contour sketches into freehand sketches by a transformer-based generator. To retain essential modeling features as much as possible and rationalize stroke distribution, we introduce a novel edge-constraint stroke initialization. Furthermore, we utilize a CLIP vision encoder and a new loss function incorporating the Hausdorff distance to enhance the generalizability and robustness of the model. Extensive experiments demonstrate that our approach achieves state-of-the-art performance for generating freehand sketches in the mechanical domain. Project page: this https URL.

Title: Unseen No More: Unlocking the Potential of CLIP for Generative Zero-shot HOI Detection

Authors: Yixin Guo, Yu Liu, Jianghao Li, Weimin Wang, Qi Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05974
Pdf URL: https://arxiv.org/pdf/2408.05974
Copy Paste: [[2408.05974]] Unseen No More: Unlocking the Potential of CLIP for Generative Zero-shot HOI Detection(https://arxiv.org/abs/2408.05974)
Keywords: generative
Abstract: A zero-shot human-object interaction (HOI) detector is capable of generalizing to HOI categories not encountered during training. Inspired by the impressive zero-shot capabilities offered by CLIP, the latest methods strive to leverage CLIP embeddings for improving zero-shot HOI detection. However, these embedding-based methods train the classifier on seen classes only, inevitably resulting in seen-unseen confusion for the model during inference. Besides, we find that using prompt-tuning and adapters further increases the gap between seen and unseen accuracy. To tackle this challenge, we present the first generation-based model using CLIP for zero-shot HOI detection, coined HOIGen. It unlocks the potential of CLIP for feature generation instead of feature extraction only. To achieve this, we develop a CLIP-injected feature generator in accordance with the generation of human, object and union features. Then, we extract realistic features of seen samples and mix them with synthetic features, allowing the model to train seen and unseen classes jointly. To enrich the HOI scores, we construct a generative prototype bank in a pairwise HOI recognition branch, and a multi-knowledge prototype bank in an image-wise HOI recognition branch, respectively. Extensive experiments on the HICO-DET benchmark demonstrate that our HOIGen achieves superior performance for both seen and unseen classes under various zero-shot settings, compared with other top-performing methods. Code is available at: this https URL

Title: Diffuse-UDA: Addressing Unsupervised Domain Adaptation in Medical Image Segmentation with Appearance and Structure Aligned Diffusion Models

Authors: Haifan Gong, Yitao Wang, Yihan Wang, Jiashun Xiao, Xiang Wan, Haofeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.05985
Pdf URL: https://arxiv.org/pdf/2408.05985
Copy Paste: [[2408.05985]] Diffuse-UDA: Addressing Unsupervised Domain Adaptation in Medical Image Segmentation with Appearance and Structure Aligned Diffusion Models(https://arxiv.org/abs/2408.05985)
Keywords: diffusion
Abstract: The scarcity and complexity of voxel-level annotations in 3D medical imaging present significant challenges, particularly due to the domain gap between labeled datasets from well-resourced centers and unlabeled datasets from less-resourced centers. This disparity affects the fairness of artificial intelligence algorithms in healthcare. We introduce Diffuse-UDA, a novel method leveraging diffusion models to tackle Unsupervised Domain Adaptation (UDA) in medical image segmentation. Diffuse-UDA generates high-quality image-mask pairs with target domain characteristics and various structures, thereby enhancing UDA tasks. Initially, pseudo labels for target domain samples are generated. Subsequently, a specially tailored diffusion model, incorporating deformable augmentations, is trained on image-label or image-pseudo-label pairs from both domains. Finally, source domain labels guide the diffusion model to generate image-label pairs for the target domain. Comprehensive evaluations on several benchmarks demonstrate that Diffuse-UDA outperforms leading UDA and semi-supervised strategies, achieving performance close to or even surpassing the theoretical upper bound of models trained directly on target domain data. Diffuse-UDA offers a pathway to advance the development and deployment of AI systems in medical imaging, addressing disparities between healthcare environments. This approach enables the exploration of innovative AI-driven diagnostic tools, improves outcomes, saves time, and reduces human error.

Title: An Analysis for Image-to-Image Translation and Style Transfer

Authors: Xiaoming Yu, Jie Tian, Zhenhua Hu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.06000
Pdf URL: https://arxiv.org/pdf/2408.06000
Copy Paste: [[2408.06000]] An Analysis for Image-to-Image Translation and Style Transfer(https://arxiv.org/abs/2408.06000)
Keywords: generative
Abstract: With the development of generative technologies in deep learning, a large number of image-to-image translation and style transfer models have emerged at an explosive rate in recent years. These two technologies have made significant progress and can generate realistic images. However, many communities tend to confuse the two, because both generate the desired image based on the input image and both cover the two definitions of content and style. In fact, there are indeed significant differences between the two, and there is currently a lack of clear explanations to distinguish the two technologies, which is not conducive to the advancement of technology. We hope to serve the entire community by introducing the differences and connections between image-to-image translation and style transfer. The entire discussion process involves the concepts, forms, training modes, evaluation processes, and visualization results of the two technologies. Finally, we conclude that image-to-image translation divides images by domain, and the types of images in the domain are limited, and the scope involved is small, but the conversion ability is strong and can achieve strong semantic changes. Style transfer divides image types by single image, and the scope involved is large, but the transfer ability is limited, and it transfers more texture and color of the image.

Title: BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training

Authors: Xuanpu Zhang, Dan Song, Pengxin Zhan, Qingguo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Anan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06047
Pdf URL: https://arxiv.org/pdf/2408.06047
Copy Paste: [[2408.06047]] BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training(https://arxiv.org/abs/2408.06047)
Keywords: diffusion
Abstract: Image-based virtual try-on is an increasingly popular and important task to generate realistic try-on images of a specific person. Existing methods always employ an accurate mask to remove the original garment in the source image, thus achieving realistic synthesized images in simple and conventional try-on scenarios based on a powerful diffusion model. Therefore, acquiring a suitable mask is vital to the try-on performance of these methods. However, obtaining precise inpainting masks, especially for complex wild try-on data containing diverse foreground occlusions and person poses, is not easy, as Figure 1-Top shows. This difficulty often results in poor performance in more practical and challenging real-life scenarios, such as the selfie scene shown in Figure 1-Bottom. To this end, we propose a novel training paradigm combined with an efficient data augmentation method to acquire large-scale unpaired training data from wild scenarios, thereby significantly facilitating the try-on performance of our model without the need for additional inpainting masks. Besides, a try-on localization loss is designed to localize a more accurate try-on area to obtain more reasonable try-on results. It is noted that our method only needs the reference cloth image, source pose image and source person image as input, which is more cost-effective and user-friendly compared to existing methods. Extensive qualitative and quantitative experiments have demonstrated superior performance in wild scenarios with such a low-demand input.

Title: What Ails Generative Structure-based Drug Design: Too Little or Too Much Expressivity?

Authors: Rafał Karczewski, Samuel Kaski, Markus Heinonen, Vikas Garg
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.06050
Pdf URL: https://arxiv.org/pdf/2408.06050
Copy Paste: [[2408.06050]] What Ails Generative Structure-based Drug Design: Too Little or Too Much Expressivity?(https://arxiv.org/abs/2408.06050)
Keywords: generative
Abstract: Several generative models with elaborate training and sampling procedures have been proposed recently to accelerate structure-based drug design (SBDD); however, perplexingly, their empirical performance turns out to be suboptimal. We seek to better understand this phenomenon from both theoretical and empirical perspectives. Since most of these models apply graph neural networks (GNNs), one may suspect that they inherit the representational limitations of GNNs. We analyze this aspect, establishing the first such results for protein-ligand complexes. A plausible counterview may attribute the underperformance of these models to their excessive parameterizations, inducing expressivity at the expense of generalization. We also investigate this possibility with a simple metric-aware approach that learns an economical surrogate for affinity to infer an unlabelled molecular graph and optimizes for labels conditioned on this graph and molecular properties. The resulting model achieves state-of-the-art results using 100x fewer trainable parameters and affords up to 1000x speedup. Collectively, our findings underscore the need to reassess and redirect the existing paradigm and efforts for SBDD.

Title: ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Authors: Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06070
Pdf URL: https://arxiv.org/pdf/2408.06070
Copy Paste: [[2408.06070]] ControlNeXt: Powerful and Efficient Control for Image and Video Generation(https://arxiv.org/abs/2408.06070)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers introduce additional architectures, such as ControlNet, Adapters and ReferenceNet, to integrate conditioning controls. However, current controllable generation methods often require substantial additional computational resources, especially for video generation, and face challenges in training or exhibit weak control. In this paper, we propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a more straightforward and efficient architecture, replacing heavy additional branches with minimal additional cost compared to the base model. Such a concise structure also allows our method to seamlessly integrate with other LoRA weights, enabling style alteration without the need for additional training. As for training, we reduce up to 90% of learnable parameters compared to the alternatives. Furthermore, we propose another method called Cross Normalization (CN) as a replacement for Zero-Convolution to achieve fast and stable training convergence. We have conducted various experiments with different base models across images and videos, demonstrating the robustness of our method.

Title: CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06072
Pdf URL: https://arxiv.org/pdf/2408.06072
Copy Paste: [[2408.06072]] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer(https://arxiv.org/abs/2408.06072)
Keywords: diffusion
Abstract: We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at this https URL.

Title: Building Decision Making Models Through Language Model Regime

Authors: Yu Zhang, Haoxiang Liu, Feijun Jiang, Weihua Luo, Kaifu Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.06087
Pdf URL: https://arxiv.org/pdf/2408.06087
Copy Paste: [[2408.06087]] Building Decision Making Models Through Language Model Regime(https://arxiv.org/abs/2408.06087)
Keywords: foundation model
Abstract: We propose a novel approach for decision making problems leveraging the generalization capabilities of large language models (LLMs). Traditional methods such as expert systems, planning algorithms, and reinforcement learning often exhibit limited generalization, typically requiring the training of new models for each unique task. In contrast, LLMs demonstrate remarkable success in generalizing across varied language tasks, inspiring a new strategy for training decision making models. Our approach, referred to as "Learning then Using" (LTU), entails a two-stage process. Initially, the learning phase develops a robust foundational decision making model by integrating diverse knowledge from various domains and decision making contexts. The subsequent using phase refines this foundation model for specific decision making scenarios. Distinct from other studies that employ LLMs for decision making through supervised learning, our LTU method embraces a versatile training methodology that combines broad pre-training with targeted fine-tuning. Experiments in e-commerce domains such as advertising and search optimization have shown that the LTU approach outperforms traditional supervised learning regimes in decision making capabilities and generalization. The LTU approach is the first practical training architecture for both single-step and multi-step decision making tasks combined with LLMs, which can be applied beyond game and robot domains. It provides a robust and adaptable framework for decision making and enhances the effectiveness and flexibility of various systems in tackling various challenges.

Title: A Methodological Report on Anomaly Detection on Dynamic Knowledge Graphs

Authors: Xiaohua Lu, Leshanshui Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.06121
Pdf URL: https://arxiv.org/pdf/2408.06121
Copy Paste: [[2408.06121]] A Methodological Report on Anomaly Detection on Dynamic Knowledge Graphs(https://arxiv.org/abs/2408.06121)
Keywords: anomaly
Abstract: In this paper, we explore different approaches to anomaly detection on dynamic knowledge graphs, specifically in a microservices environment for Kubernetes applications. Our approach explores three dynamic knowledge graph representations: sequential data, one-hop graph structure, and two-hop graph structure, with each representation incorporating increasingly complex structural information. Each phase includes different machine learning and deep learning models. We empirically analyse their performance and propose an approach based on ensemble learning of these models. Our approach significantly outperforms the baseline on the ISWC 2024 Dynamic Knowledge Graph Anomaly Detection dataset, providing a robust solution for anomaly detection in dynamic complex data.

Title: Efficient and Scalable Point Cloud Generation with Sparse Point-Voxel Diffusion Models

Authors: Ioannis Romanelis, Vlassios Fotis, Athanasios Kalogeras, Christos Alexakos, Konstantinos Moustakas, Adrian Munteanu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06145
Pdf URL: https://arxiv.org/pdf/2408.06145
Copy Paste: [[2408.06145]] Efficient and Scalable Point Cloud Generation with Sparse Point-Voxel Diffusion Models(https://arxiv.org/abs/2408.06145)
Keywords: diffusion, generative
Abstract: We propose a novel point cloud U-Net diffusion architecture for 3D generative modeling capable of generating high-quality and diverse 3D shapes while maintaining fast generation times. Our network employs a dual-branch architecture, combining the high-resolution representations of points with the computational efficiency of sparse voxels. Our fastest variant outperforms all non-diffusion generative approaches on unconditional shape generation, the most popular benchmark for evaluating point cloud generative models, while our largest model achieves state-of-the-art results among diffusion methods, with a runtime approximately 70% of the previously state-of-the-art PVD. Beyond unconditional generation, we perform extensive evaluations, including conditional generation on all categories of ShapeNet, demonstrating the scalability of our model to larger datasets, and implicit generation which allows our network to produce high quality point clouds on fewer timesteps, further decreasing the generation time. Finally, we evaluate the architecture's performance in point cloud completion and super-resolution. Our model excels in all tasks, establishing it as a state-of-the-art diffusion U-Net for point cloud generative modeling. The code is publicly available at this https URL.

Title: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Authors: Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06157
Pdf URL: https://arxiv.org/pdf/2408.06157
Copy Paste: [[2408.06157]] Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance(https://arxiv.org/abs/2408.06157)
Keywords: diffusion
Abstract: Recent 3D novel view synthesis (NVS) methods are limited to single-object-centric scenes generated from new viewpoints and struggle with complex environments. They often require extensive 3D data for training, lacking generalization beyond training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without tedious fine-tuning, but lack camera control. In this paper, we introduce HawkI++, a method capable of generating camera-controlled viewpoints from a single input image. HawkI++ excels in handling complex and diverse scenes without additional 3D data or extensive training. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results efficiently. Our experimental results demonstrate that HawkI++ outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at desired camera angles across a wide variety of scenes.

Title: FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework

Authors: Lukas Meyer, Andreas Gilson, Ute Schmidt, Marc Stamminger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06190
Pdf URL: https://arxiv.org/pdf/2408.06190
Copy Paste: [[2408.06190]] FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework(https://arxiv.org/abs/2408.06190)
Keywords: foundation model
Abstract: We introduce FruitNeRF, a unified novel fruit counting framework that leverages state-of-the-art view synthesis methods to count any fruit type directly in 3D. Our framework takes an unordered set of posed images captured by a monocular camera and segments fruit in each image. To make our system independent of the fruit type, we employ a foundation model that generates binary segmentation masks for any fruit. Utilizing both modalities, RGB and semantic, we train a semantic neural radiance field. Through uniform volume sampling of the implicit Fruit Field, we obtain fruit-only point clouds. By applying cascaded clustering on the extracted point cloud, our approach achieves precise fruit count. The use of neural radiance fields provides significant advantages over conventional methods such as object tracking or optical flow, as the counting itself is lifted into 3D. Our method prevents double counting fruit and avoids counting irrelevant fruit. We evaluate our methodology using both real-world and synthetic datasets. The real-world dataset consists of three apple trees with manually counted ground truths, a benchmark apple dataset with one row and ground truth fruit location, while the synthetic dataset comprises various fruit types including apple, plum, lemon, pear, peach, and mango. Additionally, we assess the performance of fruit counting using the foundation model compared to a U-Net.

Title: Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images

Authors: Siladittya Manna, Saumik Bhattacharya, Umapada Pal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06235
Pdf URL: https://arxiv.org/pdf/2408.06235
Copy Paste: [[2408.06235]] Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images(https://arxiv.org/abs/2408.06235)
Keywords: self-supervised
Abstract: Medical image segmentation is one of the domains where sufficient annotated data is not available. This necessitates the application of low-data frameworks like few-shot learning. Contemporary prototype-based frameworks often do not account for the variation in features within the support and query images, giving rise to a large variance in prototype alignment. In this work, we adopt a prototype-based self-supervised one-way one-shot learning framework using pseudo-labels generated from superpixels to learn the semantic segmentation task itself. We use a correlation-based probability score to generate a dynamic prototype for each query pixel from the bag of prototypes obtained from the support feature map. This weighting scheme helps to give a higher weightage to contextually related prototypes. We also propose a quadrant masking strategy in the downstream segmentation task by utilizing prior domain information to discard unwanted false positives. We present extensive experimentations and evaluations on abdominal CT and MR datasets to show that the proposed simple but potent framework performs at par with the state-of-the-art methods.

Title: 3D Reconstruction of Protein Structures from Multi-view AFM Images using Neural Radiance Fields (NeRFs)

Authors: Jaydeep Rade, Ethan Herron, Soumik Sarkar, Anwesha Sarkar, Adarsh Krishnamurthy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.06244
Pdf URL: https://arxiv.org/pdf/2408.06244
Copy Paste: [[2408.06244]] 3D Reconstruction of Protein Structures from Multi-view AFM Images using Neural Radiance Fields (NeRFs)(https://arxiv.org/abs/2408.06244)
Keywords: diffusion
Abstract: Recent advancements in deep learning for predicting 3D protein structures have shown promise, particularly when leveraging inputs like protein sequences and Cryo-Electron microscopy (Cryo-EM) images. However, these techniques often fall short when predicting the structures of protein complexes (PCs), which involve multiple proteins. In our study, we investigate using atomic force microscopy (AFM) combined with deep learning to predict the 3D structures of PCs. AFM generates height maps that depict the PCs in various random orientations, providing rich information for training a neural network to predict the 3D structures. We then employ the pre-trained UpFusion model (which utilizes a conditional diffusion model for synthesizing novel views) to train an instance-specific NeRF model for 3D reconstruction. The performance of UpFusion is evaluated through zero-shot predictions of 3D protein structures using AFM images. The challenge, however, lies in the time-intensive and impractical nature of collecting actual AFM images. To address this, we use a virtual AFM imaging process that transforms a 'PDB' protein file into multi-view 2D virtual AFM images via volume rendering techniques. We extensively validate the UpFusion architecture using both virtual and actual multi-view AFM images. Our results include a comparison of structures predicted with varying numbers of views and different sets of views. This novel approach holds significant potential for enhancing the accuracy of protein complex structure predictions with further fine-tuning of the UpFusion network.

Title: Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

Authors: Yingjin Song, Denis Paperno, Albert Gatt
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2408.06259
Pdf URL: https://arxiv.org/pdf/2408.06259
Copy Paste: [[2408.06259]] Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning(https://arxiv.org/abs/2408.06259)
Keywords: foundation model
Abstract: Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.

Title: Open-Source Molecular Processing Pipeline for Generating Molecules

Authors: Shreyas V, Jose Siguenza, Karan Bania, Bharath Ramsundar
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2408.06261
Pdf URL: https://arxiv.org/pdf/2408.06261
Copy Paste: [[2408.06261]] Open-Source Molecular Processing Pipeline for Generating Molecules(https://arxiv.org/abs/2408.06261)
Keywords: generative
Abstract: Generative models for molecules have shown considerable promise for use in computational chemistry, but remain difficult to use for non-experts. For this reason, we introduce open-source infrastructure for easily building generative molecular models into the widely used DeepChem [Ramsundar et al., 2019] library with the aim of creating a robust and reusable molecular generation pipeline. In particular, we add high quality PyTorch [Paszke et al., 2019] implementations of the Molecular Generative Adversarial Networks (MolGAN) [Cao and Kipf, 2022] and Normalizing Flows [Papamakarios et al., 2021]. Our implementations show strong performance comparable with past work [Kuznetsov and Polykovskiy, 2021, Cao and Kipf, 2022].

Title: DUNE: A Machine Learning Deep UNet++ based Ensemble Approach to Monthly, Seasonal and Annual Climate Forecasting

Authors: Pratik Shukla, Milton Halem
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2408.06262
Pdf URL: https://arxiv.org/pdf/2408.06262
Copy Paste: [[2408.06262]] DUNE: A Machine Learning Deep UNet++ based Ensemble Approach to Monthly, Seasonal and Annual Climate Forecasting(https://arxiv.org/abs/2408.06262)
Keywords: anomaly
Abstract: Capitalizing on the recent availability of ERA5 monthly averaged long-term data records of mean atmospheric and climate fields based on high-resolution reanalysis, deep-learning architectures offer an alternative to physics-based daily numerical weather predictions for subseasonal to seasonal (S2S) and annual means. A novel Deep UNet++-based Ensemble (DUNE) neural architecture is introduced, employing multi-encoder-decoder structures with residual blocks. When initialized from a prior month or year, this architecture produced the first AI-based global monthly, seasonal, or annual mean forecast of 2-meter temperatures (T2m) and sea surface temperatures (SST). ERA5 monthly mean data is used as input for T2m over land, SST over oceans, and solar radiation at the top of the atmosphere for each month of 40 years to train the model. Validation forecasts are performed for an additional two years, followed by five years of forecast evaluations to account for natural annual variability. AI-trained inference forecast weights generate forecasts in seconds, enabling ensemble seasonal forecasts. Root Mean Squared Error (RMSE), Anomaly Correlation Coefficient (ACC), and Heidke Skill Score (HSS) statistics are presented globally and over specific regions. These forecasts outperform persistence, climatology, and multiple linear regression for all domains. DUNE forecasts demonstrate comparable statistical accuracy to NOAA's operational monthly and seasonal probabilistic outlook forecasts over the US but at significantly higher resolutions. RMSE and ACC error statistics for other recent AI-based daily forecasts also show superior performance for DUNE-based forecasts. The DUNE model's application to an ensemble data assimilation cycle shows comparable forecast accuracy with a single high-resolution model, potentially eliminating the need for retraining on extrapolated datasets.
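
Note: the abstract above reports RMSE and Anomaly Correlation Coefficient (ACC) statistics without stating their formulas. The following is a minimal sketch of how these two scores are commonly computed, assuming the uncentered form of ACC (anomalies taken relative to a climatological mean, no further mean removal). The function names and the toy numbers are illustrative assumptions, not taken from the paper, and the paper's own evaluation may differ (for example, by using latitude weighting or a centered ACC).

```python
import math


def rmse(forecast, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(forecast) != len(observed) or not forecast:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(f - o) ** 2 for f, o in zip(forecast, observed)]
    return math.sqrt(sum(squared) / len(squared))


def anomaly_correlation(forecast, observed, climatology):
    """Uncentered ACC: correlation of forecast and observed anomalies.

    Anomalies are departures from the climatology at each point.
    This is one common convention; it is an assumption here.
    """
    if not (len(forecast) == len(observed) == len(climatology)) or not forecast:
        raise ValueError("inputs must be non-empty and of equal length")
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    o_anom = [o - c for o, c in zip(observed, climatology)]
    numerator = sum(a * b for a, b in zip(f_anom, o_anom))
    denominator = math.sqrt(sum(a * a for a in f_anom) * sum(b * b for b in o_anom))
    if denominator == 0:
        raise ValueError("ACC is undefined when all anomalies are zero")
    return numerator / denominator


if __name__ == "__main__":
    # Toy values in kelvin, invented purely for illustration.
    clim = [288.0, 289.0, 290.0]
    obs = [288.5, 288.0, 291.0]
    fcst = [288.4, 288.3, 290.6]
    print("RMSE:", rmse(fcst, obs))
    print("ACC:", anomaly_correlation(fcst, obs, clim))
```

Under this convention a forecast that always equals climatology has zero anomalies, so its ACC is undefined, which is why ACC is usually read alongside RMSE rather than alone.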