2024-08-06

Title: Siamese Transformer Networks for Few-shot Image Classification

Authors: Weihao Jiang, Shuoxi Zhang, Kun He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01427
Pdf URL: https://arxiv.org/pdf/2408.01427
Copy Paste: [[2408.01427]] Siamese Transformer Networks for Few-shot Image Classification(https://arxiv.org/abs/2408.01427)
Keywords: transformer
Abstract: Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples. This ability is attributed to their capacity to focus on details and identify common features between previously seen and new images. In contrast, existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both. To address this limitation, we propose a novel approach based on the Siamese Transformer Network (STN). Our method employs two parallel branch networks utilizing the pre-trained Vision Transformer (ViT) architecture to extract global and local features, respectively. Specifically, we implement the ViT-Small network architecture and initialize the branch networks with pre-trained model parameters obtained through self-supervised learning. We apply the Euclidean distance measure to the global features and the Kullback-Leibler (KL) divergence measure to the local features. To integrate the two metrics, we first employ L2 normalization and then weight the normalized results to obtain the final similarity score. This strategy leverages the advantages of both global and local features while ensuring their complementary benefits. During the training phase, we adopt a meta-learning approach to fine-tune the entire network. Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules and enhancing the model's generalization ability. Extensive experiments demonstrate that our framework is simple yet effective, achieving superior performance compared to state-of-the-art baselines on four popular few-shot classification benchmarks in both 5-shot and 1-shot scenarios.

Title: Transferable Adversarial Facial Images for Privacy Protection

Authors: Minghui Li, Jiangxiong Wang, Hao Zhang, Ziqi Zhou, Shengshan Hu, Xiaobing Pei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01428
Pdf URL: https://arxiv.org/pdf/2408.01428
Copy Paste: [[2408.01428]] Transferable Adversarial Facial Images for Privacy Protection(https://arxiv.org/abs/2408.01428)
Keywords: privacy, protect, attack, generative
Abstract: The success of deep face recognition (FR) systems has raised serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Previous studies proposed introducing imperceptible adversarial noises into face images to deceive those face recognition models, thus achieving the goal of enhancing facial privacy protection. Nevertheless, they heavily rely on user-chosen references to guide the generation of adversarial noises, and cannot simultaneously construct natural and highly transferable adversarial face images in black-box scenarios. In light of this, we present a novel face privacy protection scheme with improved transferability while maintain high visual quality. We propose shaping the entire face space directly instead of exploiting one kind of facial characteristic like makeup information to integrate adversarial noises. To achieve this goal, we first exploit global adversarial latent search to traverse the latent space of the generative model, thereby creating natural adversarial face images with high transferability. We then introduce a key landmark regularization module to preserve the visual identity information. Finally, we investigate the impacts of various kinds of latent spaces and find that $\mathcal{F}$ latent space benefits the trade-off between visual naturalness and adversarial transferability. Extensive experiments over two datasets demonstrate that our approach significantly enhances attack transferability while maintaining high visual quality, outperforming state-of-the-art methods by an average 25% improvement in deep FR models and 10% improvement on commercial FR APIs, including Face++, Aliyun, and Tencent.

Title: SUSTechGAN: Image Generation for Object Recognition in Adverse Conditions of Autonomous Driving

Authors: Gongjin Lan, Yang Peng, Qi Hao, Chengzhong Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01430
Pdf URL: https://arxiv.org/pdf/2408.01430
Copy Paste: [[2408.01430]] SUSTechGAN: Image Generation for Object Recognition in Adverse Conditions of Autonomous Driving(https://arxiv.org/abs/2408.01430)
Keywords: generative
Abstract: Autonomous driving significantly benefits from data-driven deep neural networks. However, the data in autonomous driving typically fits the long-tailed distribution, in which the critical driving data in adverse conditions is hard to collect. Although generative adversarial networks (GANs) have been applied to augment data for autonomous driving, generating driving images in adverse conditions is still challenging. In this work, we propose a novel SUSTechGAN with dual attention modules and multi-scale generators to generate driving images for improving object recognition of autonomous driving in adverse conditions. We test the SUSTechGAN and the existing well-known GANs to generate driving images in adverse conditions of rain and night and apply the generated images to retrain object recognition networks. Specifically, we add generated images into the training datasets to retrain the well-known YOLOv5 and evaluate the improvement of the retrained YOLOv5 for object recognition in adverse conditions. The experimental results show that the generated driving images by our SUSTechGAN significantly improved the performance of retrained YOLOv5 in rain and night conditions, which outperforms the well-known GANs. The open-source code, video description and datasets are available on the page 1 to facilitate image generation development in autonomous driving under adverse conditions.

Title: VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance

Authors: Divyansh Srivastava, Ge Yan, Tsui-Wei Weng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01432
Pdf URL: https://arxiv.org/pdf/2408.01432
Copy Paste: [[2408.01432]] VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance(https://arxiv.org/abs/2408.01432)
Keywords: interpretability, large language model
Abstract: Concept Bottleneck Models (CBMs) provide interpretable prediction by introducing an intermediate Concept Bottleneck Layer (CBL), which encodes human-understandable concepts to explain models' decision. Recent works proposed to utilize Large Language Models (LLMs) and pre-trained Vision-Language Models (VLMs) to automate the training of CBMs, making it more scalable and automated. However, existing approaches still fall short in two aspects: First, the concepts predicted by CBL often mismatch the input image, raising doubts about the faithfulness of interpretation. Second, it has been shown that concept values encode unintended information: even a set of random concepts could achieve comparable test accuracy to state-of-the-art CBMs. To address these critical limitations, in this work, we propose a novel framework called Vision-Language-Guided Concept Bottleneck Model (VLG-CBM) to enable faithful interpretability with the benefits of boosted performance. Our method leverages off-the-shelf open-domain grounded object detectors to provide visually grounded concept annotation, which largely enhances the faithfulness of concept prediction while further improving the model performance. In addition, we propose a new metric called Number of Effective Concepts (NEC) to control the information leakage and provide better interpretability. Extensive evaluations across five standard benchmarks show that our method, VLG-CBM, outperforms existing methods by at least 4.27% and up to 51.09% on accuracy at NEC=5, and by at least 0.45% and up to 29.78% on average accuracy across different NECs, while preserves both faithfulness and interpretability of the learned concepts as demonstrated in extensive experiments.

Title: Evaluating and Enhancing Trustworthiness of LLMs in Perception Tasks

Authors: Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
Subjects: cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2408.01433
Pdf URL: https://arxiv.org/pdf/2408.01433
Copy Paste: [[2408.01433]] Evaluating and Enhancing Trustworthiness of LLMs in Perception Tasks(https://arxiv.org/abs/2408.01433)
Keywords: large language model
Abstract: Today's advanced driver assistance systems (ADAS), like adaptive cruise control or rear collision warning, are finding broader adoption across vehicle classes. Integrating such advanced, multimodal Large Language Models (LLMs) on board a vehicle, which are capable of processing text, images, audio, and other data types, may have the potential to greatly enhance passenger comfort. Yet, an LLM's hallucinations are still a major challenge to be addressed. In this paper, we systematically assessed potential hallucination detection strategies for such LLMs in the context of object detection in vision-based data on the example of pedestrian detection and localization. We evaluate three hallucination detection strategies applied to two state-of-the-art LLMs, the proprietary GPT-4V and the open LLaVA, on two datasets (Waymo/US and PREPER CITY/Sweden). Our results show that these LLMs can describe a traffic situation to an impressive level of detail but are still challenged for further analysis activities such as object localization. We evaluate and extend hallucination detection approaches when applying these LLMs to video sequences in the example of pedestrian detection. Our experiments show that, at the moment, the state-of-the-art proprietary LLM performs much better than the open LLM. Furthermore, consistency enhancement techniques based on voting, such as the Best-of-Three (BO3) method, do not effectively reduce hallucinations in LLMs that tend to exhibit high false negatives in detecting pedestrians. However, extending the hallucination detection by including information from the past helps to improve results.

Title: MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts

Authors: Lin Ning, Harsh Lara, Meiqi Guo, Abhinav Rastogi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01505
Pdf URL: https://arxiv.org/pdf/2408.01505
Copy Paste: [[2408.01505]] MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts(https://arxiv.org/abs/2408.01505)
Keywords: large language model
Abstract: Parameter-efficient fine-tuning techniques like Low-Rank Adaptation (LoRA) have revolutionized the adaptation of large language models (LLMs) to diverse tasks. Recent efforts have explored mixtures of LoRA modules for multi-task settings. However, our analysis reveals redundancy in the down-projection matrices of these architectures. This observation motivates our proposed method, Mixture of Dyadic Experts (MoDE), which introduces a novel design for efficient multi-task adaptation. This is done by sharing the down-projection matrix across tasks and employing atomic rank-one adapters, coupled with routers that allow more sophisticated task-level specialization. Our design allows for more fine-grained mixing, thereby increasing the model's ability to jointly handle multiple tasks. We evaluate MoDE on the Supernatural Instructions (SNI) benchmark consisting of a diverse set of 700+ tasks and demonstrate that it outperforms state-of-the-art multi-task parameter-efficient fine-tuning (PEFT) methods, without introducing additional parameters. Our findings contribute to a deeper understanding of parameter efficiency in multi-task LLM adaptation and provide a practical solution for deploying high-performing, lightweight models.

Title: Blockchain Economic Denial of Sustainability Attack: Exploiting Latency Optimization in Ethereum Transaction Forwarding

Authors: Taro Tsuchiya, Liyi Zhou, Kaihua Qin, Arthur Gervais, Nicolas Christin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.01508
Pdf URL: https://arxiv.org/pdf/2408.01508
Copy Paste: [[2408.01508]] Blockchain Economic Denial of Sustainability Attack: Exploiting Latency Optimization in Ethereum Transaction Forwarding(https://arxiv.org/abs/2408.01508)
Keywords: secure, attack
Abstract: Strategies related to the blockchain concept of Extractable Value (MEV/BEV), such as arbitrage, front- or back-running create an economic incentive for network nodes to reduce latency, including minimizing transaction validation time -- a core feature to secure blockchain networks. A modified node, that neglects to filter invalid transactions in the Ethereum P2P network, introduces novel attack vectors. In this work, we formalize and evaluate a Blockchain Economic Denial of Sustainability (EDoS) attack, which can cause financial losses in traffic costs for operators of modified nodes. We 1) mathematically define the attack model, 2) identify thousands of empirical instances of this similar attack in the wild, 3) empirically measure the model parameters from our two monitoring nodes, and 4) conduct attack simulations on the local network to compare its performance with existing Denial-of-Service attacks. We show that an attacker can amplify network traffic at modified nodes by a factor of 3,600, and cause economic damages 13,800 times greater than the amount needed to carry out the attack. Despite these risks, aggressive latency reduction may still be profitable enough to justify the existence of modified nodes. To assess this trade-off, we 1) simulate the transaction validation process in the local network and 2) empirically measure the latency reduction by deploying our modified node in the Ethereum testnet. We conclude with a cost-benefit analysis of skipping validation and provide mitigation strategies against this attack.

Title: Multi-Unit Floor Plan Recognition and Reconstruction Using Improved Semantic Segmentation of Raster-Wise Floor Plans

Authors: Lukas Kratochvila, Gijs de Jong, Monique Arkesteijn, Simon Bilik, Tomas Zemcik, Karel Horak, Jan S. Rellermeyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01526
Pdf URL: https://arxiv.org/pdf/2408.01526
Copy Paste: [[2408.01526]] Multi-Unit Floor Plan Recognition and Reconstruction Using Improved Semantic Segmentation of Raster-Wise Floor Plans(https://arxiv.org/abs/2408.01526)
Keywords: segmentation
Abstract: Digital twins have a major potential to form a significant part of urban management in emergency planning, as they allow more efficient designing of the escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating the twins still remains a largely manual effort, due to a lack of 3D-representations, which are available only in limited amounts for some new buildings. Thus, in this paper we aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective together with a reconstruction part of the pipeline, which vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques and several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods have achieved the mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.

Title: Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics

Authors: Alexander Gushchin, Khaled Abud, Georgii Bychkov, Ekaterina Shumitskaya, Anna Chistyakova, Sergey Lavrushkin, Bader Rasheed, Kirill Malyshev, Dmitriy Vatolin, Anastasia Antsiferova
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.01541
Pdf URL: https://arxiv.org/pdf/2408.01541
Copy Paste: [[2408.01541]] Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics(https://arxiv.org/abs/2408.01541)
Keywords: defense, attack, robust
Abstract: In the field of Image Quality Assessment (IQA), the adversarial robustness of the metrics poses a critical concern. This paper presents a comprehensive benchmarking study of various defense mechanisms in response to the rise in adversarial attacks on IQA. We systematically evaluate 25 defense strategies, including adversarial purification, adversarial training, and certified robustness methods. We applied 14 adversarial attack algorithms of various types in both non-adaptive and adaptive settings and tested these defenses against them. We analyze the differences between defenses and their applicability to IQA tasks, considering that they should preserve IQA scores and image quality. The proposed benchmark aims to guide future developments and accepts submissions of new methods, with the latest results available online: this https URL.

Title: Trainable Pointwise Decoder Module for Point Cloud Segmentation

Authors: Bike Chen, Chen Gong, Antti Tikanmäki, Juha Röning
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01548
Pdf URL: https://arxiv.org/pdf/2408.01548
Copy Paste: [[2408.01548]] Trainable Pointwise Decoder Module for Point Cloud Segmentation(https://arxiv.org/abs/2408.01548)
Keywords: segmentation
Abstract: Point cloud segmentation (PCS) aims to make per-point predictions and enables robots and autonomous driving cars to understand the environment. The range image is a dense representation of a large-scale outdoor point cloud, and segmentation models built upon the image commonly execute efficiently. However, the projection of the point cloud onto the range image inevitably leads to dropping points because, at each image coordinate, only one point is kept despite multiple points being projected onto the same location. More importantly, it is challenging to assign correct predictions to the dropped points that belong to the classes different from the kept point class. Besides, existing post-processing methods, such as K-nearest neighbor (KNN) search and kernel point convolution (KPConv), cannot be trained with the models in an end-to-end manner or cannot process varying-density outdoor point clouds well, thereby enabling the models to achieve sub-optimal performance. To alleviate this problem, we propose a trainable pointwise decoder module (PDM) as the post-processing approach, which gathers weighted features from the neighbors and then makes the final prediction for the query point. In addition, we introduce a virtual range image-guided copy-rotate-paste (VRCrop) strategy in data augmentation. VRCrop constrains the total number of points and eliminates undesirable artifacts in the augmented point cloud. With PDM and VRCrop, existing range image-based segmentation models consistently perform better than their counterparts on the SemanticKITTI, SemanticPOSS, and nuScenes datasets.

Title: Multi-task SAR Image Processing via GAN-based Unsupervised Manipulation

Authors: Xuran Hu, Mingzhe Zhu, Ziqiang Xu, Zhenpeng Feng, Ljubisa Stankovic
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.01553
Pdf URL: https://arxiv.org/pdf/2408.01553
Copy Paste: [[2408.01553]] Multi-task SAR Image Processing via GAN-based Unsupervised Manipulation(https://arxiv.org/abs/2408.01553)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have shown tremendous potential in synthesizing a large number of realistic SAR images by learning patterns in the data distribution. Some GANs can achieve image editing by introducing latent codes, demonstrating significant promise in SAR image processing. Compared to traditional SAR image processing methods, editing based on GAN latent space control is entirely unsupervised, allowing image processing to be conducted without any labeled data. Additionally, the information extracted from the data is more interpretable. This paper proposes a novel SAR image processing framework called GAN-based Unsupervised Editing (GUE), aiming to address the following two issues: (1) disentangling semantic directions in the GAN latent space and finding meaningful directions; (2) establishing a comprehensive SAR image processing framework while achieving multiple image processing functions. In the implementation of GUE, we decompose the entangled semantic directions in the GAN latent space by training a carefully designed network. Moreover, we can accomplish multiple SAR image processing tasks (including despeckling, localization, auxiliary identification, and rotation editing) in a single training process without any form of supervision. Extensive experiments validate the effectiveness of the proposed method.

Title: Counterfactual Explanations for Medical Image Classification and Regression using Diffusion Autoencoder

Authors: Matan Atad, David Schinz, Hendrik Moeller, Robert Graf, Benedikt Wiestler, Daniel Rueckert, Nassir Navab, Jan S. Kirschke, Matthias Keicher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01571
Pdf URL: https://arxiv.org/pdf/2408.01571
Copy Paste: [[2408.01571]] Counterfactual Explanations for Medical Image Classification and Regression using Diffusion Autoencoder(https://arxiv.org/abs/2408.01571)
Keywords: extraction, interpretability, diffusion, generative
Abstract: Counterfactual explanations (CEs) aim to enhance the interpretability of machine learning models by illustrating how alterations in input features would affect the resulting predictions. Common CE approaches require an additional model and are typically constrained to binary counterfactuals. In contrast, we propose a novel method that operates directly on the latent space of a generative model, specifically a Diffusion Autoencoder (DAE). This approach offers inherent interpretability by enabling the generation of CEs and the continuous visualization of the model's internal representation across decision boundaries. Our method leverages the DAE's ability to encode images into a semantically rich latent space in an unsupervised manner, eliminating the need for labeled data or separate feature extraction models. We show that these latent representations are helpful for medical condition classification and the ordinal regression of severity pathologies, such as vertebral compression fractures (VCF) and diabetic retinopathy (DR). Beyond binary CEs, our method supports the visualization of ordinal CEs using a linear model, providing deeper insights into the model's decision-making process and enhancing interpretability. Experiments across various medical imaging datasets demonstrate the method's advantages in interpretability and versatility. The linear manifold of the DAE's latent space allows for meaningful interpolation and manipulation, making it a powerful tool for exploring medical image properties. Our code is available at this https URL.

Title: THOR2: Leveraging Topological Soft Clustering of Color Space for Human-Inspired Object Recognition in Unseen Environments

Authors: Ekta U. Samani, Ashis G. Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01579
Pdf URL: https://arxiv.org/pdf/2408.01579
Copy Paste: [[2408.01579]] THOR2: Leveraging Topological Soft Clustering of Color Space for Human-Inspired Object Recognition in Unseen Environments(https://arxiv.org/abs/2408.01579)
Keywords: robust
Abstract: Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. This study presents a 3D shape and color-based descriptor, TOPS2, for point clouds generated from RGB-D images and an accompanying recognition framework, THOR2. The TOPS2 descriptor embodies object unity, a human cognition mechanism, by retaining the slicing-based topological representation of 3D shape from the TOPS descriptor while capturing object color information through slicing-based color embeddings computed using a network of coarse color regions. These color regions, analogous to the MacAdam ellipses identified in human color perception, are obtained using the Mapper algorithm, a topological soft-clustering technique. THOR2, trained using synthetic data, demonstrates markedly improved recognition accuracy compared to THOR, its 3D shape-based predecessor, on two benchmark real-world datasets: the OCID dataset capturing cluttered scenes from different viewpoints and the UW-IS Occluded dataset reflecting different environmental conditions and degrees of object occlusion recorded using commodity hardware. THOR2 also outperforms baseline deep learning networks, and a widely-used ViT adapted for RGB-D inputs on both the datasets. Therefore, THOR2 is a promising step toward achieving robust recognition in low-cost robots.

Title: Deep Learning Approach for Ear Recognition and Longitudinal Evaluation in Children

Authors: Afzal Hossain, Tipu Sultan, Stephanie Schuckers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01588
Pdf URL: https://arxiv.org/pdf/2408.01588
Copy Paste: [[2408.01588]] Deep Learning Approach for Ear Recognition and Longitudinal Evaluation in Children(https://arxiv.org/abs/2408.01588)
Keywords: biometric
Abstract: Ear recognition as a biometric modality is becoming increasingly popular, with promising broader application areas. While current applications involve adults, one of the challenges in ear recognition for children is the rapid structural changes in the ear as they age. This work introduces a foundational longitudinal dataset collected from children aged 4 to 14 years over a 2.5-year period and evaluates ear recognition performance in this demographic. We present a deep learning based approach for ear recognition, using an ensemble of VGG16 and MobileNet, focusing on both adult and child datasets, with an emphasis on longitudinal evaluation for children.

Title: Trustworthy Machine Learning under Social and Adversarial Data Sources

Authors: Han Shao
Subjects: cs.LG, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2408.01596
Pdf URL: https://arxiv.org/pdf/2408.01596
Copy Paste: [[2408.01596]] Trustworthy Machine Learning under Social and Adversarial Data Sources(https://arxiv.org/abs/2408.01596)
Keywords: attack
Abstract: Machine learning has witnessed remarkable breakthroughs in recent years. As machine learning permeates various aspects of daily life, individuals and organizations increasingly interact with these systems, exhibiting a wide range of social and adversarial behaviors. These behaviors may have a notable impact on the behavior and performance of machine learning systems. Specifically, during these interactions, data may be generated by strategic individuals, collected by self-interested data collectors, possibly poisoned by adversarial attackers, and used to create predictors, models, and policies satisfying multiple objectives. As a result, the machine learning systems' outputs might degrade, such as the susceptibility of deep neural networks to adversarial examples (Shafahi et al., 2018; Szegedy et al., 2013) and the diminished performance of classic algorithms in the presence of strategic individuals (Ahmadi et al., 2021). Addressing these challenges is imperative for the success of machine learning in societal settings.

Title: CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Authors: Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, Joshua Saxe
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01605
Pdf URL: https://arxiv.org/pdf/2408.01605
Copy Paste: [[2408.01605]] CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models(https://arxiv.org/abs/2408.01605)
Keywords: security, large language model
Abstract: We are releasing a new suite of security benchmarks for LLMs, CYBERSECEVAL 3, to continue the conversation on empirically measuring LLM cybersecurity risks and capabilities. CYBERSECEVAL 3 assesses 8 different risks across two broad categories: risk to third parties, and risk to application developers and end users. Compared to previous work, we add new areas focused on offensive security capabilities: automated social engineering, scaling manual offensive cyber operations, and autonomous offensive cyber operations. In this paper we discuss applying these benchmarks to the Llama 3 models and a suite of contemporaneous state-of-the-art LLMs, enabling us to contextualize risks both with and without mitigations in place.

Title: JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Language Model

Authors: Farzaneh Jafari, Stefano Berretti, Anup Basu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01627
Pdf URL: https://arxiv.org/pdf/2408.01627
Copy Paste: [[2408.01627]] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Language Model(https://arxiv.org/abs/2408.01627)
Keywords: transformer
Abstract: In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high video quality. However, no single model has yet achieved equivalence across all these metrics. This paper aims to animate a 3D face using Jamba, a hybrid Transformers-Mamba model. Mamba, a pioneering Structured State Space Model (SSM) architecture, was designed to address the constraints of the conventional Transformer architecture. Nevertheless, it has several drawbacks. Jamba merges the advantages of both Transformer and Mamba approaches, providing a holistic solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and speed through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.

Title: Fair Risk Minimization under Causal Path-Specific Effect Constraints

Authors: Razieh Nabi, David Benkeser
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2408.01630
Pdf URL: https://arxiv.org/pdf/2408.01630
Copy Paste: [[2408.01630]] Fair Risk Minimization under Causal Path-Specific Effect Constraints(https://arxiv.org/abs/2408.01630)
Keywords: robust, fair
Abstract: This paper introduces a framework for estimating fair optimal predictions using machine learning where the notion of fairness can be quantified using path-specific causal effects. We use a recently developed approach based on Lagrange multipliers for infinite-dimensional functional estimation to derive closed-form solutions for constrained optimization based on mean squared error and cross-entropy risk criteria. The theoretical forms of the solutions are analyzed in detail and described as nuanced adjustments to the unconstrained minimizer. This analysis highlights important trade-offs between risk minimization and achieving fairnes. The theoretical solutions are also used as the basis for construction of flexible semiparametric estimation strategies for these nuisance components. We describe the robustness properties of our estimators in terms of achieving the optimal constrained risk, as well as in terms of controlling the value of the constraint. We study via simulation the impact of using robust estimators of pathway-specific effects to validate our theory. This work advances the discourse on algorithmic fairness by integrating complex causal considerations into model training, thus providing strategies for implementing fair models in real-world applications.

Title: Transforming Slot Schema Induction with Generative Dialogue State Inference

Authors: James D. Finch, Boxin Zhao, Jinho D. Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01638
Pdf URL: https://arxiv.org/pdf/2408.01638
Copy Paste: [[2408.01638]] Transforming Slot Schema Induction with Generative Dialogue State Inference(https://arxiv.org/abs/2408.01638)
Keywords: generative
Abstract: The challenge of defining a slot schema to represent the state of a task-oriented dialogue system is addressed by Slot Schema Induction (SSI), which aims to automatically induce slots from unlabeled dialogue data. Whereas previous approaches induce slots by clustering value spans extracted directly from the dialogue text, we demonstrate the power of discovering slots using a generative approach. By training a model to generate slot names and values that summarize key dialogue information with no prior task knowledge, our SSI method discovers high-quality candidate information for representing dialogue state. These discovered slot-value candidates can be easily clustered into unified slot schemas that align well with human-authored schemas. Experimental comparisons on the MultiWOZ and SGD datasets demonstrate that Generative Dialogue State Inference (GenDSI) outperforms the previous state-of-the-art on multiple aspects of the SSI task.

Title: Leveraging GNSS and Onboard Visual Data from Consumer Vehicles for Robust Road Network Estimation

Authors: Balázs Opra, Betty Le Dem, Jeffrey M. Walls, Dimitar Lukarski, Cyrill Stachniss
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2408.01640
Pdf URL: https://arxiv.org/pdf/2408.01640
Copy Paste: [[2408.01640]] Leveraging GNSS and Onboard Visual Data from Consumer Vehicles for Robust Road Network Estimation(https://arxiv.org/abs/2408.01640)
Keywords: robust, segmentation
Abstract: Maps are essential for diverse applications, such as vehicle navigation and autonomous robotics. Both require spatial models for effective route planning and localization. This paper addresses the challenge of road graph construction for autonomous vehicles. Despite recent advances, creating a road graph remains labor-intensive and has yet to achieve full automation. The goal of this paper is to generate such graphs automatically and accurately. Modern cars are equipped with onboard sensors used for today's advanced driver assistance systems like lane keeping. We propose using global navigation satellite system (GNSS) traces and basic image data acquired from these standard sensors in consumer vehicles to estimate road-level maps with minimal effort. We exploit the spatial information in the data by framing the problem as a road centerline semantic segmentation task using a convolutional neural network. We also utilize the data's time series nature to refine the neural network's output by using map matching. We implemented and evaluated our method using a fleet of real consumer vehicles, only using the deployed onboard sensors. Our evaluation demonstrates that our approach not only matches existing methods on simpler road configurations but also significantly outperforms them on more complex road geometries and topologies. This work received the 2023 Woven by Toyota Invention Award.

Title: SAT3D: Image-driven Semantic Attribute Transfer in 3D

Authors: Zhijun Zhai, Zengmao Wang, Xiaoxiao Long, Kaixuan Zhou, Bo Du
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01664
Pdf URL: https://arxiv.org/pdf/2408.01664
Copy Paste: [[2408.01664]] SAT3D: Image-driven Semantic Attribute Transfer in 3D(https://arxiv.org/abs/2408.01664)
Keywords: generative
Abstract: GAN-based image editing task aims at manipulating image attributes in the latent space of generative models. Most of the previous 2D and 3D-aware approaches mainly focus on editing attributes in images with ambiguous semantics or regions from a reference image, which fail to achieve photographic semantic attribute transfer, such as the beard from a photo of a man. In this paper, we propose an image-driven Semantic Attribute Transfer method in 3D (SAT3D) by editing semantic attributes from a reference image. For the proposed method, the exploration is conducted in the style space of a pre-trained 3D-aware StyleGAN-based generator by learning the correlations between semantic attributes and style code channels. For guidance, we associate each attribute with a set of phrase-based descriptor groups, and develop a Quantitative Measurement Module (QMM) to quantitatively describe the attribute characteristics in images based on descriptor groups, which leverages the image-text comprehension capability of CLIP. During the training process, the QMM is incorporated into attribute losses to calculate attribute similarity between images, guiding target semantic transferring and irrelevant semantics preserving. We present our 3D-aware attribute transfer results across multiple domains and also conduct comparisons with classical 2D image editing methods, demonstrating the effectiveness and customizability of our SAT3D.

Title: Automated Phishing Detection Using URLs and Webpages

Authors: Huilin Wang, Bryan Hooi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.01667
Pdf URL: https://arxiv.org/pdf/2408.01667
Copy Paste: [[2408.01667]] Automated Phishing Detection Using URLs and Webpages(https://arxiv.org/abs/2408.01667)
Keywords: security, large language model
Abstract: Phishing detection is a critical cybersecurity task that involves the identification and neutralization of fraudulent attempts to obtain sensitive information, thereby safeguarding individuals and organizations from data breaches and financial loss. In this project, we address the constraints of traditional reference-based phishing detection by developing an LLM agent framework. This agent harnesses Large Language Models to actively fetch and utilize online information, thus providing a dynamic reference system for more accurate phishing detection. This innovation circumvents the need for a static knowledge base, offering a significant enhancement in adaptability and efficiency for automated security measures. The project report includes an initial study and problem analysis of existing solutions, which motivated us to develop a new framework. We demonstrate the framework with LLMs simulated as agents and detail the techniques required for construction, followed by a complete implementation with a proof-of-concept as well as experiments to evaluate our solution's performance against other similar solutions. The results show that our approach has achieved with accuracy of 0.945, significantly outperforms the existing solution(DynaPhish) by 0.445. Furthermore, we discuss the limitations of our approach and suggest improvements that could make it more effective. Overall, the proposed framework has the potential to enhance the effectiveness of current reference-based phishing detection approaches and could be adapted for real-world applications.

Title: Multiple Contexts and Frequencies Aggregation Network forDeepfake Detection

Authors: Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2408.01668
Pdf URL: https://arxiv.org/pdf/2408.01668
Copy Paste: [[2408.01668]] Multiple Contexts and Frequencies Aggregation Network forDeepfake Detection(https://arxiv.org/abs/2408.01668)
Keywords: robust, generative
Abstract: Deepfake detection faces increasing challenges since the fast growth of generative models in developing massive and diverse Deepfake technologies. Recent advances rely on introducing heuristic features from spatial or frequency domains rather than modeling general forgery features within backbones. To address this issue, we turn to the backbone design with two intuitive priors from spatial and frequency detectors, \textit{i.e.,} learning robust spatial attributes and frequency distributions that are discriminative for real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions for modeling subtle facial differences between real and fake faces. For the frequency components, we propose a Multi-Frequency Aggregator to process different bands of frequency components by adaptively reweighing high-frequency and low-frequency features. Comprehensive experiments on seven popular deepfake detection benchmarks demonstrate that our proposed MkfaNet variants achieve superior performances in both within-domain and across-domain evaluations with impressive efficiency of parameter usage.

Title: iControl3D: An Interactive System for Controllable 3D Scene Generation

Authors: Xingyi Li, Yizheng Wu, Jun Cen, Juewen Peng, Kewei Wang, Ke Xian, Zhe Wang, Zhiguo Cao, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01678
Pdf URL: https://arxiv.org/pdf/2408.01678
Copy Paste: [[2408.01678]] iControl3D: An Interactive System for Controllable 3D Scene Generation(https://arxiv.org/abs/2408.01678)
Keywords: diffusion
Abstract: 3D content creation has long been a complex and time-consuming process, often requiring specialized skills and resources. While recent advancements have allowed for text-guided 3D object and scene generation, they still fall short of providing sufficient control over the generation process, leading to a gap between the user's creative vision and the generated results. In this paper, we present iControl3D, a novel interactive system that empowers users to generate and render customizable 3D scenes with precise control. To this end, a 3D creator interface has been developed to provide users with fine-grained control over the creation process. Technically, we leverage 3D meshes as an intermediary proxy to iteratively merge individual 2D diffusion-generated images into a cohesive and unified 3D scene representation. To ensure seamless integration of 3D meshes, we propose to perform boundary-aware depth alignment before fusing the newly generated mesh with the existing one in 3D space. Additionally, to effectively manage depth discrepancies between remote content and foreground, we propose to model remote content separately with an environment map instead of 3D meshes. Finally, our neural rendering interface enables users to build a radiance field of their scene online and navigate the entire scene. Extensive experiments have been conducted to demonstrate the effectiveness of our system. The code will be made available at this https URL.

Title: MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph

Authors: Xuan Yi, Yanzeng Li, Lei Zou
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2408.01679
Pdf URL: https://arxiv.org/pdf/2408.01679
Copy Paste: [[2408.01679]] MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph(https://arxiv.org/abs/2408.01679)
Keywords: robust
Abstract: Multi-modal knowledge graphs have emerged as a powerful approach for information representation, combining data from different modalities such as text, images, and videos. While several such graphs have been constructed and have played important roles in applications like visual question answering and recommendation systems, challenges persist in their development. These include the scarcity of high-quality Chinese knowledge graphs and limited domain coverage in existing multi-modal knowledge graphs. This paper introduces MMPKUBase, a robust and extensive Chinese multi-modal knowledge graph that covers diverse domains, including birds, mammals, ferns, and more, comprising over 50,000 entities and over 1 million filtered images. To ensure data quality, we employ Prototypical Contrastive Learning and the Isolation Forest algorithm to refine the image data. Additionally, we have developed a user-friendly platform to facilitate image attribute exploration.

Title: SiamMo: Siamese Motion-Centric 3D Object Tracking

Authors: Yuxiang Yang, Yingqi Deng, Jing Zhang, Hongjie Gu, Zhekang Don
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01688
Pdf URL: https://arxiv.org/pdf/2408.01688
Copy Paste: [[2408.01688]] SiamMo: Siamese Motion-Centric 3D Object Tracking(https://arxiv.org/abs/2408.01688)
Keywords: robust, extraction, segmentation
Abstract: Current 3D single object tracking methods primarily rely on the Siamese matching-based paradigm, which struggles with textureless and incomplete LiDAR point clouds. Conversely, the motion-centric paradigm avoids appearance matching, thus overcoming these issues. However, its complex multi-stage pipeline and the limited temporal modeling capability of a single-stream architecture constrain its potential. In this paper, we introduce SiamMo, a novel and simple Siamese motion-centric tracking approach. Unlike the traditional single-stream architecture, we employ Siamese feature extraction for motion-centric tracking. This decouples feature extraction from temporal fusion, significantly enhancing tracking performance. Additionally, we design a Spatio-Temporal Feature Aggregation module to integrate Siamese features at multiple scales, capturing motion information effectively. We also introduce a Box-aware Feature Encoding module to encode object size priors into motion estimation. SiamMo is a purely motion-centric tracker that eliminates the need for additional processes like segmentation and box refinement. Without whistles and bells, SiamMo not only surpasses state-of-the-art methods across multiple benchmarks but also demonstrates exceptional robustness in challenging scenarios. SiamMo sets a new record on the KITTI tracking benchmark with 90.1\% precision while maintaining a high inference speed of 108 FPS. The code will be released at this https URL.

Title: Controllable Unlearning for Image-to-Image Generative Models via $\varepsilon$-Constrained Optimization

Authors: Xiaohua Feng, Chaochao Chen, Yuyuan Li, Li Zhang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2408.01689
Pdf URL: https://arxiv.org/pdf/2408.01689
Copy Paste: [[2408.01689]] Controllable Unlearning for Image-to-Image Generative Models via $\varepsilon$-Constrained Optimization(https://arxiv.org/abs/2408.01689)
Keywords: privacy, generative
Abstract: While generative models have made significant advancements in recent years, they also raise concerns such as privacy breaches and biases. Machine unlearning has emerged as a viable solution, aiming to remove specific training data, e.g., containing private information and bias, from models. In this paper, we study the machine unlearning problem in Image-to-Image (I2I) generative models. Previous studies mainly treat it as a single objective optimization problem, offering a solitary solution, thereby neglecting the varied user expectations towards the trade-off between complete unlearning and model utility. To address this issue, we propose a controllable unlearning framework that uses a control coefficient $\varepsilon$ to control the trade-off. We reformulate the I2I generative model unlearning problem into a $\varepsilon$-constrained optimization problem and solve it with a gradient-based method to find optimal solutions for unlearning boundaries. These boundaries define the valid range for the control coefficient. Within this range, every yielded solution is theoretically guaranteed with Pareto optimality. We also analyze the convergence rate of our framework under various control functions. Extensive experiments on two benchmark datasets across three mainstream I2I models demonstrate the effectiveness of our controllable unlearning framework.

Title: IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection

Authors: Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2408.01690
Pdf URL: https://arxiv.org/pdf/2408.01690
Copy Paste: [[2408.01690]] IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection(https://arxiv.org/abs/2408.01690)
Keywords: security, privacy
Abstract: Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.

Title: TreeCSS: An Efficient Framework for Vertical Federated Learning

Authors: Qinbo Zhang, Xiao Yan, Yukai Ding, Quanqing Xu, Chuang Hu, Xiaokai Zhou, Jiawei Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01691
Pdf URL: https://arxiv.org/pdf/2408.01691
Copy Paste: [[2408.01691]] TreeCSS: An Efficient Framework for Vertical Federated Learning(https://arxiv.org/abs/2408.01691)
Keywords: security, federate
Abstract: Vertical federated learning (VFL) considers the case that the features of data samples are partitioned over different participants. VFL consists of two main steps, i.e., identify the common data samples for all participants (alignment) and train model using the aligned data samples (training). However, when there are many participants and data samples, both alignment and training become slow. As such, we propose TreeCSS as an efficient VFL framework that accelerates the two main steps. In particular, for sample alignment, we design an efficient multi-party private set intersection (MPSI) protocol called Tree-MPSI, which adopts a tree-based structure and a data-volume-aware scheduling strategy to parallelize alignment among the participants. As model training time scales with the number of data samples, we conduct coreset selection (CSS) to choose some representative data samples for training. Our CCS method adopts a clustering-based scheme for security and generality, which first clusters the features locally on each participant and then merges the local clustering results to select representative samples. In addition, we weight the samples according to their distances to the centroids to reflect their importance to model training. We evaluate the effectiveness and efficiency of our TreeCSS framework on various datasets and models. The results show that compared with vanilla VFL, TreeCSS accelerates training by up to 2.93x and achieves comparable model accuracy.

Title: A Comparative Analysis of CNN-based Deep Learning Models for Landslide Detection

Authors: Omkar Oak, Rukmini Nazre, Soham Naigaonkar, Suraj Sawant, Himadri Vaidya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01692
Pdf URL: https://arxiv.org/pdf/2408.01692
Copy Paste: [[2408.01692]] A Comparative Analysis of CNN-based Deep Learning Models for Landslide Detection(https://arxiv.org/abs/2408.01692)
Keywords: segmentation
Abstract: Landslides inflict substantial societal and economic damage, underscoring their global significance as recurrent and destructive natural disasters. Recent landslides in northern parts of India and Nepal have caused significant disruption, damaging infrastructure and posing threats to local communities. Convolutional Neural Networks (CNNs), a type of deep learning technique, have shown remarkable success in image processing. Because of their sophisticated architectures, advanced CNN-based models perform better in landslide detection than conventional algorithms. The purpose of this work is to investigate CNNs' potential in more detail, with an emphasis on comparison of CNN based models for better landslide detection. We compared four traditional semantic segmentation models (U-Net, LinkNet, PSPNet, and FPN) and utilized the ResNet50 backbone encoder to implement them. Moreover, we have experimented with the hyperparameters such as learning rates, batch sizes, and regularization techniques to fine-tune the models. We have computed the confusion matrix for each model and used performance metrics including precision, recall and f1-score to evaluate and compare the deep learning models. According to the experimental results, LinkNet gave the best results among the four models having an Accuracy of 97.49% and a F1-score of 85.7% (with 84.49% precision, 87.07% recall). We have also presented a comprehensive comparison of all pixel-wise confusion matrix results and the time taken to train each model.

Title: Bayesian Active Learning for Semantic Segmentation

Authors: Sima Didari, Wenjun Hu, Jae Oh Woo, Heng Hao, Hankyu Moon, Seungjai Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01694
Pdf URL: https://arxiv.org/pdf/2408.01694
Copy Paste: [[2408.01694]] Bayesian Active Learning for Semantic Segmentation(https://arxiv.org/abs/2408.01694)
Keywords: segmentation
Abstract: Fully supervised training of semantic segmentation models is costly and challenging because each pixel within an image needs to be labeled. Therefore, the sparse pixel-level annotation methods have been introduced to train models with a subset of pixels within each image. We introduce a Bayesian active learning framework based on sparse pixel-level annotation that utilizes a pixel-level Bayesian uncertainty measure based on Balanced Entropy (BalEnt) [84]. BalEnt captures the information between the models' predicted marginalized probability distribution and the pixel labels. BalEnt has linear scalability with a closed analytical form and can be calculated independently per pixel without relational computations with other pixels. We train our proposed active learning framework for Cityscapes, Camvid, ADE20K and VOC2012 benchmark datasets and show that it reaches supervised levels of mIoU using only a fraction of labeled pixels while outperforming the previous state-of-the-art active learning models with a large margin.

Title: Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics

Authors: Naichuan Zheng, Hailun Xia, Dapeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01701
Pdf URL: https://arxiv.org/pdf/2408.01701
Copy Paste: [[2408.01701]] Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics(https://arxiv.org/abs/2408.01701)
Keywords: extraction
Abstract: In skeletal-based action recognition, Graph Convolutional Networks (GCNs) based methods face limitations due to their complexity and high energy consumption. Spiking Neural Networks (SNNs) have gained attention in recent years for their low energy consumption, but existing methods combining GCNs and SNNs fail to fully utilize the temporal characteristics of skeletal sequences, leading to increased storage and computational costs. To address this issue, we propose a Signal-SGN(Spiking Graph Convolutional Network), which leverages the temporal dimension of skeletal sequences as the spiking timestep and treats features as discrete stochastic signals. The core of the network consists of a 1D Spiking Graph Convolutional Network (1D-SGN) and a Frequency Spiking Convolutional Network (FSN). The SGN performs graph convolution on single frames and incorporates spiking network characteristics to capture inter-frame temporal relationships, while the FSN uses Fast Fourier Transform (FFT) and complex convolution to extract temporal-frequency features. We also introduce a multi-scale wavelet transform feature fusion module(MWTF) to capture spectral features of temporal signals, enhancing the model's classification capability. We propose a pluggable temporal-frequency spatial semantic feature extraction module(TFSM) to enhance the model's ability to distinguish features without increasing inference-phase consumption. Our numerous experiments on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets demonstrate that the proposed models not only surpass existing SNN-based methods in accuracy but also reduce computational and storage costs during training. Furthermore, they achieve competitive accuracy compared to corresponding GCN-based methods, which is quite remarkable.

Title: Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

Authors: Weijie Zheng, Xingjun Ma, Hanxun Huang, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01705
Pdf URL: https://arxiv.org/pdf/2408.01705
Copy Paste: [[2408.01705]] Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers(https://arxiv.org/abs/2408.01705)
Keywords: attack, robust, transformer
Abstract: With the advancement of vision transformers (ViTs) and self-supervised learning (SSL) techniques, pre-trained large ViTs have become the new foundation models for computer vision applications. However, studies have shown that, like convolutional neural networks (CNNs), ViTs are also susceptible to adversarial attacks, where subtle perturbations in the input can fool the model into making false predictions. This paper studies the transferability of such an adversarial vulnerability from a pre-trained ViT model to downstream tasks. We focus on \emph{sample-wise} transfer attacks and propose a novel attack method termed \emph{Downstream Transfer Attack (DTA)}. For a given test image, DTA leverages a pre-trained ViT model to craft the adversarial example and then applies the adversarial example to attack a fine-tuned version of the model on a downstream dataset. During the attack, DTA identifies and exploits the most vulnerable layers of the pre-trained model guided by a cosine similarity loss to craft highly transferable attacks. Through extensive experiments with pre-trained ViTs by 3 distinct pre-training methods, 3 fine-tuning schemes, and across 10 diverse downstream datasets, we show that DTA achieves an average attack success rate (ASR) exceeding 90\%, surpassing existing methods by a huge margin. When used with adversarial training, the adversarial examples generated by our DTA can significantly improve the model's robustness to different downstream transfer attacks.

Title: AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Authors: Zili Wang, Qi Yang, Linsu Shi, Jiazhong Yu, Qinghua Liang, Fei Li, Shiming Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01708
Pdf URL: https://arxiv.org/pdf/2408.01708
Copy Paste: [[2408.01708]] AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation(https://arxiv.org/abs/2408.01708)
Keywords: transformer, segmentation
Abstract: Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their expensive computational cost makes real-time inference impractical. By characterizing attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, corresponding to the over-concentrated attention weights by Softmax within restricted frames, and 2) inefficient, burdensome transformer decoder, caused by narrow focus patterns in early stages. In this paper, we introduce AVESFormer, the first real-time Audio-Visual Efficient Segmentation transformer that achieves fast, efficient and light-weight simultaneously. Our model leverages an efficient prompt query generator to correct the behaviour of cross-attention. Additionally, we propose ELF decoder to bring greater efficiency by facilitating convolutions suitable for local features to reduce computational burdens. Extensive experiments demonstrate that our AVESFormer significantly enhances model performance, achieving 79.9% on S4, 57.9% on MS3 and 31.2% on AVSS, outperforming previous state-of-the-art and achieving an excellent trade-off between performance and speed. Code can be found at this https URL.

Title: A General Ambiguity Model for Binary Edge Images with Edge Tracing and its Implementation

Authors: Markus Hennig, Marc Leineke, Bärbel Mertsching
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01712
Pdf URL: https://arxiv.org/pdf/2408.01712
Copy Paste: [[2408.01712]] A General Ambiguity Model for Binary Edge Images with Edge Tracing and its Implementation(https://arxiv.org/abs/2408.01712)
Keywords: segmentation
Abstract: We present a general and intuitive ambiguity model for intersections, junctions and other structures in binary edge images. The model is combined with edge tracing, where edges are ordered sequences of connected pixels. The objective is to provide a versatile preprocessing method for tasks such as figure-ground segmentation, object recognition, topological analysis, etc. By using only a small set of straightforward principles, the results are intuitive to describe. This helps to implement subsequent processing steps, such as resolving ambiguous edge connections at junctions. By using an augmented edge map, neighboring edges can be directly accessed using quick local search operations. The edge tracing uses recursion, which leads to compact programming code. We explain our algorithm using pseudocode, compare it with related methods, and show how simple modular postprocessing steps can be used to optimize the results. The complete algorithm, including all data structures, requires less than 50 lines of pseudocode. We also provide a C++ implementation of our method.

Title: Intuitionistic Fuzzy Generalized Eigenvalue Proximal Support Vector Machine

Authors: A. Quadir, M. A. Ganaie, M. Tanveer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.01713
Pdf URL: https://arxiv.org/pdf/2408.01713
Copy Paste: [[2408.01713]] Intuitionistic Fuzzy Generalized Eigenvalue Proximal Support Vector Machine(https://arxiv.org/abs/2408.01713)
Keywords: robust
Abstract: Generalized eigenvalue proximal support vector machine (GEPSVM) has attracted widespread attention due to its simple architecture, rapid execution, and commendable performance. GEPSVM gives equal significance to all samples, thereby diminishing its robustness and efficacy when confronted with real-world datasets containing noise and outliers. In order to reduce the impact of noises and outliers, we propose a novel intuitionistic fuzzy generalized eigenvalue proximal support vector machine (IF-GEPSVM). The proposed IF-GEPSVM assigns the intuitionistic fuzzy score to each training sample based on its location and surroundings in the high-dimensional feature space by using a kernel function. The solution of the IF-GEPSVM optimization problem is obtained by solving a generalized eigenvalue problem. Further, we propose an intuitionistic fuzzy improved GEPSVM (IF-IGEPSVM) by solving the standard eigenvalue decomposition resulting in simpler optimization problems with less computation cost which leads to an efficient intuitionistic fuzzy-based model. We conduct a comprehensive evaluation of the proposed IF-GEPSVM and IF-IGEPSVM models on UCI and KEEL datasets. Moreover, to evaluate the robustness of the proposed IF-GEPSVM and IF-IGEPSVM models, label noise is introduced into some UCI and KEEL datasets. The experimental findings showcase the superior generalization performance of the proposed models when compared to the existing baseline models, both with and without label noise. Our experimental results, supported by rigorous statistical analyses, confirm the superior generalization abilities of the proposed IF-GEPSVM and IF-IGEPSVM models over the baseline models. Furthermore, we implement the proposed IF-GEPSVM and IF-IGEPSVM models on the USPS recognition dataset, yielding promising results that underscore the models' effectiveness in practical and real-world applications.

Title: Joint Universal Adversarial Perturbations with Interpretations

Authors: Liang-bo Ning, Zeyu Dai, Wenqi Fan, Jingran Su, Chao Pan, Luning Wang, Qing Li
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01715
Pdf URL: https://arxiv.org/pdf/2408.01715
Copy Paste: [[2408.01715]] Joint Universal Adversarial Perturbations with Interpretations(https://arxiv.org/abs/2408.01715)
Keywords: attack
Abstract: Deep neural networks (DNNs) have significantly boosted the performance of many challenging tasks. Despite the great development, DNNs have also exposed their vulnerability. Recent studies have shown that adversaries can manipulate the predictions of DNNs by adding a universal adversarial perturbation (UAP) to benign samples. On the other hand, increasing efforts have been made to help users understand and explain the inner working of DNNs by highlighting the most informative parts (i.e., attribution maps) of samples with respect to their predictions. Moreover, we first empirically find that such attribution maps between benign and adversarial examples have a significant discrepancy, which has the potential to detect universal adversarial perturbations for defending against adversarial attacks. This finding motivates us to further investigate a new research problem: whether there exist universal adversarial perturbations that are able to jointly attack DNNs classifier and its interpretation with malicious desires. It is challenging to give an explicit answer since these two objectives are seemingly conflicting. In this paper, we propose a novel attacking framework to generate joint universal adversarial perturbations (JUAP), which can fool the DNNs model and misguide the inspection from interpreters simultaneously. Comprehensive experiments on various datasets demonstrate the effectiveness of the proposed method JUAP for joint attacks. To the best of our knowledge, this is the first effort to study UAP for jointly attacking both DNNs and interpretations.

Title: A Novel Evaluation Framework for Image2Text Generation

Authors: Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Alessio M. Pacces, Evangelos Kanoulas
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2408.01723
Pdf URL: https://arxiv.org/pdf/2408.01723
Copy Paste: [[2408.01723]] A Novel Evaluation Framework for Image2Text Generation(https://arxiv.org/abs/2408.01723)
Keywords: large language model
Abstract: Evaluating the quality of automatically generated image descriptions is challenging, requiring metrics that capture various aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often show weak correlations with human judgment. We address this challenge by introducing a novel evaluation framework rooted in a modern large language model (LLM), such as GPT-4 or Gemini, capable of image generation. In our proposed framework, we begin by feeding an input image into a designated image captioning model, chosen for evaluation, to generate a textual description. Using this description, an LLM then creates a new image. By extracting features from both the original and LLM-created images, we measure their similarity using a designated similarity metric. A high similarity score suggests that the image captioning model has accurately generated textual descriptions, while a low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance. Human-annotated reference captions are not required in our proposed evaluation framework, which serves as a valuable tool for evaluating the effectiveness of image captioning models. Its efficacy is confirmed through human evaluation.

Title: Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Authors: Jintao Tan, Xize Cheng, Lingyu Xiong, Lei Zhu, Xiandong Li, Xianjia Wu, Kai Gong, Minglei Li, Yi Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01732
Pdf URL: https://arxiv.org/pdf/2408.01732
Copy Paste: [[2408.01732]] Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation(https://arxiv.org/abs/2408.01732)
Keywords: diffusion
Abstract: Audio-driven talking head generation is a significant and challenging task applicable to various fields such as virtual avatars, film production, and online conferences. However, the existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address the aforementioned problems, we introduce a two-stage diffusion-based model. The first stage involves generating synchronized facial landmarks based on the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that our model yields the best performance.

Title: LAM3D: Leveraging Attention for Monocular 3D Object Detection

Authors: Diana-Alexandra Sas, Leandro Di Bella, Yangxintong Lyu, Florin Oniga, Adrian Munteanu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01739
Pdf URL: https://arxiv.org/pdf/2408.01739
Copy Paste: [[2408.01739]] LAM3D: Leveraging Attention for Monocular 3D Object Detection(https://arxiv.org/abs/2408.01739)
Keywords: extraction, transformer, segmentation
Abstract: Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, the Vision Transformer-based architectures gained a lot of popularity in the field, being used for tasks such as image classification, object detection and image segmentation. However, efficiently leveraging the attention mechanism in vision transformers for the Monocular 3D Object Detection task remains an open question. In this paper, we present LAM3D, a framework that Leverages self-Attention mechanism for Monocular 3D object Detection. To do so, the proposed method is built upon a Pyramid Vision Transformer v2 (PVTv2) as feature extraction backbone and 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, proving the applicability of the proposed solution in the autonomous driving domain and outperforming reference methods. Moreover, due to the usage of self-attention, LAM3D is able to systematically outperform the equivalent architecture that does not employ self-attention.

Title: Summarization of Investment Reports Using Pre-trained Model

Authors: Hiroki Sakaji, Ryotaro Kobayashi, Kiyoshi Izumi, Hiroyuki Mitsugi, Wataru Kuramoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01744
Pdf URL: https://arxiv.org/pdf/2408.01744
Copy Paste: [[2408.01744]] Summarization of Investment Reports Using Pre-trained Model(https://arxiv.org/abs/2408.01744)
Keywords: transformer
Abstract: In this paper, we attempt to summarize monthly reports as investment reports. Fund managers have a wide range of tasks, one of which is the preparation of investment reports. In addition to preparing monthly reports on fund management, fund managers prepare management reports that summarize these monthly reports every six months or once a year. The preparation of fund reports is a labor-intensive and time-consuming task. Therefore, in this paper, we tackle investment summarization from monthly reports using transformer-based models. There are two main types of summarization methods: extractive summarization and abstractive summarization, and this study constructs both methods and examines which is more useful in summarizing investment reports.

Title: Indexing and Visualization of Climate Change Narratives Using BERT and Causal Extraction

Authors: Hiroki Sakaji, Noriyasu Kaneda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01745
Pdf URL: https://arxiv.org/pdf/2408.01745
Copy Paste: [[2408.01745]] Indexing and Visualization of Climate Change Narratives Using BERT and Causal Extraction(https://arxiv.org/abs/2408.01745)
Keywords: extraction, transformer
Abstract: In this study, we propose a methodology to extract, index, and visualize ``climate change narratives'' (stories about the connection between causal and consequential events related to climate change). We use two natural language processing methods, BERT (Bidirectional Encoder Representations from Transformers) and causal extraction, to textually analyze newspaper articles on climate change to extract ``climate change narratives.'' The novelty of the methodology could extract and quantify the causal relationships assumed by the newspaper's writers. Looking at the extracted climate change narratives over time, we find that since 2018, an increasing number of narratives suggest the impact of the development of climate change policy discussion and the implementation of climate change-related policies on corporate behaviors, macroeconomics, and price dynamics. We also observed the recent emergence of narratives focusing on the linkages between climate change-related policies and monetary policy. Furthermore, there is a growing awareness of the negative impacts of natural disasters (e.g., abnormal weather and severe floods) related to climate change on economic activities, and this issue might be perceived as a new challenge for companies and governments. The methodology of this study is expected to be applied to a wide range of fields, as it can analyze causal relationships among various economic topics, including analysis of inflation expectation or monetary policy communication strategy.

Title: Domain penalisation for improved Out-of-Distribution Generalisation

Authors: Shuvam Jena, Sushmetha Sumathi Rajendran, Karthik Seemakurthy, Sasithradevi A, Vijayalakshmi M, Prakash Poornachari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01746
Pdf URL: https://arxiv.org/pdf/2408.01746
Copy Paste: [[2408.01746]] Domain penalisation for improved Out-of-Distribution Generalisation(https://arxiv.org/abs/2408.01746)
Keywords: robust
Abstract: In the field of object detection, domain generalisation (DG) aims to ensure robust performance across diverse and unseen target domains by learning the robust domain-invariant features corresponding to the objects of interest across multiple source domains. While there are many approaches established for performing DG for the task of classification, there has been a very little focus on object detection. In this paper, we propose a domain penalisation (DP) framework for the task of object detection, where the data is assumed to be sampled from multiple source domains and tested on completely unseen test domains. We assign penalisation weights to each domain, with the values updated based on the detection networks performance on the respective source domains. By prioritising the domains that needs more attention, our approach effectively balances the training process. We evaluate our solution on the GWHD 2021 dataset, a component of the WiLDS benchmark and we compare against ERM and GroupDRO as these are primarily loss function based. Our extensive experimental results reveals that the proposed approach improves the accuracy by 0.3 percent and 0.5 percent on validation and test out-of-distribution (OOD) sets, respectively for FasterRCNN. We also compare the performance of our approach on FCOS detector and show that our approach improves the baseline OOD performance over the existing approaches by 1.3 percent and 1.4 percent on validation and test sets, respectively. This study underscores the potential of performance based domain penalisation in enhancing the generalisation ability of object detection models across diverse environments.

Title: Advancing Green AI: Efficient and Accurate Lightweight CNNs for Rice Leaf Disease Identification

Authors: Khairun Saddami, Yudha Nurdin, Mutia Zahramita, Muhammad Shahreeza Safiruz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01752
Pdf URL: https://arxiv.org/pdf/2408.01752
Copy Paste: [[2408.01752]] Advancing Green AI: Efficient and Accurate Lightweight CNNs for Rice Leaf Disease Identification(https://arxiv.org/abs/2408.01752)
Keywords: security
Abstract: Rice plays a vital role as a primary food source for over half of the world's population, and its production is critical for global food security. Nevertheless, rice cultivation is frequently affected by various diseases that can severely decrease yield and quality. Therefore, early and accurate detection of rice diseases is necessary to prevent their spread and minimize crop losses. In this research, we explore three mobile-compatible CNN architectures, namely ShuffleNet, MobileNetV2, and EfficientNet-B0, for rice leaf disease classification. These models are selected due to their compatibility with mobile devices, as they demand less computational power and memory compared to other CNN models. To enhance the performance of the three models, we added two fully connected layers separated by a dropout layer. We used early stop creation to prevent the model from being overfiting. The results of the study showed that the best performance was achieved by the EfficientNet-B0 model with an accuracy of 99.8%. Meanwhile, MobileNetV2 and ShuffleNet only achieved accuracies of 84.21% and 66.51%, respectively. This study shows that EfficientNet-B0 when combined with the proposed layer and early stop, can produce a high-accuracy model. Keywords: rice leaf detection; green AI; smart agriculture; EfficientNet;

Title: Joint Model Pruning and Resource Allocation for Wireless Time-triggered Federated Learning

Authors: Xinlu Zhang, Yansha Deng, Toktam Mahmoodi
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2408.01765
Pdf URL: https://arxiv.org/pdf/2408.01765
Copy Paste: [[2408.01765]] Joint Model Pruning and Resource Allocation for Wireless Time-triggered Federated Learning(https://arxiv.org/abs/2408.01765)
Keywords: federate
Abstract: Time-triggered federated learning, in contrast to conventional event-based federated learning, organizes users into tiers based on fixed time intervals. However, this network still faces challenges due to a growing number of devices and limited wireless bandwidth, increasing issues like stragglers and communication overhead. In this paper, we apply model pruning to wireless Time-triggered systems and jointly study the problem of optimizing the pruning ratio and bandwidth allocation to minimize training loss under communication latency constraints. To solve this joint optimization problem, we perform a convergence analysis on the gradient $l_2$-norm of the asynchronous multi-tier federated learning (FL) model with adaptive model pruning. The convergence upper bound is derived and a joint optimization problem of pruning ratio and wireless bandwidth is defined to minimize the model training loss under a given communication latency constraint. The closed-form solutions for wireless bandwidth and pruning ratio by using KKT conditions are then formulated. As indicated in the simulation experiments, our proposed TT-Prune demonstrates a 40% reduction in communication cost, compared with the asynchronous multi-tier FL without model pruning, while maintaining the model convergence at the same level.

Title: MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Authors: Ruoyu Wang, Wenqian Wang, Jianjun Gao, Dan Lin, Kim-Hui Yap, Bingbing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01766
Pdf URL: https://arxiv.org/pdf/2408.01766
Copy Paste: [[2408.01766]] MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition(https://arxiv.org/abs/2408.01766)
Keywords: transformer
Abstract: Driver action recognition, aiming to accurately identify drivers' behaviours, is crucial for enhancing driver-vehicle interactions and ensuring driving safety. Unlike general action recognition, drivers' environments are often challenging, being gloomy and dark, and with the development of sensors, various cameras such as IR and depth cameras have emerged for analyzing drivers' behaviors. Therefore, in this paper, we propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car cabin videos and adaptively integrates different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal features integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for extracting modality-specific features and a Patch-wise Adaptive Fusion block for efficient cross-modal fusion. Extensive experiments are conducted on Drive&Act dataset and the results demonstrate the efficacy of our proposed approach.

Title: Comparison of Embedded Spaces for Deep Learning Classification

Authors: Stefan Scholl
Subjects: cs.LG, cs.CV, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2408.01767
Pdf URL: https://arxiv.org/pdf/2408.01767
Copy Paste: [[2408.01767]] Comparison of Embedded Spaces for Deep Learning Classification(https://arxiv.org/abs/2408.01767)
Keywords: explainability
Abstract: Embedded spaces are a key feature in deep learning. Good embedded spaces represent the data well to support classification and advanced techniques such as open-set recognition, few-short learning and explainability. This paper presents a compact overview of different techniques to design embedded spaces for classification. It compares different loss functions and constraints on the network parameters with respect to the achievable geometric structure of the embedded space. The techniques are demonstrated with two and three-dimensional embeddings for the MNIST, Fashion MNIST and CIFAR-10 datasets, allowing visual inspection of the embedded spaces.

Title: STDA: Spatio-Temporal Dual-Encoder Network Incorporating Driver Attention to Predict Driver Behaviors Under Safety-Critical Scenarios

Authors: Dongyang Xu, Yiran Luo, Tianle Lu, Qingfan Wang, Qing Zhou, Bingbing Nie
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01774
Pdf URL: https://arxiv.org/pdf/2408.01774
Copy Paste: [[2408.01774]] STDA: Spatio-Temporal Dual-Encoder Network Incorporating Driver Attention to Predict Driver Behaviors Under Safety-Critical Scenarios(https://arxiv.org/abs/2408.01774)
Keywords: robust, interpretability
Abstract: Accurate behavior prediction for vehicles is essential but challenging for autonomous driving. Most existing studies show satisfying performance under regular scenarios, but most neglected safety-critical scenarios. In this study, a spatio-temporal dual-encoder network named STDA for safety-critical scenarios was developed. Considering the exceptional capabilities of human drivers in terms of situational awareness and comprehending risks, driver attention was incorporated into STDA to facilitate swift identification of the critical regions, which is expected to improve both performance and interpretability. STDA contains four parts: the driver attention prediction module, which predicts driver attention; the fusion module designed to fuse the features between driver attention and raw images; the temporary encoder module used to enhance the capability to interpret dynamic scenes; and the behavior prediction module to predict the behavior. The experiment data are used to train and validate the model. The results show that STDA improves the G-mean from 0.659 to 0.719 when incorporating driver attention and adopting a temporal encoder module. In addition, extensive experimentation has been conducted to validate that the proposed module exhibits robust generalization capabilities and can be seamlessly integrated into other mainstream models.

Title: MathLearner: A Large Language Model Agent Framework for Learning to Solve Mathematical Problems

Authors: Wenbei Xie, Donglin Liu, Haoran Yan, Wenjie Wu, Zongyang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01779
Pdf URL: https://arxiv.org/pdf/2408.01779
Copy Paste: [[2408.01779]] MathLearner: A Large Language Model Agent Framework for Learning to Solve Mathematical Problems(https://arxiv.org/abs/2408.01779)
Keywords: large language model
Abstract: With the development of artificial intelligence (AI), large language models (LLM) are widely used in many fields. However, the reasoning ability of LLM is still very limited when it comes to mathematical reasoning. Mathematics plays an important role in all aspects of human society and is a technical guarantee in the fields of healthcare, transport and aerospace, for this reason, the development of AI big language models in the field of mathematics has great potential significance. To improve the mathematical reasoning ability of large language models, we proposed an agent framework for learning to solve mathematical problems based on inductive reasoning. By emulating the human learning process of generalization of learned information and effective application of previous knowledge in new reasoning tasks, this framework has great performance in the mathematical reasoning process. It improves global accuracy over the baseline method (chain-of-thought) by 20.96% and solves 17.54% of the mathematical problems that the baseline cannot solve. Benefiting from the efficient RETRIEVAL method, our model improves the ability of large language models to efficiently use external knowledge, i.e., the mathematical computation of the model can be based on written procedures. In education, our model can be used as a personalised learning aid, thus reducing the inequality of educational resources.

Title: Towards an ontology of state actors in cyberspace

Authors: Giacomo De Colle
Subjects: cs.CR, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2408.01787
Pdf URL: https://arxiv.org/pdf/2408.01787
Copy Paste: [[2408.01787]] Towards an ontology of state actors in cyberspace(https://arxiv.org/abs/2408.01787)
Keywords: security, extraction
Abstract: To improve cyber threat analysis practices in cybersecurity, I present a plan to build a formal ontological representation of state actors in cyberspace and of cyber operations. I argue that modelling these phenomena via ontologies allows for coherent integration of data coming from diverse sources, automated reasoning over such data, as well as intelligence extraction and reuse from and of them. Existing ontological tools in cybersecurity can be ameliorated by connecting them to neighboring domains such as law, regulations, governmental institutions, and documents. In this paper, I propose metrics to evaluate currently existing ontological tools to create formal representations in the cybersecurity domain, and I provide a plan to develop and extend them when they are lacking.

Title: Optimizing Intrusion Detection System Performance Through Synergistic Hyperparameter Tuning and Advanced Data Processing

Authors: Samia Saidane, Francesco Telch, Kussai Shahin, Fabrizio Granelli
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.01792
Pdf URL: https://arxiv.org/pdf/2408.01792
Copy Paste: [[2408.01792]] Optimizing Intrusion Detection System Performance Through Synergistic Hyperparameter Tuning and Advanced Data Processing(https://arxiv.org/abs/2408.01792)
Keywords: robust
Abstract: Intrusion detection is vital for securing computer networks against malicious activities. Traditional methods struggle to detect complex patterns and anomalies in network traffic effectively. To address this issue, we propose a system combining deep learning, data balancing (K-means + SMOTE), high-dimensional reduction (PCA and FCBF), and hyperparameter optimization (Extra Trees and BO-TPE) to enhance intrusion detection performance. By training on extensive datasets like CIC IDS 2018 and CIC IDS 2017, our models demonstrate robust performance and generalization. Notably, the ensemble model "VGG19" consistently achieves remarkable accuracy (99.26% on CIC-IDS2017 and 99.22% on CSE-CIC-IDS2018), outperforming other models.

Title: MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Authors: Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01800
Pdf URL: https://arxiv.org/pdf/2408.01800
Copy Paste: [[2408.01800]] MiniCPM-V: A GPT-4V Level MLLM on Your Phone(https://arxiv.org/abs/2408.01800)
Keywords: privacy, protect, large language model
Abstract: The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.

Title: STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs

Authors: Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo, Xiaowen Chu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2408.01803
Pdf URL: https://arxiv.org/pdf/2408.01803
Copy Paste: [[2408.01803]] STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs(https://arxiv.org/abs/2408.01803)
Keywords: large language model
Abstract: In this paper, we present STBLLM, the first structural binarization framework for compressing Large Language Models (LLMs) to less than 1-bit precision. LLMs have achieved remarkable performance, but their heavy memory requirements have hindered widespread adoption, particularly on resource-constrained devices. Binarization, which quantifies weights to a mere 1-bit, achieves a milestone in increasing computational efficiency. However, we observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation, indicating the potential for further compression. To exploit this, our STBLLM employs an N:M sparsity to perform structural binarization of the weights. First, we introduce a new Standardized Importance (SI) metric that considers weight magnitude and input feature norm to better evaluate weight significance. Then, we propose a layer-wise approach where different layers of the LLM can be sparsified with varying N:M ratios, balancing compression and accuracy. Finally, we use residual approximation with double binarization to preserve information for salient weights. In addition, we utilize a fine-grained grouping strategy for less important weights that applies different quantization schemes to sparse, intermediate, and dense regions. We conduct extensive experiments on various language models, including the LLaMA-1/2/3, OPT family, and Mistral, to evaluate the effectiveness of STBLLM. The results demonstrate that our approach performs better than other compressed binarization LLM methods while significantly reducing memory requirements.

Title: ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features

Authors: Peng Cheng, Yuwei Wang, Peng Huang, Zhongjie Ba, Xiaodong Lin, Feng Lin, Li Lu, Kui Ren
Subjects: cs.CR, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2408.01808
Pdf URL: https://arxiv.org/pdf/2408.01808
Copy Paste: [[2408.01808]] ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features(https://arxiv.org/abs/2408.01808)
Keywords: attack, robust
Abstract: Extensive research has revealed that adversarial examples (AE) pose a significant threat to voice-controllable smart devices. Recent studies have proposed black-box adversarial attacks that require only the final transcription from an automatic speech recognition (ASR) system. However, these attacks typically involve many queries to the ASR, resulting in substantial costs. Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. Building on this observation, we propose ALIF, the first black-box adversarial linguistic feature-based attack pipeline. We leverage the reciprocal process of text-to-speech (TTS) and ASR models to generate perturbations in the linguistic embedding space where the decision boundary resides. Based on the ALIF pipeline, we present the ALIF-OTL and ALIF-OTA schemes for launching attacks in both the digital domain and the physical playback environment on four commercial ASRs and voice assistants. Extensive evaluations demonstrate that ALIF-OTL and -OTA significantly improve query efficiency by 97.7% and 73.3%, respectively, while achieving competitive performance compared to existing methods. Notably, ALIF-OTL can generate an attack sample with only one query. Furthermore, our test-of-time experiment validates the robustness of our approach against ASR updates.

Title: SkyDiffusion: Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm

Authors: Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Jinhua Yu, Haote Yang, Conghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01812
Pdf URL: https://arxiv.org/pdf/2408.01812
Copy Paste: [[2408.01812]] SkyDiffusion: Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm(https://arxiv.org/abs/2408.01812)
Keywords: diffusion
Abstract: Street-to-satellite image synthesis focuses on generating realistic satellite images from corresponding ground street-view images while maintaining a consistent content layout, similar to looking down from the sky. The significant differences in perspectives create a substantial domain gap between the views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing satellite images from street-view images, leveraging diffusion models and Bird's Eye View (BEV) paradigm. First, we design a Curved-BEV method to transform street-view images to the satellite view, reformulating the challenging cross-domain image synthesis task into a conditional generation problem. Curved-BEV also includes a "Multi-to-One" mapping strategy for combining multiple street-view images within the same satellite coverage area, effectively solving the occlusion issues in dense urban scenes. Next, we design a BEV-controlled diffusion model to generate satellite images consistent with the street-view content, which also incorporates a light manipulation module to optimize the lighting condition of the synthesized image using a reference satellite. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on both suburban (CVUSA & CVACT) and urban (VIGOR-Chicago) cross-view datasets, with an average SSIM increase of 14.5% and a FID reduction of 29.6%, achieving realistic and content-consistent satellite image generation. The code and models of this work will be released at this https URL.

Title: GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Authors: Yihong Lin, Lingyu Xiong, Xiandong Li, Wenxiong Kang, Xianjia Wu, Liang Peng, Songju Lei, Huang Xu, Zhaoxin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01826
Pdf URL: https://arxiv.org/pdf/2408.01826
Copy Paste: [[2408.01826]] GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer(https://arxiv.org/abs/2408.01826)
Keywords: diffusion, transformer
Abstract: 3D speech-driven facial animation generation has received much attention in both industrial applications and academic research. Since the non-verbal facial cues that exist across the face in reality are non-deterministic, the generated results should be diverse. However, most recent methods are deterministic models that cannot learn a many-to-many mapping between audio and facial motion to generate diverse facial animations. To address this problem, we propose GLDiTalker, which introduces a motion prior along with some stochasticity to reduce the uncertainty of cross-modal mapping while increasing non-determinacy of the non-verbal facial cues that reside throughout the face. Particularly, GLDiTalker uses VQ-VAE to map facial motion mesh sequences into latent space in the first stage, and then iteratively adds and removes noise to the latent facial motion features in the second stage. In order to integrate different levels of spatial information, the Spatial Pyramidal SpiralConv Encoder is also designed to extract multi-scale features. Extensive qualitative and quantitative experiments demonstrate that our method achieves the state-of-the-art performance.

Title: TS-SAM: Fine-Tuning Segment-Anything Model for Downstream Tasks

Authors: Yang Yu, Chen Xu, Kai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01835
Pdf URL: https://arxiv.org/pdf/2408.01835
Copy Paste: [[2408.01835]] TS-SAM: Fine-Tuning Segment-Anything Model for Downstream Tasks(https://arxiv.org/abs/2408.01835)
Keywords: segmentation
Abstract: Adapter based fine-tuning has been studied for improving the performance of SAM on downstream tasks. However, there is still a significant performance gap between fine-tuned SAMs and domain-specific models. To reduce the gap, we propose Two-Stream SAM (TS-SAM). On the one hand, inspired by the side network in Parameter-Efficient Fine-Tuning (PEFT), we designed a lightweight Convolutional Side Adapter (CSA), which integrates the powerful features from SAM into side network training for comprehensive feature fusion. On the other hand, in line with the characteristics of segmentation tasks, we designed Multi-scale Refinement Module (MRM) and Feature Fusion Decoder (FFD) to keep both the detailed and semantic features. Extensive experiments on ten public datasets from three tasks demonstrate that TS-SAM not only significantly outperforms the recently proposed SAM-Adapter and SSOM, but achieves competitive performance with the SOTA domain-specific models. Our code is available at: this https URL.

Title: Supervised Image Translation from Visible to Infrared Domain for Object Detection

Authors: Prahlad Anand, Qiranul Saadiyean, Aniruddh Sikdar, Nalini N, Suresh Sundaram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01843
Pdf URL: https://arxiv.org/pdf/2408.01843
Copy Paste: [[2408.01843]] Supervised Image Translation from Visible to Infrared Domain for Object Detection(https://arxiv.org/abs/2408.01843)
Keywords: generative
Abstract: This study aims to learn a translation from visible to infrared imagery, bridging the domain gap between the two modalities so as to improve accuracy on downstream tasks including object detection. Previous approaches attempt to perform bi-domain feature fusion through iterative optimization or end-to-end deep convolutional networks. However, we pose the problem as similar to that of image translation, adopting a two-stage training strategy with a Generative Adversarial Network and an object detection model. The translation model learns a conversion that preserves the structural detail of visible images while preserving the texture and other characteristics of infrared images. Images so generated are used to train standard object detection frameworks including Yolov5, Mask and Faster RCNN. We also investigate the usefulness of integrating a super-resolution step into our pipeline to further improve model accuracy, and achieve an improvement of as high as 5.3% mAP.

Title: Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly

Authors: Peyman Hosseini, Ignacio Castro, Iacopo Ghinassi, Matthew Purver
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01866
Pdf URL: https://arxiv.org/pdf/2408.01866
Copy Paste: [[2408.01866]] Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly(https://arxiv.org/abs/2408.01866)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in comprehending and analyzing lengthy sequential inputs, owing to their extensive context windows that allow processing millions of tokens in a single forward pass. However, this paper uncovers a surprising limitation: LLMs fall short when handling long input sequences. We investigate this issue using three datasets and two tasks (sentiment analysis and news categorization) across various LLMs, including Claude 3, Gemini Pro, GPT 3.5 Turbo, Llama 3 Instruct, and Mistral Instruct models. To address this limitation, we propose and evaluate ad-hoc solutions that substantially enhance LLMs' performance on long input sequences by up to 50%, while reducing API cost and latency by up to 93% and 50%, respectively.

Title: MALADE: Orchestration of LLM-powered Agents with Retrieval Augmented Generation for Pharmacovigilance

Authors: Jihye Choi, Nils Palumbo, Prasad Chalasani, Matthew M. Engelhard, Somesh Jha, Anivarya Kumar, David Page
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.MA, q-bio.QM
Abstract URL: https://arxiv.org/abs/2408.01869
Pdf URL: https://arxiv.org/pdf/2408.01869
Copy Paste: [[2408.01869]] MALADE: Orchestration of LLM-powered Agents with Retrieval Augmented Generation for Pharmacovigilance(https://arxiv.org/abs/2408.01869)
Keywords: extraction, large language model
Abstract: In the era of Large Language Models (LLMs), given their remarkable text understanding and generation abilities, there is an unprecedented opportunity to develop new, LLM-based methods for trustworthy medical knowledge synthesis, extraction and summarization. This paper focuses on the problem of Pharmacovigilance (PhV), where the significance and challenges lie in identifying Adverse Drug Events (ADEs) from diverse text sources, such as medical literature, clinical notes, and drug labels. Unfortunately, this task is hindered by factors including variations in the terminologies of drugs and outcomes, and ADE descriptions often being buried in large amounts of narrative text. We present MALADE, the first effective collaborative multi-agent system powered by LLM with Retrieval Augmented Generation for ADE extraction from drug label data. This technique involves augmenting a query to an LLM with relevant information extracted from text resources, and instructing the LLM to compose a response consistent with the augmented data. MALADE is a general LLM-agnostic architecture, and its unique capabilities are: (1) leveraging a variety of external sources, such as medical literature, drug labels, and FDA tools (e.g., OpenFDA drug information API), (2) extracting drug-outcome association in a structured format along with the strength of the association, and (3) providing explanations for established associations. Instantiated with GPT-4 Turbo or GPT-4o, and FDA drug label data, MALADE demonstrates its efficacy with an Area Under ROC Curve of 0.90 against the OMOP Ground Truth table of ADEs. Our implementation leverages the Langroid multi-agent LLM framework and can be found at this https URL.

Title: Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval

Authors: Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01875
Pdf URL: https://arxiv.org/pdf/2408.01875
Copy Paste: [[2408.01875]] Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval(https://arxiv.org/abs/2408.01875)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have enabled autonomous agents with complex reasoning and task-fulfillment capabilities using a wide range of tools. However, effectively identifying the most relevant tools for a given task becomes a key bottleneck as the toolset size grows, hindering reliable tool utilization. To address this, we introduce Re-Invoke, an unsupervised tool retrieval method designed to scale effectively to large toolsets without training. Specifically, we first generate a diverse set of synthetic queries that comprehensively cover different aspects of the query space associated with each tool document during the tool indexing phase. Second, we leverage LLM's query understanding capabilities to extract key tool-related context and underlying intents from user queries during the inference phase. Finally, we employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query. Our evaluation demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios, all within a fully unsupervised setting. Notably, on the ToolE datasets, we achieve a 20% relative improvement in nDCG@5 for single-tool retrieval and a 39% improvement for multi-tool retrieval.

Title: Cross-layer Attention Sharing for Large Language Models

Authors: Yongyu Mu, Yuzhang Wu, Yuchun Fan, Chenglong Wang, Hengyu Li, Qiaozhi He, Murun Yang, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01890
Pdf URL: https://arxiv.org/pdf/2408.01890
Copy Paste: [[2408.01890]] Cross-layer Attention Sharing for Large Language Models(https://arxiv.org/abs/2408.01890)
Keywords: large language model
Abstract: As large language models (LLMs) evolve, the increase in model depth and parameter number leads to substantial redundancy. To enhance the efficiency of the attention mechanism, previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It's intuitive to save the computation by sharing attention weights across layers. However, further analysis reveals two challenges: (1) Directly sharing the weight matrix without carefully rearranging the attention heads proves to be ineffective; (2) Shallow layers are vulnerable to small deviations in attention weights. Driven by these insights, we introduce LiSA, a lightweight substitute for self-attention in well-trained LLMs. LiSA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations encompassing 13 typical benchmarks demonstrate that LiSA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations within 53-84% of the total layers. Our implementations of LiSA achieve a 6X compression of Q and K, with maximum throughput improvements of 19.5% for LLaMA3-8B and 32.3% for LLaMA2-7B.

Title: Remote Staking with Economic Safety

Authors: Xinshu Dong, Orfeas Stefanos Thyfronitis Litos, Ertem Nusret Tas, David Tse, Robin Linus Woll, Lei Yang, Mingchao Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.01896
Pdf URL: https://arxiv.org/pdf/2408.01896
Copy Paste: [[2408.01896]] Remote Staking with Economic Safety(https://arxiv.org/abs/2408.01896)
Keywords: secure, security
Abstract: Proof-of-stake (PoS) blockchains require validators to lock their tokens as collateral, slashing these tokens if they are identified as protocol violators. PoS chains have mostly been secured by their native tokens. However, using only the native token upper-bounds the value eligible for staking by the market capitalization of the native token. In contrast, the remote staking of another crypto asset from a provider chain provides an avenue to improve the consumer chain's economic security. In this paper, we present the first known remote staking protocols with guaranteed optimal economic safety: whenever there is a safety violation on the consumer chain, at least one third of the provider's stake securing the consumer chain is slashed. To achieve this goal for a broad range of provider and consumer chains, two independent contributions are made: 1) a remote unbonding protocol that ensures slashing before the stake is unbonded on the provider chain if there is safety violation on the consumer chain; 2) a protocol to slash stake even without smart contracts on the provider chain. The remote staking protocol is analyzed and implemented in the case where the provider chain is Bitcoin and the consumer chain is a Cosmos SDK chain running the Tendermint consensus protocol.

Title: CAF-YOLO: A Robust Framework for Multi-Scale Lesion Detection in Biomedical Imagery

Authors: Zilin Chen, Shengnan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01897
Pdf URL: https://arxiv.org/pdf/2408.01897
Copy Paste: [[2408.01897]] CAF-YOLO: A Robust Framework for Multi-Scale Lesion Detection in Biomedical Imagery(https://arxiv.org/abs/2408.01897)
Keywords: robust, transformer
Abstract: Object detection is of paramount importance in biomedical image analysis, particularly for lesion identification. While current methodologies are proficient in identifying and pinpointing lesions, they often lack the precision needed to detect minute biomedical entities (e.g., abnormal cells, lung nodules smaller than 3 mm), which are critical in blood and lung pathology. To address this challenge, we propose CAF-YOLO, based on the YOLOv8 architecture, a nimble yet robust method for medical object detection that leverages the strengths of convolutional neural networks (CNNs) and transformers. To overcome the limitation of convolutional kernels, which have a constrained capacity to interact with distant information, we introduce an attention and convolution fusion module (ACFM). This module enhances the modeling of both global and local features, enabling the capture of long-term feature dependencies and spatial autocorrelation. Additionally, to improve the restricted single-scale feature aggregation inherent in feed-forward networks (FFN) within transformer architectures, we design a multi-scale neural network (MSNN). This network improves multi-scale information aggregation by extracting features across diverse scales. Experimental evaluations on widely used datasets, such as BCCD and LUNA16, validate the rationale and efficacy of CAF-YOLO. This methodology excels in detecting and precisely locating diverse and intricate micro-lesions within biomedical imagery. Our codes are available at this https URL.

Title: DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models

Authors: Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, Hajime Nagahara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01933
Pdf URL: https://arxiv.org/pdf/2408.01933
Copy Paste: [[2408.01933]] DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models(https://arxiv.org/abs/2408.01933)
Keywords: interpretability, large language model
Abstract: Large language models (LLMs) have recently showcased remarkable capabilities, spanning a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel in medical question answering but may face challenges in the lack of interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), aiming at evaluating the reasoning ability and interpretability of LLMs compared to human doctors. It contains 521 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to offer essential knowledge for reasoning, which may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT bring out a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios.

Title: A Survey and Evaluation of Adversarial Attacks for Object Detection

Authors: Khoi Nguyen Tiet Nguyen, Wenyu Zhang, Kangkang Lu, Yuhuan Wu, Xingjian Zheng, Hui Li Tan, Liangli Zhen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01934
Pdf URL: https://arxiv.org/pdf/2408.01934
Copy Paste: [[2408.01934]] A Survey and Evaluation of Adversarial Attacks for Object Detection(https://arxiv.org/abs/2408.01934)
Keywords: security, attack, robust
Abstract: Deep learning models excel in various computer vision tasks but are susceptible to adversarial examples-subtle perturbations in input data that lead to incorrect predictions. This vulnerability poses significant risks in safety-critical applications such as autonomous vehicles, security surveillance, and aircraft health monitoring. While numerous surveys focus on adversarial attacks in image classification, the literature on such attacks in object detection is limited. This paper offers a comprehensive taxonomy of adversarial attacks specific to object detection, reviews existing adversarial robustness evaluation metrics, and systematically assesses open-source attack methods and model robustness. Key observations are provided to enhance the understanding of attack effectiveness and corresponding countermeasures. Additionally, we identify crucial research challenges to guide future efforts in securing automated object detection systems.

Title: Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference

Authors: Ke Shen, Mayank Kejriwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01935
Pdf URL: https://arxiv.org/pdf/2408.01935
Copy Paste: [[2408.01935]] Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference(https://arxiv.org/abs/2408.01935)
Keywords: generative, large language model
Abstract: Despite their impressive performance, large language models (LLMs) such as ChatGPT are known to pose important risks. One such set of risks arises from misplaced confidence, whether over-confidence or under-confidence, that the models have in their inference. While the former is well studied, the latter is not, leading to an asymmetry in understanding the comprehensive risk of the model based on misplaced confidence. In this paper, we address this asymmetry by defining two types of risk (decision and composite risk), and proposing an experimental framework consisting of a two-level inference architecture and appropriate metrics for measuring such risks in both discriminative and generative LLMs. The first level relies on a decision rule that determines whether the underlying language model should abstain from inference. The second level (which applies if the model does not abstain) is the model's inference. Detailed experiments on four natural language commonsense reasoning datasets using both an open-source ensemble-based RoBERTa model and ChatGPT, demonstrate the practical utility of the evaluation framework. For example, our results show that our framework can get an LLM to confidently respond to an extra 20.1% of low-risk inference tasks that other methods might misclassify as high-risk, and skip 19.8% of high-risk tasks, which would have been answered incorrectly.

Title: RobNODDI: Robust NODDI Parameter Estimation with Adaptive Sampling under Continuous Representation

Authors: Taohui Xiao, Jian Cheng, Wenxin Fan, Jing Yang, Cheng Li, Enqing Dong, Shanshan Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.01944
Pdf URL: https://arxiv.org/pdf/2408.01944
Copy Paste: [[2408.01944]] RobNODDI: Robust NODDI Parameter Estimation with Adaptive Sampling under Continuous Representation(https://arxiv.org/abs/2408.01944)
Keywords: robust, diffusion
Abstract: Neurite Orientation Dispersion and Density Imaging (NODDI) is an important imaging technology used to evaluate the microstructure of brain tissue, which is of great significance for the discovery and treatment of various neurological diseases. Current deep learning-based methods perform parameter estimation through diffusion magnetic resonance imaging (dMRI) with a small number of diffusion gradients. These methods speed up parameter estimation and improve accuracy. However, the diffusion directions used by most existing deep learning models during testing needs to be strictly consistent with the diffusion directions during training. This results in poor generalization and robustness of deep learning models in dMRI parameter estimation. In this work, we verify for the first time that the parameter estimation performance of current mainstream methods will significantly decrease when the testing diffusion directions and the training diffusion directions are inconsistent. A robust NODDI parameter estimation method with adaptive sampling under continuous representation (RobNODDI) is proposed. Furthermore, long short-term memory (LSTM) units and fully connected layers are selected to learn continuous representation signals. To this end, we use a total of 100 subjects to conduct experiments based on the Human Connectome Project (HCP) dataset, of which 60 are used for training, 20 are used for validation, and 20 are used for testing. The test results indicate that RobNODDI improves the generalization performance and robustness of the deep learning model, enhancing the stability and flexibility of deep learning NODDI parameter estimatimation applications.

Title: CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization

Authors: Xiang He, Xiangxi Liu, Yang Li, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01952
Pdf URL: https://arxiv.org/pdf/2408.01952
Copy Paste: [[2408.01952]] CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization(https://arxiv.org/abs/2408.01952)
Keywords: extraction
Abstract: The audio-visual event localization task requires identifying concurrent visual and auditory events from unconstrained videos within a network model, locating them, and classifying their category. The efficient extraction and integration of audio and visual modal information have always been challenging in this field. In this paper, we introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information. We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance between audio and visual information, thus reducing inconsistencies between modalities. Moreover, we have observed that existing methods have difficulty distinguishing between similar background and event and lack the fine-grained features for event classification. Consequently, we employ background-event contrast enhancement to increase the discrimination of fused feature and fine-tuned pre-trained model to extract more refined and discernible features from complex multimodal inputs. Specifically, we have enhanced the model's ability to discern subtle differences between event and background and improved the accuracy of event classification in our model. Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task, proving the effectiveness of our proposed methods in handling complex multimodal learning and event localization in unconstrained videos. Code is available at this https URL.

Title: Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Authors: Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe
Subjects: cs.CV, cs.AI, cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01959
Pdf URL: https://arxiv.org/pdf/2408.01959
Copy Paste: [[2408.01959]] Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI(https://arxiv.org/abs/2408.01959)
Keywords: diffusion
Abstract: Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.

Title: AnomalySD: Few-Shot Multi-Class Anomaly Detection with Stable Diffusion Model

Authors: Zhenyu Yan, Qingqing Fang, Wenxi Lv, Qinliang Su
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01960
Pdf URL: https://arxiv.org/pdf/2408.01960
Copy Paste: [[2408.01960]] AnomalySD: Few-Shot Multi-Class Anomaly Detection with Stable Diffusion Model(https://arxiv.org/abs/2408.01960)
Keywords: privacy, diffusion, segmentation
Abstract: Anomaly detection is a critical task in industrial manufacturing, aiming to identify defective parts of products. Most industrial anomaly detection methods assume the availability of sufficient normal data for training. This assumption may not hold true due to the cost of labeling or data privacy policies. Additionally, mainstream methods require training bespoke models for different objects, which incurs heavy costs and lacks flexibility in practice. To address these issues, we seek help from Stable Diffusion (SD) model due to its capability of zero/few-shot inpainting, which can be leveraged to inpaint anomalous regions as normal. In this paper, a few-shot multi-class anomaly detection framework that adopts Stable Diffusion model is proposed, named AnomalySD. To adapt SD to anomaly detection task, we design different hierarchical text descriptions and the foreground mask mechanism for fine-tuning SD. In the inference stage, to accurately mask anomalous regions for inpainting, we propose multi-scale mask strategy and prototype-guided mask strategy to handle diverse anomalous regions. Hierarchical text prompts are also utilized to guide the process of inpainting in the inference stage. The anomaly score is estimated based on inpainting result of all masks. Extensive experiments on the MVTec-AD and VisA datasets demonstrate the superiority of our approach. We achieved anomaly classification and segmentation results of 93.6%/94.8% AUROC on the MVTec-AD dataset and 86.1%/96.5% AUROC on the VisA dataset under multi-class and one-shot settings.

Title: A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios

Authors: Samuel Ackerman, Ella Rabinovich, Eitan Farchi, Ateret Anaby-Tavor
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2408.01963
Pdf URL: https://arxiv.org/pdf/2408.01963
Copy Paste: [[2408.01963]] A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios(https://arxiv.org/abs/2408.01963)
Keywords: robust, large language model
Abstract: We evaluate the robustness of several large language models on multiple datasets. Robustness here refers to the relative insensitivity of the model's answers to meaning-preserving variants of their input. Benchmark datasets are constructed by introducing naturally-occurring, non-malicious perturbations, or by generating semantically equivalent paraphrases of input questions or statements. We further propose a novel metric for assessing a model robustness, and demonstrate its benefits in the non-adversarial scenario by empirical evaluation of several models on the created datasets.

Title: Top K Enhanced Reinforcement Learning Attacks on Heterogeneous Graph Node Classification

Authors: Honglin Gao, Gaoxi Xiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01964
Pdf URL: https://arxiv.org/pdf/2408.01964
Copy Paste: [[2408.01964]] Top K Enhanced Reinforcement Learning Attacks on Heterogeneous Graph Node Classification(https://arxiv.org/abs/2408.01964)
Keywords: defense, attack, robust
Abstract: Graph Neural Networks (GNNs) have attracted substantial interest due to their exceptional performance on graph-based data. However, their robustness, especially on heterogeneous graphs, remains underexplored, particularly against adversarial attacks. This paper proposes HeteroKRLAttack, a targeted evasion black-box attack method for heterogeneous graphs. By integrating reinforcement learning with a Top-K algorithm to reduce the action space, our method efficiently identifies effective attack strategies to disrupt node classification tasks. We validate the effectiveness of HeteroKRLAttack through experiments on multiple heterogeneous graph datasets, showing significant reductions in classification accuracy compared to baseline methods. An ablation study underscores the critical role of the Top-K algorithm in enhancing attack performance. Our findings highlight potential vulnerabilities in current models and provide guidance for future defense strategies against adversarial attacks on heterogeneous graphs.

Title: Optimal and efficient text counterfactuals using Graph Neural Networks

Authors: Dimitris Lymperopoulos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.01969
Pdf URL: https://arxiv.org/pdf/2408.01969
Copy Paste: [[2408.01969]] Optimal and efficient text counterfactuals using Graph Neural Networks(https://arxiv.org/abs/2408.01969)
Keywords: interpretability, explainability
Abstract: As NLP models become increasingly integral to decision-making processes, the need for explainability and interpretability has become paramount. In this work, we propose a framework that achieves the aforementioned by generating semantically edited inputs, known as counterfactual interventions, which change the model prediction, thus providing a form of counterfactual explanations for the model. We test our framework on two NLP tasks - binary sentiment classification and topic classification - and show that the generated edits are contrastive, fluent and minimal, while the whole process remains significantly faster that other state-of-the-art counterfactual editors.

Title: Single-Point Supervised High-Resolution Dynamic Network for Infrared Small Target Detection

Authors: Jing Wu, Rixiang Ni, Feng Huang, Zhaobing Qiu, Liqiong Chen, Changhai Luo, Yunxiang Li, Youli Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01976
Pdf URL: https://arxiv.org/pdf/2408.01976
Copy Paste: [[2408.01976]] Single-Point Supervised High-Resolution Dynamic Network for Infrared Small Target Detection(https://arxiv.org/abs/2408.01976)
Keywords: extraction
Abstract: Infrared small target detection (IRSTD) tasks are extremely challenging for two main reasons: 1) it is difficult to obtain accurate labelling information that is critical to existing methods, and 2) infrared (IR) small target information is easily lost in deep networks. To address these issues, we propose a single-point supervised high-resolution dynamic network (SSHD-Net). In contrast to existing methods, we achieve state-of-the-art (SOTA) detection performance using only single-point supervision. Specifically, we first design a high-resolution cross-feature extraction module (HCEM), that achieves bi-directional feature interaction through stepped feature cascade channels (SFCC). It balances network depth and feature resolution to maintain deep IR small-target information. Secondly, the effective integration of global and local features is achieved through the dynamic coordinate fusion module (DCFM), which enhances the anti-interference ability in complex backgrounds. In addition, we introduce the high-resolution multilevel residual module (HMRM) to enhance the semantic information extraction capability. Finally, we design the adaptive target localization detection head (ATLDH) to improve detection accuracy. Experiments on the publicly available datasets NUDT-SIRST and IRSTD-1k demonstrate the effectiveness of our method. Compared to other SOTA methods, our method can achieve better detection performance with only a single point of supervision.

Title: Label Augmentation for Neural Networks Robustness

Authors: Fatemeh Amerehi, Patrick Healy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01977
Pdf URL: https://arxiv.org/pdf/2408.01977
Copy Paste: [[2408.01977]] Label Augmentation for Neural Networks Robustness(https://arxiv.org/abs/2408.01977)
Keywords: attack, robust
Abstract: Out-of-distribution generalization can be categorized into two types: common perturbations arising from natural variations in the real world and adversarial perturbations that are intentionally crafted to deceive neural networks. While deep neural networks excel in accuracy under the assumption of identical distributions between training and test data, they often encounter out-of-distribution scenarios resulting in a significant decline in accuracy. Data augmentation methods can effectively enhance robustness against common corruptions, but they typically fall short in improving robustness against adversarial perturbations. In this study, we develop Label Augmentation (LA), which enhances robustness against both common and intentional perturbations and improves uncertainty estimation. Our findings indicate a Clean error rate improvement of up to 23.29% when employing LA in comparisons to the baseline. Additionally, it enhances robustness under common corruptions benchmark by up to 24.23%. When tested against FGSM and PGD attacks, improvements in adversarial robustness are noticeable, with enhancements of up to 53.18% for FGSM and 24.46% for PGD attacks.

Title: AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning

Authors: Xin Wang, Kai Chen, Xingjun Ma, Zhineng Chen, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.01978
Pdf URL: https://arxiv.org/pdf/2408.01978
Copy Paste: [[2408.01978]] AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning(https://arxiv.org/abs/2408.01978)
Keywords: attack, robust
Abstract: Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks even under a black-box setting where the adversary can only query the model. Particularly, query-based black-box adversarial attacks estimate adversarial gradients based on the returned probability vectors of the target model for a sequence of queries. During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space. Motivated by this observation, stateful detection methods have been proposed to detect and reject query-based attacks. While demonstrating promising results, these methods either have been evaded by more advanced attacks or suffer from low efficiency in terms of the number of shots (queries) required to detect different attacks. Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same clean image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. With ACPT, we further introduce a detection framework AdvQDet that can detect 7 state-of-the-art query-based attacks with $>99\%$ detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks. Code is available at this https URL.

Title: DeMansia: Mamba Never Forgets Any Tokens

Authors: Ricky Fang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.01986
Pdf URL: https://arxiv.org/pdf/2408.01986
Copy Paste: [[2408.01986]] DeMansia: Mamba Never Forgets Any Tokens(https://arxiv.org/abs/2408.01986)
Keywords: transformer
Abstract: This paper examines the mathematical foundations of transformer architectures, highlighting their limitations particularly in handling long sequences. We explore prerequisite models such as Mamba, Vision Mamba (ViM), and LV-ViT that pave the way for our proposed architecture, DeMansia. DeMansia integrates state space models with token labeling techniques to enhance performance in image classification tasks, efficiently addressing the computational challenges posed by traditional transformers. The architecture, benchmark, and comparisons with contemporary models demonstrate DeMansia's effectiveness. The implementation of this paper is available on GitHub at this https URL

Title: Towards Automatic Hands-on-Keyboard Attack Detection Using LLMs in EDR Solutions

Authors: Amit Portnoy, Ehud Azikri, Shay Kels
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.01993
Pdf URL: https://arxiv.org/pdf/2408.01993
Copy Paste: [[2408.01993]] Towards Automatic Hands-on-Keyboard Attack Detection Using LLMs in EDR Solutions(https://arxiv.org/abs/2408.01993)
Keywords: security, attack, large language model
Abstract: Endpoint Detection and Remediation (EDR) platforms are essential for identifying and responding to cyber threats. This study presents a novel approach using Large Language Models (LLMs) to detect Hands-on-Keyboard (HOK) cyberattacks. Our method involves converting endpoint activity data into narrative forms that LLMs can analyze to distinguish between normal operations and potential HOK attacks. We address the challenges of interpreting endpoint data by segmenting narratives into windows and employing a dual training strategy. The results demonstrate that LLM-based models have the potential to outperform traditional machine learning methods, offering a promising direction for enhancing EDR capabilities and apply LLMs in cybersecurity.

Title: Reinforcement Learning for an Efficient and Effective Malware Investigation during Cyber Incident Response

Authors: Dipo Dunsin, Mohamed Chahine Ghanem, Karim Ouazzane, Vassil Vassilev
Subjects: cs.CR, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2408.01999
Pdf URL: https://arxiv.org/pdf/2408.01999
Copy Paste: [[2408.01999]] Reinforcement Learning for an Efficient and Effective Malware Investigation during Cyber Incident Response(https://arxiv.org/abs/2408.01999)
Keywords: robust
Abstract: This research focused on enhancing post-incident malware forensic investigation using reinforcement learning RL. We proposed an advanced MDP post incident malware forensics investigation model and framework to expedite post incident forensics. We then implement our RL Malware Investigation Model based on structured MDP within the proposed framework. To identify malware artefacts, the RL agent acquires and examines forensics evidence files, iteratively improving its capabilities using Q Table and temporal difference learning. The Q learning algorithm significantly improved the agent ability to identify malware. An epsilon greedy exploration strategy and Q learning updates enabled efficient learning and decision making. Our experimental testing revealed that optimal learning rates depend on the MDP environment complexity, with simpler environments benefiting from higher rates for quicker convergence and complex ones requiring lower rates for stability. Our model performance in identifying and classifying malware reduced malware analysis time compared to human experts, demonstrating robustness and adaptability. The study highlighted the significance of hyper parameter tuning and suggested adaptive strategies for complex environments. Our RL based approach produced promising results and is validated as an alternative to traditional methods notably by offering continuous learning and adaptation to new and evolving malware threats which ultimately enhance the post incident forensics investigations.

Title: AdaCBM: An Adaptive Concept Bottleneck Model for Explainable and Accurate Diagnosis

Authors: Townim F. Chowdhury, Vu Minh Hieu Phan, Kewen Liao, Minh-Son To, Yutong Xie, Anton van den Hengel, Johan W. Verjans, Zhibin Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02001
Pdf URL: https://arxiv.org/pdf/2408.02001
Copy Paste: [[2408.02001]] AdaCBM: An Adaptive Concept Bottleneck Model for Explainable and Accurate Diagnosis(https://arxiv.org/abs/2408.02001)
Keywords: explainability
Abstract: The integration of vision-language models such as CLIP and Concept Bottleneck Models (CBMs) offers a promising approach to explaining deep neural network (DNN) decisions using concepts understandable by humans, addressing the black-box concern of DNNs. While CLIP provides both explainability and zero-shot classification capability, its pre-training on generic image and text data may limit its classification accuracy and applicability to medical image diagnostic tasks, creating a transfer learning problem. To maintain explainability and address transfer learning needs, CBM methods commonly design post-processing modules after the bottleneck module. However, this way has been ineffective. This paper takes an unconventional approach by re-examining the CBM framework through the lens of its geometrical representation as a simple linear classification system. The analysis uncovers that post-CBM fine-tuning modules merely rescale and shift the classification outcome of the system, failing to fully leverage the system's learning potential. We introduce an adaptive module strategically positioned between CLIP and CBM to bridge the gap between source and downstream domains. This simple yet effective approach enhances classification performance while preserving the explainability afforded by the framework. Our work offers a comprehensive solution that encompasses the entire process, from concept discovery to model training, providing a holistic recipe for leveraging the strengths of GPT, CLIP, and CBM.

Title: LLaSA: Large Language and E-Commerce Shopping Assistant

Authors: Shuo Zhang, Boci Peng, Xinping Zhao, Boren Hu, Yun Zhu, Yanjia Zeng, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02006
Pdf URL: https://arxiv.org/pdf/2408.02006
Copy Paste: [[2408.02006]] LLaSA: Large Language and E-Commerce Shopping Assistant(https://arxiv.org/abs/2408.02006)
Keywords: secure, large language model
Abstract: The e-commerce platform has evolved rapidly due to its widespread popularity and convenience. Developing an e-commerce shopping assistant for customers is crucial to aiding them in quickly finding desired products and recommending precisely what they need. However, most previous shopping assistants face two main problems: (1) task-specificity, which necessitates the development of different models for various tasks, thereby increasing development costs and limiting effectiveness; and (2) poor generalization, where the trained model performs inadequately on up-to-date products. To resolve these issues, we employ Large Language Models (LLMs) to construct an omnipotent assistant, leveraging their adeptness at handling multiple tasks and their superior generalization capability. Nonetheless, LLMs lack inherent knowledge of e-commerce concepts. To address this, we create an instruction dataset comprising 65,000 samples and diverse tasks, termed as EshopInstruct. Through instruction tuning on our dataset, the assistant, named LLaSA, demonstrates the potential to function as an omnipotent assistant. Additionally, we propose various inference optimization strategies to enhance performance with limited inference resources. In the Amazon KDD Cup 2024 Challenge, our proposed method, LLaSA, achieved an overall ranking of 3rd place on ShopBench, including 57 tasks and approximately 20,000 questions, and we secured top-5 rankings in each track, especially in track4, where we achieved the best performance result among all student teams. Our extensive practices fully demonstrate that LLMs possess the great potential to be competent e-commerce shopping assistants.

Title: Personalized Federated Learning on Heterogeneous and Long-Tailed Data via Expert Collaborative Learning

Authors: Fengling Lv, Xinyi Shang, Yang Zhou, Yiqun Zhang, Mengke Li, Yang Lu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.02019
Pdf URL: https://arxiv.org/pdf/2408.02019
Copy Paste: [[2408.02019]] Personalized Federated Learning on Heterogeneous and Long-Tailed Data via Expert Collaborative Learning(https://arxiv.org/abs/2408.02019)
Keywords: federate
Abstract: Personalized Federated Learning (PFL) aims to acquire customized models for each client without disclosing raw data by leveraging the collective knowledge of distributed clients. However, the data collected in real-world scenarios is likely to follow a long-tailed distribution. For example, in the medical domain, it is more common for the number of general health notes to be much larger than those specifically relatedto certain diseases. The presence of long-tailed data can significantly degrade the performance of PFL models. Additionally, due to the diverse environments in which each client operates, data heterogeneity is also a classic challenge in federated learning. In this paper, we explore the joint problem of global long-tailed distribution and data heterogeneity in PFL and propose a method called Expert Collaborative Learning (ECL) to tackle this problem. Specifically, each client has multiple experts, and each expert has a different training subset, which ensures that each class, especially the minority classes, receives sufficient training. Multiple experts collaborate synergistically to produce the final prediction output. Without special bells and whistles, the vanilla ECL outperforms other state-of-the-art PFL methods on several benchmark datasets under different degrees of data heterogeneity and long-tailed distribution.

Title: Scenario-based Thermal Management Parametrization Through Deep Reinforcement Learning

Authors: Thomas Rudolf, Philip Muhl, Sören Hohmann, Lutz Eckstein
Subjects: cs.LG, cs.AI, cs.CE, eess.SY
Abstract URL: https://arxiv.org/abs/2408.02022
Pdf URL: https://arxiv.org/pdf/2408.02022
Copy Paste: [[2408.02022]] Scenario-based Thermal Management Parametrization Through Deep Reinforcement Learning(https://arxiv.org/abs/2408.02022)
Keywords: robust
Abstract: The thermal system of battery electric vehicles demands advanced control. Its thermal management needs to effectively control active components across varying operating conditions. While robust control function parametrization is required, current methodologies show significant drawbacks. They consume considerable time, human effort, and extensive real-world testing. Consequently, there is a need for innovative and intelligent solutions that are capable of autonomously parametrizing embedded controllers. Addressing this issue, our paper introduces a learning-based tuning approach. We propose a methodology that benefits from automated scenario generation for increased robustness across vehicle usage scenarios. Our deep reinforcement learning agent processes the tuning task context and incorporates an image-based interpretation of embedded parameter sets. We demonstrate its applicability to a valve controller parametrization task and verify it in real-world vehicle testing. The results highlight the competitive performance to baseline methods. This novel approach contributes to the shift towards virtual development of thermal management functions, with promising potential of large-scale parameter tuning in the automotive industry.

Title: A Smart City Infrastructure Ontology for Threats, Cybercrime, and Digital Forensic Investigation

Authors: Yee Ching Tok, Davis Zheng Yang, Sudipta Chattopadhyay
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02023
Pdf URL: https://arxiv.org/pdf/2408.02023
Copy Paste: [[2408.02023]] A Smart City Infrastructure Ontology for Threats, Cybercrime, and Digital Forensic Investigation(https://arxiv.org/abs/2408.02023)
Keywords: attack
Abstract: Cybercrime and the market for cyber-related compromises are becoming attractive revenue sources for state-sponsored actors, cybercriminals and technical individuals affected by financial hardships. Due to burgeoning cybercrime on new technological frontiers, efforts have been made to assist digital forensic investigators (DFI) and law enforcement agencies (LEA) in their investigative efforts. Forensic tool innovations and ontology developments, such as the Unified Cyber Ontology (UCO) and Cyber-investigation Analysis Standard Expression (CASE), have been proposed to assist DFI and LEA. Although these tools and ontologies are useful, they lack extensive information sharing and tool interoperability features, and the ontologies lack the latest Smart City Infrastructure (SCI) context that was proposed. To mitigate the weaknesses in both solutions and to ensure a safer cyber-physical environment for all, we propose the Smart City Ontological Paradigm Expression (SCOPE), an expansion profile of the UCO and CASE ontology that implements SCI threat models, SCI digital forensic evidence, attack techniques, patterns and classifications from MITRE. We showcase how SCOPE could present complex data such as SCI-specific threats, cybercrime, investigation data and incident handling workflows via an incident scenario modelled after publicly reported real-world incidents attributed to Advanced Persistent Threat (APT) groups. We also make SCOPE available to the community so that threats, digital evidence and cybercrime in emerging trends such as SCI can be identified, represented, and shared collaboratively.

Title: Faster Diffusion Action Segmentation

Authors: Shuaibing Wang, Shunli Wang, Mingcheng Li, Dingkang Yang, Haopeng Kuang, Ziyun Qian, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02024
Pdf URL: https://arxiv.org/pdf/2408.02024
Copy Paste: [[2408.02024]] Faster Diffusion Action Segmentation(https://arxiv.org/abs/2408.02024)
Keywords: diffusion, transformer, segmentation
Abstract: Temporal Action Segmentation (TAS) is an essential task in video analysis, aiming to segment and classify continuous frames into distinct action segments. However, the ambiguous boundaries between actions pose a significant challenge for high-precision segmentation. Recent advances in diffusion models have demonstrated substantial success in TAS tasks due to their stable training process and high-quality generation capabilities. However, the heavy sampling steps required by diffusion models pose a substantial computational burden, limiting their practicality in real-time applications. Additionally, most related works utilize Transformer-based encoder architectures. Although these architectures excel at capturing long-range dependencies, they incur high computational costs and face feature-smoothing issues when processing long video sequences. To address these challenges, we propose EffiDiffAct, an efficient and high-performance TAS algorithm. Specifically, we develop a lightweight temporal feature encoder that reduces computational overhead and mitigates the rank collapse phenomenon associated with traditional self-attention mechanisms. Furthermore, we introduce an adaptive skip strategy that allows for dynamic adjustment of timestep lengths based on computed similarity metrics during inference, thereby further enhancing computational efficiency. Comprehensive experiments on the 50Salads, Breakfast, and GTEA datasets demonstrated the effectiveness of the proposed algorithm.

Title: Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Authors: Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, Peilin Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02032
Pdf URL: https://arxiv.org/pdf/2408.02032
Copy Paste: [[2408.02032]] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models(https://arxiv.org/abs/2408.02032)
Keywords: robust
Abstract: While Large Vision-Language Models (LVLMs) have rapidly advanced in recent years, the prevalent issue known as the `hallucination' problem has emerged as a significant bottleneck, hindering their real-world deployments. Existing methods mitigate this issue mainly from two perspectives: One approach leverages extra knowledge like robust instruction tuning LVLMs with curated datasets or employing auxiliary analysis networks, which inevitable incur additional costs. Another approach, known as contrastive decoding, induces hallucinations by manually disturbing the vision or instruction raw inputs and mitigates them by contrasting the outputs of the disturbed and original LVLMs. However, these approaches rely on empirical holistic input disturbances and double the inference cost. To avoid these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigation reveals that pretrained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. We develop the Context and Text-aware Token Selection (CT2S) strategy, which preserves only unimportant vision tokens after early layers of LVLMs to adaptively amplify text-informed hallucination during the auto-regressive decoding. This approach ensures that multimodal knowledge absorbed in the early layers induces multimodal contextual rather than aimless hallucinations. Subsequently, the original token logits subtract the amplified vision-and-text association hallucinations, guiding LVLMs decoding faithfully. Extensive experiments illustrate SID generates less-hallucination and higher-quality texts across various metrics, without extra knowledge and much additional computation burdens.

Title: Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion

Authors: Pooya Janani (1), Amirabolfazl Suratgar (1), Afshin Taghvaeipour (2) ((1) Distributed and Intelligent Optimization Research Laboratory, Dept. of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran, (2) Dept. of Mechanical Engineering, Amirkabir University of Technology, Tehran, Iran)
Subjects: cs.CV, cs.LG, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2408.02033
Pdf URL: https://arxiv.org/pdf/2408.02033
Copy Paste: [[2408.02033]] Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion(https://arxiv.org/abs/2408.02033)
Keywords: security
Abstract: This paper proposes a hybrid fusion-based deep learning approach based on two different modalities, audio and video, to improve human activity recognition and violence detection in public places. To take advantage of audiovisual fusion, late fusion, intermediate fusion, and hybrid fusion-based deep learning (HFBDL) are used and compared. Since the objective is to detect and recognize human violence in public places, Real-life violence situation (RLVS) dataset is expanded and used. Simulating results of HFBDL show 96.67\% accuracy on validation data, which is more accurate than the other state-of-the-art methods on this dataset. To showcase our model's ability in real-world scenarios, another dataset of 54 sounded videos of both violent and non-violent situations was recorded. The model could successfully detect 52 out of 54 videos correctly. The proposed method shows a promising performance on real scenarios. Thus, it can be used for human action recognition and violence detection in public places for security purposes.

Title: Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping

Authors: Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02034
Pdf URL: https://arxiv.org/pdf/2408.02034
Copy Paste: [[2408.02034]] Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping(https://arxiv.org/abs/2408.02034)
Keywords: large language model, segmentation
Abstract: Recently, there has been significant interest in enhancing the capability of multimodal large language models (MLLMs) to process high-resolution images. Most existing methods focus on adopting a cropping strategy to improve the ability of multimodal large language models to understand image details. However, this cropping operation inevitably causes the segmentation of objects and connected areas, which impairs the MLLM's ability to recognize small or irregularly shaped objects or text. This issue is particularly evident in lightweight MLLMs. Addressing this issue, we propose Mini-Monkey, a lightweight MLLM that incorporates a plug-and-play method called multi-scale adaptive crop strategy (MSAC). Mini-Monkey adaptively generates multi-scale representations, allowing it to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which effectively compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It not only demonstrates leading performance on a variety of general multimodal understanding tasks but also shows consistent improvements in document understanding capabilities. On the OCRBench, Mini-Monkey achieves a score of 802, outperforming 8B-parameter state-of-the-art model InternVL2-8B. Besides, our model and training strategy are very efficient, which can be trained with only eight RTX 3090. The code is available at this https URL.

Title: Robustness of Watermarking on Text-to-Image Diffusion Models

Authors: Xiaodong Wu, Xiangman Li, Jianbing Ni
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02035
Pdf URL: https://arxiv.org/pdf/2408.02035
Copy Paste: [[2408.02035]] Robustness of Watermarking on Text-to-Image Diffusion Models(https://arxiv.org/abs/2408.02035)
Keywords: attack, robust, watermark, diffusion, generative
Abstract: Watermarking has become one of promising techniques to not only aid in identifying AI-generated images but also serve as a deterrent against the unethical use of these models. However, the robustness of watermarking techniques has not been extensively studied recently. In this paper, we investigate the robustness of generative watermarking, which is created from the integration of watermarking embedding and text-to-image generation processing in generative models, e.g., latent diffusion models. Specifically, we propose three attacking methods, i.e., discriminator-based attacks, edge prediction-based attacks, and fine-tune-based attacks, under the scenario where the watermark decoder is not accessible. The model is allowed to be fine-tuned to created AI agents with specific generative tasks for personalizing or specializing. We found that generative watermarking methods are robust to direct evasion attacks, like discriminator-based attacks, or manipulation based on the edge information in edge prediction-based attacks but vulnerable to malicious fine-tuning. Experimental results show that our fine-tune-based attacks can decrease the accuracy of the watermark detection to nearly $67.92\%$. In addition, We conduct an ablation study on the length of fine-tuned messages, encoder/decoder's depth and structure to identify key factors that impact the performance of fine-tune-based attacks.

Title: Pixel-Level Domain Adaptation: A New Perspective for Enhancing Weakly Supervised Semantic Segmentation

Authors: Ye Du, Zehua Fu, Qingjie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02039
Pdf URL: https://arxiv.org/pdf/2408.02039
Copy Paste: [[2408.02039]] Pixel-Level Domain Adaptation: A New Perspective for Enhancing Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2408.02039)
Keywords: extraction, segmentation
Abstract: Recent attention has been devoted to the pursuit of learning semantic segmentation models exclusively from image tags, a paradigm known as image-level Weakly Supervised Semantic Segmentation (WSSS). Existing attempts adopt the Class Activation Maps (CAMs) as priors to mine object regions yet observe the imbalanced activation issue, where only the most discriminative object parts are located. In this paper, we argue that the distribution discrepancy between the discriminative and the non-discriminative parts of objects prevents the model from producing complete and precise pseudo masks as ground truths. For this purpose, we propose a Pixel-Level Domain Adaptation (PLDA) method to encourage the model in learning pixel-wise domain-invariant features. Specifically, a multi-head domain classifier trained adversarially with the feature extraction is introduced to promote the emergence of pixel features that are invariant with respect to the shift between the source (i.e., the discriminative object parts) and the target (\textit{i.e.}, the non-discriminative object parts) domains. In addition, we come up with a Confident Pseudo-Supervision strategy to guarantee the discriminative ability of each pixel for the segmentation task, which serves as a complement to the intra-image domain adversarial training. Our method is conceptually simple, intuitive and can be easily integrated into existing WSSS methods. Taking several strong baseline models as instances, we experimentally demonstrate the effectiveness of our approach under a wide range of settings.

Title: Deep Spectral Methods for Unsupervised Ultrasound Image Interpretation

Authors: Oleksandra Tmenova, Yordanka Velikova, Mahdi Saleh, Nassir Navab
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02043
Pdf URL: https://arxiv.org/pdf/2408.02043
Copy Paste: [[2408.02043]] Deep Spectral Methods for Unsupervised Ultrasound Image Interpretation(https://arxiv.org/abs/2408.02043)
Keywords: transformer, segmentation
Abstract: Ultrasound imaging is challenging to interpret due to non-uniform intensities, low contrast, and inherent artifacts, necessitating extensive training for non-specialists. Advanced representation with clear tissue structure separation could greatly assist clinicians in mapping underlying anatomy and distinguishing between tissue layers. Decomposing an image into semantically meaningful segments is mainly achieved using supervised segmentation algorithms. Unsupervised methods are beneficial, as acquiring large labeled datasets is difficult and costly, but despite their advantages, they still need to be explored in ultrasound. This paper proposes a novel unsupervised deep learning strategy tailored to ultrasound to obtain easily interpretable tissue separations. We integrate key concepts from unsupervised deep spectral methods, which combine spectral graph theory with deep learning methods. We utilize self-supervised transformer features for spectral clustering to generate meaningful segments based on ultrasound-specific metrics and shape and positional priors, ensuring semantic consistency across the dataset. We evaluate our unsupervised deep learning strategy on three ultrasound datasets, showcasing qualitative results across anatomical contexts without label requirements. We also conduct a comparative analysis against other clustering algorithms to demonstrate superior segmentation performance, boundary preservation, and label consistency.

Title: Fine-tuning multilingual language models in Twitter/X sentiment analysis: a study on Eastern-European V4 languages

Authors: Tomáš Filip, Martin Pavlíček, Petr Sosík
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02044
Pdf URL: https://arxiv.org/pdf/2408.02044
Copy Paste: [[2408.02044]] Fine-tuning multilingual language models in Twitter/X sentiment analysis: a study on Eastern-European V4 languages(https://arxiv.org/abs/2408.02044)
Keywords: large language model
Abstract: The aspect-based sentiment analysis (ABSA) is a standard NLP task with numerous approaches and benchmarks, where large language models (LLM) represent the current state-of-the-art. We focus on ABSA subtasks based on Twitter/X data in underrepresented languages. On such narrow tasks, small tuned language models can often outperform universal large ones, providing available and cheap solutions. We fine-tune several LLMs (BERT, BERTweet, Llama2, Llama3, Mistral) for classification of sentiment towards Russia and Ukraine in the context of the ongoing military conflict. The training/testing dataset was obtained from the academic API from Twitter/X during 2023, narrowed to the languages of the V4 countries (Czech Republic, Slovakia, Poland, Hungary). Then we measure their performance under a variety of settings including translations, sentiment targets, in-context learning and more, using GPT4 as a reference model. We document several interesting phenomena demonstrating, among others, that some models are much better fine-tunable on multilingual Twitter tasks than others, and that they can reach the SOTA level with a very small training set. Finally we identify combinations of settings providing the best results.

Title: PanicleNeRF: low-cost, high-precision in-field phenotypingof rice panicles with smartphone

Authors: Xin Yang (1 and 2), Xuqi Lu (1 and 2), Pengyao Xie (1 and 2), Ziyue Guo (1 and 2), Hui Fang (1), Haowei Fu (3), Xiaochun Hu (4), Zhenbiao Sun (4), Haiyan Cen (1 and 2) ((1) College of Biosystems Engineering and Food Science, Zhejiang University, (2) Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, (3) Jiaxing Academy of Agricultural Science, (4) Yuan Longping High-Tech Agriculture Co., Ltd)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02053
Pdf URL: https://arxiv.org/pdf/2408.02053
Copy Paste: [[2408.02053]] PanicleNeRF: low-cost, high-precision in-field phenotypingof rice panicles with smartphone(https://arxiv.org/abs/2408.02053)
Keywords: segmentation
Abstract: The rice panicle traits significantly influence grain yield, making them a primary target for rice phenotyping studies. However, most existing techniques are limited to controlled indoor environments and difficult to capture the rice panicle traits under natural growth conditions. Here, we developed PanicleNeRF, a novel method that enables high-precision and low-cost reconstruction of rice panicle three-dimensional (3D) models in the field using smartphone. The proposed method combined the large model Segment Anything Model (SAM) and the small model You Only Look Once version 8 (YOLOv8) to achieve high-precision segmentation of rice panicle images. The NeRF technique was then employed for 3D reconstruction using the images with 2D segmentation. Finally, the resulting point clouds are processed to successfully extract panicle traits. The results show that PanicleNeRF effectively addressed the 2D image segmentation task, achieving a mean F1 Score of 86.9% and a mean Intersection over Union (IoU) of 79.8%, with nearly double the boundary overlap (BO) performance compared to YOLOv8. As for point cloud quality, PanicleNeRF significantly outperformed traditional SfM-MVS (structure-from-motion and multi-view stereo) methods, such as COLMAP and Metashape. The panicle length was then accurately extracted with the rRMSE of 2.94% for indica and 1.75% for japonica rice. The panicle volume estimated from 3D point clouds strongly correlated with the grain number (R2 = 0.85 for indica and 0.82 for japonica) and grain mass (0.80 for indica and 0.76 for japonica). This method provides a low-cost solution for high-throughput in-field phenotyping of rice panicles, accelerating the efficiency of rice breeding.

Title: Step Saver: Predicting Minimum Denoising Steps for Diffusion Model Image Generation

Authors: Jean Yu, Haim Barad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02054
Pdf URL: https://arxiv.org/pdf/2408.02054
Copy Paste: [[2408.02054]] Step Saver: Predicting Minimum Denoising Steps for Diffusion Model Image Generation(https://arxiv.org/abs/2408.02054)
Keywords: diffusion
Abstract: In this paper, we introduce an innovative NLP model specifically fine-tuned to determine the minimal number of denoising steps required for any given text prompt. This advanced model serves as a real-time tool that recommends the ideal denoise steps for generating high-quality images efficiently. It is designed to work seamlessly with the Diffusion model, ensuring that images are produced with superior quality in the shortest possible time. Although our explanation focuses on the DDIM scheduler, the methodology is adaptable and can be applied to various other schedulers like Euler, Euler Ancestral, Heun, DPM2 Karras, UniPC, and more. This model allows our customers to conserve costly computing resources by executing the fewest necessary denoising steps to achieve optimal quality in the produced images.

Title: MedSyn: LLM-based Synthetic Medical Text Generation Framework

Authors: Gleb Kumichev, Pavel Blinov, Yulia Kuzkina, Vasily Goncharov, Galina Zubkova, Nikolai Zenovkin, Aleksei Goncharov, Andrey Savchenko
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02056
Pdf URL: https://arxiv.org/pdf/2408.02056
Copy Paste: [[2408.02056]] MedSyn: LLM-based Synthetic Medical Text Generation Framework(https://arxiv.org/abs/2408.02056)
Keywords: privacy, large language model
Abstract: Generating synthetic text addresses the challenge of data availability in privacy-sensitive domains such as healthcare. This study explores the applicability of synthetic data in real-world medical settings. We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph (MKG). We use MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the benefit of synthetic data through application in the ICD code prediction task. Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data. Furthermore, to provide new data for further research in the healthcare domain, we present the largest open-source synthetic dataset of clinical notes for the Russian language, comprising over 41k samples covering 219 ICD-10 codes.

Title: ParkingE2E: Camera-based End-to-end Parking Network, from Images to Planning

Authors: Changze Li, Ziheng Ji, Zhe Chen, Tong Qin, Ming Yang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2408.02061
Pdf URL: https://arxiv.org/pdf/2408.02061
Copy Paste: [[2408.02061]] ParkingE2E: Camera-based End-to-end Parking Network, from Images to Planning(https://arxiv.org/abs/2408.02061)
Keywords: transformer
Abstract: Autonomous parking is a crucial task in the intelligent driving field. Traditional parking algorithms are usually implemented using rule-based schemes. However, these methods are less effective in complex parking scenarios due to the intricate design of the algorithms. In contrast, neural-network-based methods tend to be more intuitive and versatile than the rule-based methods. By collecting a large number of expert parking trajectory data and emulating human strategy via learning-based methods, the parking task can be effectively addressed. In this paper, we employ imitation learning to perform end-to-end planning from RGB images to path planning by imitating human driving trajectories. The proposed end-to-end approach utilizes a target query encoder to fuse images and target features, and a transformer-based decoder to autoregressively predict future waypoints. We conducted extensive experiments in real-world scenarios, and the results demonstrate that the proposed method achieved an average parking success rate of 87.8% across four different real-world garages. Real-vehicle experiments further validate the feasibility and effectiveness of the method proposed in this paper.

Title: PromptSAM+: Malware Detection based on Prompt Segment Anything Model

Authors: Xingyuan Wei, Yichen Liu, Ce Li, Ning Li, Degang Sun, Yan Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02066
Pdf URL: https://arxiv.org/pdf/2408.02066
Copy Paste: [[2408.02066]] PromptSAM+: Malware Detection based on Prompt Segment Anything Model(https://arxiv.org/abs/2408.02066)
Keywords: robust, segmentation
Abstract: Machine learning and deep learning (ML/DL) have been extensively applied in malware detection, and some existing methods demonstrate robust performance. However, several issues persist in the field of malware detection: (1) Existing work often overemphasizes accuracy at the expense of practicality, rarely considering false positive and false negative rates as important metrics. (2) Considering the evolution of malware, the performance of classifiers significantly declines over time, greatly reducing the practicality of malware detectors. (3) Prior ML/DL-based efforts heavily rely on ample labeled data for model training, largely dependent on feature engineering or domain knowledge to build feature databases, making them vulnerable if correct labels are scarce. With the development of computer vision, vision-based malware detection technology has also rapidly evolved. In this paper, we propose a visual malware general enhancement classification framework, `PromptSAM+', based on a large visual network segmentation model, the Prompt Segment Anything Model(named PromptSAM+). Our experimental results indicate that 'PromptSAM+' is effective and efficient in malware detection and classification, achieving high accuracy and low rates of false positives and negatives. The proposed method outperforms the most advanced image-based malware detection technologies on several datasets. 'PromptSAM+' can mitigate aging in existing image-based malware classifiers, reducing the considerable manpower needed for labeling new malware samples through active learning. We conducted experiments on datasets for both Windows and Android platforms, achieving favorable outcomes. Additionally, our ablation experiments on several datasets demonstrate that our model identifies effective modules within the large visual network.

Title: Case-based reasoning approach for diagnostic screening of children with developmental delays

Authors: Zichen Song, Jiakang Li, Songning Lai, Sitan Huang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2408.02073
Pdf URL: https://arxiv.org/pdf/2408.02073
Copy Paste: [[2408.02073]] Case-based reasoning approach for diagnostic screening of children with developmental delays(https://arxiv.org/abs/2408.02073)
Keywords: extraction, transformer
Abstract: According to the World Health Organization, the population of children with developmental delays constitutes approximately 6% to 9% of the total population. Based on the number of newborns in Huaibei, Anhui Province, China, in 2023 (94,420), it is estimated that there are about 7,500 cases (suspected cases of developmental delays) of suspicious cases annually. Early identification and appropriate early intervention for these children can significantly reduce the wastage of medical resources and societal costs. International research indicates that the optimal period for intervention in children with developmental delays is before the age of six, with the golden treatment period being before three and a half years of age. Studies have shown that children with developmental delays who receive early intervention exhibit significant improvement in symptoms; some may even fully recover. This research adopts a hybrid model combining a CNN-Transformer model with Case-Based Reasoning (CBR) to enhance the screening efficiency for children with developmental delays. The CNN-Transformer model is an excellent model for image feature extraction and recognition, effectively identifying features in bone age images to determine bone age. CBR is a technique for solving problems based on similar cases; it solves current problems based on past experiences, similar to how humans solve problems through learning from experience. Given CBR's memory capability to judge and compare new cases based on previously stored old cases, it is suitable for application in support systems with latent and variable characteristics. Therefore, this study utilizes the CNN-Transformer-CBR to establish a screening system for children with developmental delays, aiming to improve screening efficiency.

Title: FDiff-Fusion:Denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation

Authors: Weiping Ding, Sheng Geng, Haipeng Wang, Jiashuang Huang, Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02075
Pdf URL: https://arxiv.org/pdf/2408.02075
Copy Paste: [[2408.02075]] FDiff-Fusion:Denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation(https://arxiv.org/abs/2408.02075)
Keywords: diffusion, segmentation
Abstract: In recent years, the denoising diffusion model has achieved remarkable success in image segmentation modeling. With its powerful nonlinear modeling capabilities and superior generalization performance, denoising diffusion models have gradually been applied to medical image segmentation tasks, bringing new perspectives and methods to this field. However, existing methods overlook the uncertainty of segmentation boundaries and the fuzziness of regions, resulting in the instability and inaccuracy of the segmentation results. To solve this problem, a denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation (FDiff-Fusion) is proposed in this paper. By integrating the denoising diffusion model into the classical U-Net network, this model can effectively extract rich semantic information from input medical images, thus providing excellent pixel-level representation for medical image segmentation. ... Finally, to validate the effectiveness of FDiff-Fusion, we compare it with existing advanced segmentation networks on the BRATS 2020 brain tumor dataset and the BTCV abdominal multi-organ dataset. The results show that FDiff-Fusion significantly improves the Dice scores and HD95 distance on these two datasets, demonstrating its superiority in medical image segmentation tasks.

Title: LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation

Authors: Dwij Mehta, Aditya Mehta, Pratik Narang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02078
Pdf URL: https://arxiv.org/pdf/2408.02078
Copy Paste: [[2408.02078]] LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation(https://arxiv.org/abs/2408.02078)
Keywords: diffusion, generative, segmentation
Abstract: Over the past decade, there has been tremendous progress in the domain of synthetic media generation. This is mainly due to the powerful methods based on generative adversarial networks (GANs). Very recently, diffusion probabilistic models, which are inspired by non-equilibrium thermodynamics, have taken the spotlight. In the realm of image generation, diffusion models (DMs) have exhibited remarkable proficiency in producing both realistic and heterogeneous imagery through their stochastic sampling procedure. This paper proposes a novel facial swapping module, termed as LDFaceNet (Latent Diffusion based Face Swapping Network), which is based on a guided latent diffusion model that utilizes facial segmentation and facial recognition modules for a conditioned denoising process. The model employs a unique loss function to offer directional guidance to the diffusion process. Notably, LDFaceNet can incorporate supplementary facial guidance for desired outcomes without any retraining. To the best of our knowledge, this represents the first application of the latent diffusion model in the face-swapping task without prior training. The results of this study demonstrate that the proposed method can generate extremely realistic and coherent images by leveraging the potential of the diffusion model for facial swapping, thereby yielding superior visual outcomes and greater diversity.

Title: Secure and Transparent Medical Record Management System Using Python and Blockchain

Authors: Atchiyya Naidu Chitikela
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02081
Pdf URL: https://arxiv.org/pdf/2408.02081
Copy Paste: [[2408.02081]] Secure and Transparent Medical Record Management System Using Python and Blockchain(https://arxiv.org/abs/2408.02081)
Keywords: secure, security, privacy, attack, robust
Abstract: In this paper, we propose a robust health record storage and management system built on blockchain technology to address the challenges faced by traditional healthcare record systems. The primary advantage of employing blockchain in healthcare record management is its ability to provide a secure and decentralized platform. Unlike traditional centralized databases, where a single point of failure can compromise data integrity and security, blockchain distributes data across a network of nodes, ensuring redundancy and resilience against cyber-attacks. This distributed nature of blockchain enhances data security and privacy, crucial considerations when dealing with sensitive health information. Central to our proposed system is the utilization of smart contracts, which are self-executing contracts with predefined rules and conditions. Smart contracts automate processes related to health record management, such as data access, sharing, and updating, based on predefined permissions and protocols. This automation not only streamlines administrative tasks but also reduces the risk of human errors and ensures data accuracy and consistency. Furthermore, our system prioritizes patient empowerment by granting individuals complete control over their health records. Patients can securely access and manage their data using cryptographic keys, granting permission to healthcare providers or other authorized entities as needed. Overall, our proposed health record storage and management system on the blockchain offer significant advantages over traditional systems, including enhanced security, data integrity, transparency, and patient control. By leveraging blockchain technology and smart contracts, healthcare organizations can revolutionize their record management practices, and maintaining secure ecosystems.

Title: Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Authors: Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
Subjects: cs.CV, cs.AI, cs.CL, eess.SP
Abstract URL: https://arxiv.org/abs/2408.02085
Pdf URL: https://arxiv.org/pdf/2408.02085
Copy Paste: [[2408.02085]] Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models(https://arxiv.org/abs/2408.02085)
Keywords: large language model
Abstract: Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Despite the vast amount of open instruction datasets, naively training a LLM on all existing instructions may not be optimal and practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, under the context of instruction tuning, there still exists a gap in knowledge on what kind of data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones where a unified, fine-grained taxonomy is structured. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, comparison between latest methods is conducted on their officially reported results to provide in-depth discussions on their limitations. Finally, we summarize the open challenges and propose the promosing avenues for future studies. All related contents are available at this https URL.

Title: KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for autonomous driving

Authors: Zhihao Lai, Chuanhao Liu, Shihui Sheng, Zhiqiang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02088
Pdf URL: https://arxiv.org/pdf/2408.02088
Copy Paste: [[2408.02088]] KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for autonomous driving(https://arxiv.org/abs/2408.02088)
Keywords: transformer
Abstract: Accurate 3D object detection in autonomous driving is critical yet challenging due to occlusions, varying object scales, and complex urban environments. This paper introduces the RCBEV-KAN algorithm, a pioneering method designed to enhance 3D object detection by fusing multimodal sensor data from cameras, LiDAR, and millimeter-wave radar. Our innovative Bird's Eye View (BEV)-based approach, utilizing a Transformer architecture, significantly boosts detection precision and efficiency by seamlessly integrating diverse data sources, improving spatial relationship handling, and optimizing computational processes. Experimental results show that the RCBEV-KAN model demonstrates superior performance across most detection categories, achieving higher Mean Distance AP (0.389 vs. 0.316, a 23% improvement), better ND Score (0.484 vs. 0.415, a 17% improvement), and faster Evaluation Time (71.28s, 8% faster). These results indicate that RCBEV-KAN is more accurate, reliable, and efficient, making it ideal for dynamic and challenging autonomous driving environments.

Title: View-consistent Object Removal in Radiance Fields

Authors: Yiren Lu, Jing Ma, Yu Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02100
Pdf URL: https://arxiv.org/pdf/2408.02100
Copy Paste: [[2408.02100]] View-consistent Object Removal in Radiance Fields(https://arxiv.org/abs/2408.02100)
Keywords: robust, segmentation
Abstract: Radiance Fields (RFs) have emerged as a crucial technology for 3D scene representation, enabling the synthesis of novel views with remarkable realism. However, as RFs become more widely used, the need for effective editing techniques that maintain coherence across different perspectives becomes evident. Current methods primarily depend on per-frame 2D image inpainting, which often fails to maintain consistency across views, thus compromising the realism of edited RF scenes. In this work, we introduce a novel RF editing pipeline that significantly enhances consistency by requiring the inpainting of only a single reference image. This image is then projected across multiple views using a depth-based approach, effectively reducing the inconsistencies observed with per-frame inpainting. However, projections typically assume photometric consistency across views, which is often impractical in real-world settings. To accommodate realistic variations in lighting and viewpoint, our pipeline adjusts the appearance of the projected views by generating multiple directional variants of the inpainted image, thereby adapting to different photometric conditions. Additionally, we present an effective and robust multi-view object segmentation approach as a valuable byproduct of our pipeline. Extensive experiments demonstrate that our method significantly surpasses existing frameworks in maintaining content consistency across views and enhancing visual quality. More results are available at this https URL.

Title: Effective Demonstration Annotation for In-Context Learning via Language Model-Based Determinantal Point Process

Authors: Peng Wang, Xiaobin Wang, Chao Lou, Shengyu Mao, Pengjun Xie, Yong Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02103
Pdf URL: https://arxiv.org/pdf/2408.02103
Copy Paste: [[2408.02103]] Effective Demonstration Annotation for In-Context Learning via Language Model-Based Determinantal Point Process(https://arxiv.org/abs/2408.02103)
Keywords: large language model
Abstract: In-context learning (ICL) is a few-shot learning paradigm that involves learning mappings through input-output pairs and appropriately applying them to new instances. Despite the remarkable ICL capabilities demonstrated by Large Language Models (LLMs), existing works are highly dependent on large-scale labeled support sets, not always feasible in practical scenarios. To refine this approach, we focus primarily on an innovative selective annotation mechanism, which precedes the standard demonstration retrieval. We introduce the Language Model-based Determinant Point Process (LM-DPP) that simultaneously considers the uncertainty and diversity of unlabeled instances for optimal selection. Consequently, this yields a subset for annotation that strikes a trade-off between the two factors. We apply LM-DPP to various language models, including GPT-J, LlaMA, and GPT-3. Experimental results on 9 NLU and 2 Generation datasets demonstrate that LM-DPP can effectively select canonical examples. Further analysis reveals that LLMs benefit most significantly from subsets that are both low uncertainty and high diversity.

Title: AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

Authors: Feichi Lu, Zijian Dong, Jie Song, Otmar Hilliges
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02110
Pdf URL: https://arxiv.org/pdf/2408.02110
Copy Paste: [[2408.02110]] AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos(https://arxiv.org/abs/2408.02110)
Keywords: robust
Abstract: Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.

Title: Assessing the XDC Network: A Comprehensive Evaluation of its qualitative and technical aspects

Authors: Atul Khekade, Omkar Mestry, Van Khanh Nguyen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02115
Pdf URL: https://arxiv.org/pdf/2408.02115
Copy Paste: [[2408.02115]] Assessing the XDC Network: A Comprehensive Evaluation of its qualitative and technical aspects(https://arxiv.org/abs/2408.02115)
Keywords: security
Abstract: This research provides a thorough assessment of the XDC Network, a delegated proof of stake (XDPoS) consensus-based blockchain technology, across its technical, security, and business dimensions. The study evaluates the network's decentralization, scalability, and security features, including its Nakamoto coefficient, validator participation, and client distribution. Additionally, it examines the developer ecosystem, including GitHub metrics, and business aspects such as transaction costs and predictability. The findings of this research will provide valuable insights into the strengths and weaknesses of the XDC Network, informing stakeholders and decision-makers about its suitability for various use cases, particularly in trade finance, asset tokenization, and enterprise blockchain solutions.

Title: FovEx: Human-inspired Explanations for Vision Transformers and Convolutional Neural Networks

Authors: Mahadev Prasad Panda, Matteo Tiezzi, Martina Vilas, Gemma Roig, Bjoern M. Eskofier, Dario Zanca
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02123
Pdf URL: https://arxiv.org/pdf/2408.02123
Copy Paste: [[2408.02123]] FovEx: Human-inspired Explanations for Vision Transformers and Convolutional Neural Networks(https://arxiv.org/abs/2408.02123)
Keywords: explainability, transformer
Abstract: Explainability in artificial intelligence (XAI) remains a crucial aspect for fostering trust and understanding in machine learning models. Current visual explanation techniques, such as gradient-based or class-activation-based methods, often exhibit a strong dependence on specific model architectures. Conversely, perturbation-based methods, despite being model-agnostic, are computationally expensive as they require evaluating models on a large number of forward passes. In this work, we introduce Foveation-based Explanations (FovEx), a novel XAI method inspired by human vision. FovEx seamlessly integrates biologically inspired perturbations by iteratively creating foveated renderings of the image and combines them with gradient-based visual explorations to determine locations of interest efficiently. These locations are selected to maximize the performance of the model to be explained with respect to the downstream task and then combined to generate an attribution map. We provide a thorough evaluation with qualitative and quantitative assessments on established benchmarks. Our method achieves state-of-the-art performance on both transformers (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility among various architectures. Furthermore, we show the alignment between the explanation map produced by FovEx and human gaze patterns (+14\% in NSS compared to RISE, +203\% in NSS compared to GradCAM). This comparison enhances our confidence in FovEx's ability to close the interpretation gap between humans and machines.

Title: Table Transformers for Imputing Textual Attributes

Authors: Ting-Ruen Wei, Yuan Wang, Yoshitaka Inoue, Hsin-Tai Wu, Yi Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02128
Pdf URL: https://arxiv.org/pdf/2408.02128
Copy Paste: [[2408.02128]] Table Transformers for Imputing Textual Attributes(https://arxiv.org/abs/2408.02128)
Keywords: transformer
Abstract: Missing data in tabular dataset is a common issue as the performance of downstream tasks usually depends on the completeness of the training dataset. Previous missing data imputation methods focus on numeric and categorical columns, but we propose a novel end-to-end approach called Table Transformers for Imputing Textual Attributes (TTITA) based on the transformer to impute unstructured textual columns using other columns in the table. We conduct extensive experiments on two Amazon Reviews datasets, and our approach shows competitive performance outperforming baseline models such as recurrent neural networks and Llama2. The performance improvement is more significant when the target sequence has a longer length. Additionally, we incorporated multi-task learning to simultaneously impute for heterogeneous columns, boosting the performance for text imputation. We also qualitatively compare with ChatGPT for realistic applications.

Title: Model Hijacking Attack in Federated Learning

Authors: Zheng Li, Siyuan Wu, Ruichuan Chen, Paarijaat Aditya, Istemi Ekin Akkus, Manohar Vanga, Min Zhang, Hao Li, Yang Zhang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02131
Pdf URL: https://arxiv.org/pdf/2408.02131
Copy Paste: [[2408.02131]] Model Hijacking Attack in Federated Learning(https://arxiv.org/abs/2408.02131)
Keywords: defense, attack, federate
Abstract: Machine learning (ML), driven by prominent paradigms such as centralized and federated learning, has made significant progress in various critical applications ranging from autonomous driving to face recognition. However, its remarkable success has been accompanied by various attacks. Recently, the model hijacking attack has shown that ML models can be hijacked to execute tasks different from their original tasks, which increases both accountability and parasitic computational risks. Nevertheless, thus far, this attack has only focused on centralized learning. In this work, we broaden the scope of this attack to the federated learning domain, where multiple clients collaboratively train a global model without sharing their data. Specifically, we present HijackFL, the first-of-its-kind hijacking attack against the global model in federated learning. The adversary aims to force the global model to perform a different task (called hijacking task) from its original task without the server or benign client noticing. To accomplish this, unlike existing methods that use data poisoning to modify the target model's parameters, HijackFL searches for pixel-level perturbations based on their local model (without modifications) to align hijacking samples with the original ones in the feature space. When performing the hijacking task, the adversary applies these cloaks to the hijacking samples, compelling the global model to identify them as original samples and predict them accordingly. We conduct extensive experiments on four benchmark datasets and three popular models. Empirical results demonstrate that its attack performance outperforms baselines. We further investigate the factors that affect its performance and discuss possible defenses to mitigate its impact.

Title: VidModEx: Interpretable and Efficient Black Box Model Extraction for High-Dimensional Spaces

Authors: Somnath Sendhil Kumar, Yuvaraj Govindarajulu, Pavan Kulkarni, Manojkumar Parmar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02140
Pdf URL: https://arxiv.org/pdf/2408.02140
Copy Paste: [[2408.02140]] VidModEx: Interpretable and Efficient Black Box Model Extraction for High-Dimensional Spaces(https://arxiv.org/abs/2408.02140)
Keywords: extraction
Abstract: In the domain of black-box model extraction, conventional methods reliant on soft labels or surrogate datasets struggle with scaling to high-dimensional input spaces and managing the complexity of an extensive array of interrelated classes. In this work, we present a novel approach that utilizes SHAP (SHapley Additive exPlanations) to enhance synthetic data generation. SHAP quantifies the individual contributions of each input feature towards the victim model's output, facilitating the optimization of an energy-based GAN towards a desirable output. This method significantly boosts performance, achieving a 16.45% increase in the accuracy of image classification models and extending to video classification models with an average improvement of 26.11% and a maximum of 33.36% on challenging datasets such as UCF11, UCF101, Kinetics 400, Kinetics 600, and Something-Something V2. We further demonstrate the effectiveness and practical utility of our method under various scenarios, including the availability of top-k prediction probabilities, top-k prediction labels, and top-1 labels.

Title: Analyzing Cultural Representations of Emotions in LLMs through Mixed Emotion Survey

Authors: Shiran Dudy, Ibrahim Said Ahmad, Ryoko Kitajima, Agata Lapedriza
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02143
Pdf URL: https://arxiv.org/pdf/2408.02143
Copy Paste: [[2408.02143]] Analyzing Cultural Representations of Emotions in LLMs through Mixed Emotion Survey(https://arxiv.org/abs/2408.02143)
Keywords: large language model
Abstract: Large Language Models (LLMs) have gained widespread global adoption, showcasing advanced linguistic capabilities across multiple of languages. There is a growing interest in academia to use these models to simulate and study human behaviors. However, it is crucial to acknowledge that an LLM's proficiency in a specific language might not fully encapsulate the norms and values associated with its culture. Concerns have emerged regarding potential biases towards Anglo-centric cultures and values due to the predominance of Western and US-based training data. This study focuses on analyzing the cultural representations of emotions in LLMs, in the specific case of mixed-emotion situations. Our methodology is based on the studies of Miyamoto et al. (2010), which identified distinctive emotional indicators in Japanese and American human responses. We first administer their mixed emotion survey to five different LLMs and analyze their outputs. Second, we experiment with contextual variables to explore variations in responses considering both language and speaker origin. Thirdly, we expand our investigation to encompass additional East Asian and Western European origin languages to gauge their alignment with their respective cultures, anticipating a closer fit. We find that (1) models have limited alignment with the evidence in the literature; (2) written language has greater effect on LLMs' response than information on participants origin; and (3) LLMs responses were found more similar for East Asian languages than Western European languages.

Title: ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software

Authors: Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Abdelouahab (Habs)Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02153
Pdf URL: https://arxiv.org/pdf/2408.02153
Copy Paste: [[2408.02153]] ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software(https://arxiv.org/abs/2408.02153)
Keywords: security
Abstract: High-quality datasets of real-world vulnerabilities are enormously valuable for downstream research in software security, but existing datasets are typically small, require extensive manual effort to update, and are missing crucial features that such research needs. In this paper, we introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software. By sourcing vulnerabilities from C/C++ projects that Google's OSS-Fuzz discovered and implementing a reliable re-compilation system, we successfully reproduce more than 5,000 memory vulnerabilities across over 250 projects, each with a triggering input, the canonical developer-written patch for fixing the vulnerability, and the ability to automatically rebuild the project from source and run it at its vulnerable and patched revisions. Moreover, our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities, allowing it to grow over time. We provide a thorough characterization of the ARVO dataset, show that it can locate fixes more accurately than Google's own OSV reproduction effort, and demonstrate its value for future research through two case studies: firstly evaluating real-world LLM-based vulnerability repair, and secondly identifying over 300 falsely patched (still-active) zero-day vulnerabilities from projects improperly labeled by OSS-Fuzz.

Title: Rethinking Affect Analysis: A Protocol for Ensuring Fairness and Consistency

Authors: Guanyu Hu, Dimitrios Kollias, Eleni Papadopoulou, Paraskevi Tzouveli, Jie Wei, Xinyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02164
Pdf URL: https://arxiv.org/pdf/2408.02164
Copy Paste: [[2408.02164]] Rethinking Affect Analysis: A Protocol for Ensuring Fairness and Consistency(https://arxiv.org/abs/2408.02164)
Keywords: fair
Abstract: Evaluating affect analysis methods presents challenges due to inconsistencies in database partitioning and evaluation protocols, leading to unfair and biased results. Previous studies claim continuous performance improvements, but our findings challenge such assertions. Using these insights, we propose a unified protocol for database partitioning that ensures fairness and comparability. We provide detailed demographic annotations (in terms of race, gender and age), evaluation metrics, and a common framework for expression recognition, action unit detection and valence-arousal estimation. We also rerun the methods with the new protocol and introduce a new leaderboards to encourage future research in affect recognition with a fairer comparison. Our annotations, code, and pre-trained models are available on \hyperlink{this https URL}{Github}.

Title: X.509 Information Security Certification Based on Post-Quantum Cryptography

Authors: Abel C. H. Chen
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2408.02179
Pdf URL: https://arxiv.org/pdf/2408.02179
Copy Paste: [[2408.02179]] X.509 Information Security Certification Based on Post-Quantum Cryptography(https://arxiv.org/abs/2408.02179)
Keywords: security
Abstract: In recent years, with the advancement of quantum computing, mainstream asymmetric cryptographic methods in the current Public Key Infrastructure (PKI) systems are gradually being threatened. Therefore, this study explores X.509 security certificates based on Post-Quantum Cryptography (PQC) and discusses implemented solutions. This study compares mainstream asymmetric cryptographic methods (including RSA and Elliptic Curve Digital Signature Algorithm (ECDSA)) with standard PQC methods (including Falcon, Dilithium, SPHINCS+), comparing the efficiency of certificate generation, signature generation, and signature verification. Finally, recommendations for a solution based on PQC for X.509 security certificates are proposed.

Title: AssemAI: Interpretable Image-Based Anomaly Detection for Manufacturing Pipelines

Authors: Renjith Prasad, Chathurangi Shyalika, Ramtin Zand, Fadi El Kalach, Revathy Venkataramanan, Ramy Harik, Amit Sheth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02181
Pdf URL: https://arxiv.org/pdf/2408.02181
Copy Paste: [[2408.02181]] AssemAI: Interpretable Image-Based Anomaly Detection for Manufacturing Pipelines(https://arxiv.org/abs/2408.02181)
Keywords: explainability, transformer
Abstract: Anomaly detection in manufacturing pipelines remains a critical challenge, intensified by the complexity and variability of industrial environments. This paper introduces AssemAI, an interpretable image-based anomaly detection system tailored for smart manufacturing pipelines. Our primary contributions include the creation of a tailored image dataset and the development of a custom object detection model, YOLO-FF, designed explicitly for anomaly detection in manufacturing assembly environments. Utilizing the preprocessed image dataset derived from an industry-focused rocket assembly pipeline, we address the challenge of imbalanced image data and demonstrate the importance of image-based methods in anomaly detection. The proposed approach leverages domain knowledge in data preparation, model development and reasoning. We compare our method against several baselines, including simple CNN and custom Visual Transformer (ViT) models, showcasing the effectiveness of our custom data preparation and pretrained CNN integration. Additionally, we incorporate explainability techniques at both user and model levels, utilizing ontology for user-friendly explanations and SCORE-CAM for in-depth feature and model analysis. Finally, the model was also deployed in a real-time setting. Our results include ablation studies on the baselines, providing a comprehensive evaluation of the proposed system. This work highlights the broader impact of advanced image-based anomaly detection in enhancing the reliability and efficiency of smart manufacturing processes.

Title: CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

Authors: Weijie Lv, Xuan Xia, Sheng-Jun Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02193
Pdf URL: https://arxiv.org/pdf/2408.02193
Copy Paste: [[2408.02193]] CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs(https://arxiv.org/abs/2408.02193)
Keywords: large language model
Abstract: Large language models (LLMs) have shown great potential in code-related tasks, yet open-source models lag behind their closed-source counterparts. To bridge this performance gap, existing methods generate vast amounts of synthetic data for fine-tuning, leading to inefficiencies in training. Motivated by the need for more effective and efficient training, we propose the Code Adaptive Compute-efficient Tuning (CodeACT) framework. CodeACT introduces the Complexity and Diversity Aware Sampling (CDAS) method to select high-quality training data based on complexity and diversity, and the Dynamic Pack padding strategy to reduce computational resource usage by minimizing padding tokens during training. Experimental results demonstrate that CodeACT-DeepSeek-Coder-6.7B, fine-tuned on only 40% of the EVOL-Instruct data, achieves an 8.6% performance increase on HumanEval, reduces training time by 78%, and decreases peak GPU memory usage by 27%. These findings underscore CodeACT's ability to enhance the performance and efficiency of open-source models. By optimizing both the data selection and training processes, CodeACT offers a comprehensive approach to improving the capabilities of open-source LLMs while significantly reducing computational requirements, addressing the dual challenges of data quality and training efficiency, and paving the way for more resource-efficient and performant models.

Title: Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report)

Authors: Hui Yin, Amir Aryani, Nakul Nambiar
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2408.02201
Pdf URL: https://arxiv.org/pdf/2408.02201
Copy Paste: [[2408.02201]] Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report)(https://arxiv.org/abs/2408.02201)
Keywords: privacy, protect, large language model
Abstract: The use of large language models (LLMs) is expanding rapidly, and open-source versions are becoming available, offering users safer and more adaptable options. These models enable users to protect data privacy by eliminating the need to provide data to third parties and can be customized for specific tasks. In this study, we compare the performance of various language models on the Sustainable Development Goal (SDG) mapping task, using the output of GPT-4o as the baseline. The selected open-source models for comparison include Mixtral, LLaMA 2, LLaMA 3, Gemma, and Qwen2. Additionally, GPT-4o-mini, a more specialized version of GPT-4o, was included to extend the comparison. Given the multi-label nature of the SDG mapping task, we employed metrics such as F1 score, precision, and recall with micro-averaging to evaluate different aspects of the models' performance. These metrics are derived from the confusion matrix to ensure a comprehensive evaluation. We provide a clear observation and analysis of each model's performance by plotting curves based on F1 score, precision, and recall at different thresholds. According to the results of this experiment, LLaMA 2 and Gemma still have significant room for improvement. The other four models do not exhibit particularly large differences in performance. The outputs from all seven models are available on Zenodo: this https URL.

Title: Source-Free Domain-Invariant Performance Prediction

Authors: Ekaterina Khramtsova, Mahsa Baktashmotlagh, Guido Zuccon, Xi Wang, Mathieu Salzmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02209
Pdf URL: https://arxiv.org/pdf/2408.02209
Copy Paste: [[2408.02209]] Source-Free Domain-Invariant Performance Prediction(https://arxiv.org/abs/2408.02209)
Keywords: generative
Abstract: Accurately estimating model performance poses a significant challenge, particularly in scenarios where the source and target domains follow different data distributions. Most existing performance prediction methods heavily rely on the source data in their estimation process, limiting their applicability in a more realistic setting where only the trained model is accessible. The few methods that do not require source data exhibit considerably inferior performance. In this work, we propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data. We establish connections between our approach for unsupervised calibration and temperature scaling. We then employ a gradient-based strategy to evaluate the correctness of the calibrated predictions. Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability. Furthermore, our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.

Title: ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Authors: Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02210
Pdf URL: https://arxiv.org/pdf/2408.02210
Copy Paste: [[2408.02210]] ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning(https://arxiv.org/abs/2408.02210)
Keywords: large language model
Abstract: Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have exhibited a strong potential in complicated multi-modal tasks. Empowered by recent advances in large language models (LLMs), this multi-modal challenge has been brought to a new stage by treating LLMs as few-shot/zero-shot planners, i.e., vision-language (VL) programming. Such methods, despite their numerous merits, suffer from challenges due to LLM planning mistakes or inaccuracy of visual execution modules, lagging behind the non-compositional models. In this work, we devise a "plug-and-play" method, ExoViP, to correct errors in both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current VL programming schemes. Specifically, our proposed verification module utilizes a mixture of three sub-verifiers to validate predictions after each reasoning step, subsequently calibrating the visual module predictions and refining the reasoning trace planned by LLMs. Experimental results on two representative VL programming methods showcase consistent improvements on five compositional reasoning tasks on standard benchmarks. In light of this, we believe that ExoViP can foster better performance and generalization on open-domain multi-modal challenges.

Title: Climate-Driven Doubling of Maize Loss Probability in U.S. Crop Insurance: Spatiotemporal Prediction and Possible Policy Responses

Authors: A Samuel Pottinger, Lawson Connor, Brookie Guzder-Williams, Maya Weltman-Fahs, Timothy Bowles
Subjects: cs.LG, q-fin.RM
Abstract URL: https://arxiv.org/abs/2408.02217
Pdf URL: https://arxiv.org/pdf/2408.02217
Copy Paste: [[2408.02217]] Climate-Driven Doubling of Maize Loss Probability in U.S. Crop Insurance: Spatiotemporal Prediction and Possible Policy Responses(https://arxiv.org/abs/2408.02217)
Keywords: protect, generative
Abstract: Climate change not only threatens agricultural producers but also strains financial institutions. These important food system actors include government entities tasked with both insuring grower livelihoods and supporting response to continued global warming. We use an artificial neural network to predict future maize yields in the U.S. Corn Belt, finding alarming changes to institutional risk exposure within the Federal Crop Insurance Program. Specifically, our machine learning method anticipates more frequent and more severe yield losses that would result in the annual probability of Yield Protection (YP) claims to more than double at mid-century relative to simulations without continued climate change. Furthermore, our dual finding of relatively unchanged average yields paired with decreasing yield stability reveals targeted opportunities to adjust coverage formulas to include variability. This important structural shift may help regulators support grower adaptation to continued climate change by recognizing the value of risk-reducing strategies such as regenerative agriculture. Altogether, paired with open source interactive tools for deeper investigation, our risk profile simulations fill an actionable gap in current understanding, bridging granular historic yield estimation and climate-informed prediction of future insurer-relevant loss.

Title: SoK: Fighting Counterfeits with Cyber-Physical Synergy Based on Physically-Unclonable Identifiers of Paper Surface

Authors: Anirudh Nakra, Min Wu, Chau-Wai Wong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02221
Pdf URL: https://arxiv.org/pdf/2408.02221
Copy Paste: [[2408.02221]] SoK: Fighting Counterfeits with Cyber-Physical Synergy Based on Physically-Unclonable Identifiers of Paper Surface(https://arxiv.org/abs/2408.02221)
Keywords: security, biometric
Abstract: Counterfeit products cause severe harm to public safety and health by penetrating untrusted supply chains. Numerous anti-counterfeiting techniques have been proposed, among which the use of inherent, unclonable irregularities of paper surfaces has shown considerable potential as a high-performance economical solution. Prior works do not consider supply chains cohesively, either focusing on creating or improving unclonable identifiers or on securing digital records of products. This work aims to systematically unify these two separate but connected research areas by comprehensively analyzing the needs of supply chains. We construct a generalized paper-based authentication framework and identify important shortcomings and promising ideas in the existing literature. Next, we do a stage-wise security analysis of our consolidated framework by drawing inspiration from works in signal processing, cryptography, and biometric systems. Finally, we examine key representative scenarios that illustrate the range of practical and technical challenges in real-world supply chains, and we outline the best practices to guide future research.

Title: Cross-modulated Attention Transformer for RGBT Tracking

Authors: Yun Xiao, Jiacong Zhao, Andong Lu, Chenglong Li, Yin Lin, Bing Yin, Cong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02222
Pdf URL: https://arxiv.org/pdf/2408.02222
Copy Paste: [[2408.02222]] Cross-modulated Attention Transformer for RGBT Tracking(https://arxiv.org/abs/2408.02222)
Keywords: robust, transformer
Abstract: Existing Transformer-based RGBT trackers achieve remarkable performance benefits by leveraging self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and template-search correlation computation. Nevertheless, the independent search-template correlation calculations ignore the consistency between branches, which can result in ambiguous and inappropriate correlation weights. It not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach called Cross-modulated Attention Transformer (CAFormer), which performs intra-modality self-correlation, inter-modality feature interaction, and search-template correlation computation in a unified attention model, for RGBT tracking. In particular, we first independently generate correlation maps for each modality and feed them into the designed Correlation Modulated Enhancement module, modulating inaccurate correlation weights by seeking the consensus between modalities. Such kind of design unifies self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates redundant computation introduced by extra cross-attention scheme. In addition, we propose a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Extensive experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.

Title: Large Language Model Aided QoS Prediction for Service Recommendation

Authors: Huiying Liu, Zekun Zhang, Qilin Wu, Yiwen Zhang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2408.02223
Pdf URL: https://arxiv.org/pdf/2408.02223
Copy Paste: [[2408.02223]] Large Language Model Aided QoS Prediction for Service Recommendation(https://arxiv.org/abs/2408.02223)
Keywords: large language model
Abstract: Large language models (LLMs) have seen rapid improvement in the recent years, and are used in a wider range of applications. After being trained on large text corpus, LLMs obtain the capability of extracting rich features from textual data. Such capability is potentially useful for the web service recommendation task, where the web users and services have intrinsic attributes that can be described using natural language sentences and are useful for recommendation. In this paper, we explore the possibility and practicality of using LLMs for web service recommendation. We propose the large language model aided QoS prediction (llmQoS) model, which use LLMs to extract useful information from attributes of web users and services via descriptive sentences. This information is then used in combination with the QoS values of historical interactions of users and services, to predict QoS values for any given user-service pair. Our proposed model is shown to overcome the data sparsity issue for QoS prediction. We show that on the WSDream dataset, llmQoS outperforms comparable baseline models consistently.

Title: ProCreate, Don\'t Reproduce! Propulsive Energy Diffusion for Creative Generation

Authors: Jack Lu, Ryan Teehan, Mengye Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02226
Pdf URL: https://arxiv.org/pdf/2408.02226
Copy Paste: [[2408.02226]] ProCreate, Don\'t Reproduce! Propulsive Energy Diffusion for Creative Generation(https://arxiv.org/abs/2408.02226)
Keywords: diffusion, generative
Abstract: In this paper, we propose ProCreate, a simple and easy-to-implement method to improve sample diversity and creativity of diffusion-based image generative models and to prevent training data reproduction. ProCreate operates on a set of reference images and actively propels the generated image embedding away from the reference embeddings during the generation process. We propose FSCG-8 (Few-Shot Creative Generation 8), a few-shot creative generation dataset on eight different categories -- encompassing different concepts, styles, and settings -- in which ProCreate achieves the highest sample diversity and fidelity. Furthermore, we show that ProCreate is effective at preventing replicating training data in a large-scale evaluation using training text prompts. Code and FSCG-8 are available at this https URL. The project page is available at this https URL.

Title: REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Authors: Agneet Chatterjee, Yiran Luo, Tejas Gokhale, Yezhou Yang, Chitta Baral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02231
Pdf URL: https://arxiv.org/pdf/2408.02231
Copy Paste: [[2408.02231]] REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models(https://arxiv.org/abs/2408.02231)
Keywords: robust, generative, large language model
Abstract: Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework which improves spatial fidelity in vision-language models. REVISION is a 3D rendering based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework, which currently supports 100+ 3D assets, 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware generative models.

Title: A Multi-Source Heterogeneous Knowledge Injected Prompt Learning Method for Legal Charge Prediction

Authors: Jingyun Sun, Chi Wei, Yang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02233
Pdf URL: https://arxiv.org/pdf/2408.02233
Copy Paste: [[2408.02233]] A Multi-Source Heterogeneous Knowledge Injected Prompt Learning Method for Legal Charge Prediction(https://arxiv.org/abs/2408.02233)
Keywords: interpretability
Abstract: Legal charge prediction, an essential task in legal AI, seeks to assign accurate charge labels to case descriptions, attracting significant recent interest. Existing methods primarily employ diverse neural network structures for modeling case descriptions directly, failing to effectively leverage multi-source external knowledge. We propose a prompt learning framework-based method that simultaneously leverages multi-source heterogeneous external knowledge from a legal knowledge base, a conversational LLM, and related legal articles. Specifically, we match knowledge snippets in case descriptions via the legal knowledge base and encapsulate them into the input through a hard prompt template. Additionally, we retrieve legal articles related to a given case description through contrastive learning, and then obtain factual elements within the case description through a conversational LLM. We fuse the embedding vectors of soft prompt tokens with the encoding vector of factual elements to achieve knowledge-enhanced model forward inference. Experimental results show that our method achieved state-of-the-art results on CAIL-2018, the largest legal charge prediction dataset, and our method has lower data dependency. Case studies also demonstrate our method's strong interpretability.

Title: Do Large Language Models Speak All Languages Equally? A Comparative Study in Low-Resource Settings

Authors: Md. Arid Hasan, Prerona Tarannum, Krishno Dey, Imran Razzak, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02237
Pdf URL: https://arxiv.org/pdf/2408.02237
Copy Paste: [[2408.02237]] Do Large Language Models Speak All Languages Equally? A Comparative Study in Low-Resource Settings(https://arxiv.org/abs/2408.02237)
Keywords: large language model
Abstract: Large language models (LLMs) have garnered significant interest in natural language processing (NLP), particularly their remarkable performance in various downstream tasks in resource-rich languages. Recent studies have highlighted the limitations of LLMs in low-resource languages, primarily focusing on binary classification tasks and giving minimal attention to South Asian languages. These limitations are primarily attributed to constraints such as dataset scarcity, computational costs, and research gaps specific to low-resource languages. To address this gap, we present datasets for sentiment and hate speech tasks by translating from English to Bangla, Hindi, and Urdu, facilitating research in low-resource language processing. Further, we comprehensively examine zero-shot learning using multiple LLMs in English and widely spoken South Asian languages. Our findings indicate that GPT-4 consistently outperforms Llama 2 and Gemini, with English consistently demonstrating superior performance across diverse tasks compared to low-resource languages. Furthermore, our analysis reveals that natural language inference (NLI) exhibits the highest performance among the evaluated tasks, with GPT-4 demonstrating superior capabilities.

Title: BOTS-LM: Training Large Language Models for Setswana

Authors: Nathan Brown, Vukosi Marivate
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02239
Pdf URL: https://arxiv.org/pdf/2408.02239
Copy Paste: [[2408.02239]] BOTS-LM: Training Large Language Models for Setswana(https://arxiv.org/abs/2408.02239)
Keywords: generative, large language model
Abstract: In this work we present BOTS-LM, a series of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, BOTS-LM achieves performance similar to models significantly larger than itself while maintaining computational efficiency. Our initial release features an 8 billion parameter generative large language model, with upcoming 0.5 billion and 1 billion parameter large language models and a 278 million parameter encoder-only model soon to be released. We find the 8 billion parameter model significantly outperforms Llama-3-70B and Aya 23 on English-Setswana translation tasks, approaching the performance of dedicated machine translation models, while approaching 70B parameter performance on Setswana reasoning as measured by a machine translated subset of the MMLU benchmark. To accompany the BOTS-LM series of language models, we release the largest Setswana web dataset, SetsText, totalling over 267 million tokens. In addition, we release the largest machine translated Setswana dataset, the first and largest synthetic Setswana dataset, training and evaluation code, training logs, and MMLU-tsn, a machine translated subset of MMLU.

Title: Curriculum learning based pre-training using Multi-Modal Contrastive Masked Autoencoders

Authors: Muhammad Abdullah Jamal, Omid Mohareri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02245
Pdf URL: https://arxiv.org/pdf/2408.02245
Copy Paste: [[2408.02245]] Curriculum learning based pre-training using Multi-Modal Contrastive Masked Autoencoders(https://arxiv.org/abs/2408.02245)
Keywords: robust, diffusion, segmentation
Abstract: In this paper, we propose a new pre-training method for image understanding tasks under Curriculum Learning (CL) paradigm which leverages RGB-D. The method utilizes Multi-Modal Contrastive Masked Autoencoder and Denoising techniques. Recent approaches either use masked autoencoding (e.g., MultiMAE) or contrastive learning(e.g., Pri3D, or combine them in a single contrastive masked autoencoder architecture such as CMAE and CAV-MAE. However, none of the single contrastive masked autoencoder is applicable to RGB-D datasets. To improve the performance and efficacy of such methods, we propose a new pre-training strategy based on CL. Specifically, in the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we initialize the modality-specific encoders using the weights from the first stage and then pre-train the model using masked autoencoding and denoising/noise prediction used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns high frequency components of the input data. Our approach is scalable, robust and suitable for pre-training with limited RGB-D datasets. Extensive experiments on multiple datasets such as ScanNet, NYUv2 and SUN RGB-D show the efficacy and superior performance of our approach. Specifically, we show an improvement of +1.0% mIoU against Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in low-data regime by evaluating it for semantic segmentation task against the state-of-the-art methods.

Title: Contrastive Learning and Abstract Concepts: The Case of Natural Numbers

Authors: Daniel N. Nissani (Nissensohn)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02247
Pdf URL: https://arxiv.org/pdf/2408.02247
Copy Paste: [[2408.02247]] Contrastive Learning and Abstract Concepts: The Case of Natural Numbers(https://arxiv.org/abs/2408.02247)
Keywords: robust
Abstract: Contrastive Learning (CL) has been successfully applied to classification and other downstream tasks related to concrete concepts, such as objects contained in the ImageNet dataset. No attempts seem to have been made so far in applying this promising scheme to more abstract entities. A prominent example of these could be the concept of (discrete) Quantity. CL can be frequently interpreted as a self-supervised scheme guided by some profound and ubiquitous conservation principle (e.g. conservation of identity in object classification tasks). In this introductory work we apply a suitable conservation principle to the semi-abstract concept of natural numbers by which discrete quantities can be estimated or predicted. We experimentally show, by means of a toy problem, that contrastive learning can be trained to count at a glance with high accuracy both at human as well as at super-human ranges.. We compare this with the results of a trained-to-count at a glance supervised learning (SL) neural network scheme of similar architecture. We show that both schemes exhibit similar good performance on baseline experiments, where the distributions of the training and testing stages are equal. Importantly, we demonstrate that in some generalization scenarios, where training and testing distributions differ, CL boasts more robust and much better error performance.

Title: ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems

Authors: Andrew Zhu, Liam Dugan, Chris Callison-Burch
Subjects: cs.CL, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2408.02248
Pdf URL: https://arxiv.org/pdf/2408.02248
Copy Paste: [[2408.02248]] ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems(https://arxiv.org/abs/2408.02248)
Keywords: large language model
Abstract: Recently, there has been increasing interest in using Large Language Models (LLMs) to construct complex multi-agent systems to perform tasks such as compiling literature reviews, drafting consumer reports, and planning vacations. Many tools and libraries exist for helping create such systems, however none support recursive multi-agent systems -- where the models themselves flexibly decide when to delegate tasks and how to organize their delegation structure. In this work, we introduce ReDel: a toolkit for recursive multi-agent systems that supports custom tool-use, delegation schemes, event-based logging, and interactive replay in an easy-to-use web interface. We show that, using ReDel, we are able to achieve significant performance gains on agentic benchmarks and easily identify potential areas of improvements through the visualization and debugging tools. Our code, documentation, and PyPI package are open-source and free to use under the MIT license.

Title: Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs

Authors: Jeongkee Lim, Yusung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02261
Pdf URL: https://arxiv.org/pdf/2408.02261
Copy Paste: [[2408.02261]] Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs(https://arxiv.org/abs/2408.02261)
Keywords: segmentation
Abstract: The challenge of semantic segmentation in Unsupervised Domain Adaptation (UDA) emerges not only from domain shifts between source and target images but also from discrepancies in class taxonomies across domains. Traditional UDA research assumes consistent taxonomy between the source and target domains, thereby limiting their ability to recognize and adapt to the taxonomy of the target domain. This paper introduces a novel approach, Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using Vision Language Models (CSI), which effectively performs domain-adaptive semantic segmentation even in situations of source-target class mismatches. CSI leverages the semantic generalization potential of Visual Language Models (VLMs) to create synergy with previous UDA methods. It leverages segment reasoning obtained through traditional UDA methods, combined with the rich semantic knowledge embedded in VLMs, to relabel new classes in the target domain. This approach allows for effective adaptation to extended taxonomies without requiring any ground truth label for the target domain. Our method has shown to be effective across various benchmarks in situations of inconsistent taxonomy settings (coarse-to-fine taxonomy and open taxonomy) and demonstrates consistent synergy effects when integrated with previous state-of-the-art UDA methods. The implementation is available at this http URL.

Title: VoxelTrack: Exploring Voxel Representation for 3D Point Cloud Object Tracking

Authors: Yuxuan Lu, Jiahao Nie, Zhiwei He, Hongjie Gu, Xudong Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02263
Pdf URL: https://arxiv.org/pdf/2408.02263
Copy Paste: [[2408.02263]] VoxelTrack: Exploring Voxel Representation for 3D Point Cloud Object Tracking(https://arxiv.org/abs/2408.02263)
Keywords: robust
Abstract: Current LiDAR point cloud-based 3D single object tracking (SOT) methods typically rely on point-based representation network. Despite demonstrated success, such networks suffer from some fundamental problems: 1) It contains pooling operation to cope with inherently disordered point clouds, hindering the capture of 3D spatial information that is useful for tracking, a regression task. 2) The adopted set abstraction operation hardly handles density-inconsistent point clouds, also preventing 3D spatial information from being modeled. To solve these problems, we introduce a novel tracking framework, termed VoxelTrack. By voxelizing inherently disordered point clouds into 3D voxels and extracting their features via sparse convolution blocks, VoxelTrack effectively models precise and robust 3D spatial information, thereby guiding accurate position prediction for tracked objects. Moreover, VoxelTrack incorporates a dual-stream encoder with cross-iterative feature fusion module to further explore fine-grained 3D spatial information for tracking. Benefiting from accurate 3D spatial information being modeled, our VoxelTrack simplifies tracking pipeline with a single regression loss. Extensive experiments are conducted on three widely-adopted datasets including KITTI, NuScenes and Waymo Open Dataset. The experimental results confirm that VoxelTrack achieves state-of-the-art performance (88.3%, 71.4% and 63.6% mean precision on the three datasets, respectively), and outperforms the existing trackers with a real-time speed of 36 Fps on a single TITAN RTX GPU. The source code and model will be released.

Title: One-Shot Collaborative Data Distillation

Authors: Rayne Holland, Chandra Thapa, Sarah Ali Siddiqui, Wei Shao, Seyit Camtepe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.02266
Pdf URL: https://arxiv.org/pdf/2408.02266
Copy Paste: [[2408.02266]] One-Shot Collaborative Data Distillation(https://arxiv.org/abs/2408.02266)
Keywords: attack
Abstract: Large machine-learning training datasets can be distilled into small collections of informative synthetic data samples. These synthetic sets support efficient model learning and reduce the communication cost of data sharing. Thus, high-fidelity distilled data can support the efficient deployment of machine learning applications in distributed network environments. A naive way to construct a synthetic set in a distributed environment is to allow each client to perform local data distillation and to merge local distillations at a central server. However, the quality of the resulting set is impaired by heterogeneity in the distributions of the local data held by clients. To overcome this challenge, we introduce the first collaborative data distillation technique, called CollabDM, which captures the global distribution of the data and requires only a single round of communication between client and server. Our method outperforms the state-of-the-art one-shot learning method on skewed data in distributed learning environments. We also show the promising practical benefits of our method when applied to attack detection in 5G networks.

Title: Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes

Authors: Dimitris Angelis, Prodromos Kolyvakis, Manos Kamarianakis, George Papagiannakis
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2408.02275
Pdf URL: https://arxiv.org/pdf/2408.02275
Copy Paste: [[2408.02275]] Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes(https://arxiv.org/abs/2408.02275)
Keywords: robust, large language model
Abstract: This paper introduces a novel integration of Large Language Models (LLMs) with Conformal Geometric Algebra (CGA) to revolutionize controllable 3D scene editing, particularly for object repositioning tasks, which traditionally requires intricate manual processes and specialized expertise. These conventional methods typically suffer from reliance on large training datasets or lack a formalized language for precise edits. Utilizing CGA as a robust formal language, our system, shenlong, precisely models spatial transformations necessary for accurate object repositioning. Leveraging the zero-shot learning capabilities of pre-trained LLMs, shenlong translates natural language instructions into CGA operations which are then applied to the scene, facilitating exact spatial transformations within 3D scenes without the need for specialized pre-training. Implemented in a realistic simulation environment, shenlong ensures compatibility with existing graphics pipelines. To accurately assess the impact of CGA, we benchmark against robust Euclidean Space baselines, evaluating both latency and accuracy. Comparative performance evaluations indicate that shenlong significantly reduces LLM response times by 16% and boosts success rates by 9.6% on average compared to the traditional methods. Notably, shenlong achieves a 100% perfect success rate in common practical queries, a benchmark where other systems fall short. These advancements underscore shenlong's potential to democratize 3D scene editing, enhancing accessibility and fostering innovation across sectors such as education, digital entertainment, and virtual reality.

Title: DRFormer: Multi-Scale Transformer Utilizing Diverse Receptive Fields for Long Time-Series Forecasting

Authors: Ruixin Ding, Yuqi Chen, Yu-Ting Lan, Wei Zhang
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2408.02279
Pdf URL: https://arxiv.org/pdf/2408.02279
Copy Paste: [[2408.02279]] DRFormer: Multi-Scale Transformer Utilizing Diverse Receptive Fields for Long Time-Series Forecasting(https://arxiv.org/abs/2408.02279)
Keywords: extraction, transformer
Abstract: Long-term time series forecasting (LTSF) has been widely applied in finance, traffic prediction, and other domains. Recently, patch-based transformers have emerged as a promising approach, segmenting data into sub-level patches that serve as input tokens. However, existing methods mostly rely on predetermined patch lengths, necessitating expert knowledge and posing challenges in capturing diverse characteristics across various scales. Moreover, time series data exhibit diverse variations and fluctuations across different temporal scales, which traditional approaches struggle to model effectively. In this paper, we propose a dynamic tokenizer with a dynamic sparse learning algorithm to capture diverse receptive fields and sparse patterns of time series data. In order to build hierarchical receptive fields, we develop a multi-scale Transformer model, coupled with multi-scale sequence extraction, capable of capturing multi-resolution features. Additionally, we introduce a group-aware rotary position encoding technique to enhance intra- and inter-group position awareness among representations across different temporal scales. Our proposed model, named DRFormer, is evaluated on various real-world datasets, and experimental results demonstrate its superiority compared to existing methods. Our code is available at: this https URL.

Title: Joint-Motion Mutual Learning for Pose Estimation in Videos

Authors: Sifan Wu, Haipeng Chen, Yifang Yin, Sihao Hu, Runyang Feng, Yingying Jiao, Ziqi Yang, Zhenguang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02285
Pdf URL: https://arxiv.org/pdf/2408.02285
Copy Paste: [[2408.02285]] Joint-Motion Mutual Learning for Pose Estimation in Videos(https://arxiv.org/abs/2408.02285)
Keywords: robust
Abstract: Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision. Nevertheless, this task remains difficult because of the complex video scenes, such as video defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, which is a by-product of the backbone generation. Comparatively, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, the performance of existing methods for pose estimation falls short due to the lack of ability to leverage both local joint (heatmap) information and global motion (feature) dynamics. To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation, which effectively concentrates on both local joint dependency and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flow to retrieve robust local joint feature. Given that local joint feature and global motion flow are complementary, we further propose a progressive joint-motion mutual learning that synergistically exchanges information and interactively learns between joint feature and motion flow to improve the capability of the model. More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective to avoid learning redundant information from multi-cues. Empirical experiments show our method outperforms prior arts on three challenging benchmarks.

Title: Generalized Gaussian Temporal Difference Error For Uncertainty-aware Reinforcement Learning

Authors: Seyeon Kim, Joonhun Lee, Namhoon Cho, Sungjun Han, Seungeon Baek
Subjects: cs.LG, cs.AI, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2408.02295
Pdf URL: https://arxiv.org/pdf/2408.02295
Copy Paste: [[2408.02295]] Generalized Gaussian Temporal Difference Error For Uncertainty-aware Reinforcement Learning(https://arxiv.org/abs/2408.02295)
Keywords: robust
Abstract: Conventional uncertainty-aware temporal difference (TD) learning methods often rely on simplistic assumptions, typically including a zero-mean Gaussian distribution for TD errors. Such oversimplification can lead to inaccurate error representations and compromised uncertainty estimation. In this paper, we introduce a novel framework for generalized Gaussian error modeling in deep reinforcement learning, applicable to both discrete and continuous control settings. Our framework enhances the flexibility of error distribution modeling by incorporating higher-order moments, particularly kurtosis, thereby improving the estimation and mitigation of data-dependent noise, i.e., aleatoric uncertainty. We examine the influence of the shape parameter of the generalized Gaussian distribution (GGD) on aleatoric uncertainty and provide a closed-form expression that demonstrates an inverse relationship between uncertainty and the shape parameter. Additionally, we propose a theoretically grounded weighting scheme to fully leverage the GGD. To address epistemic uncertainty, we enhance the batch inverse variance weighting by incorporating bias reduction and kurtosis considerations, resulting in improved robustness. Extensive experimental evaluations using policy gradient algorithms demonstrate the consistent efficacy of our method, showcasing significant performance improvements.

Title: SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models

Authors: Shujuan Zhao, Lingfeng Qiao, Kangyang Luo, Qian-Wen Zhang, Junru Lu, Di Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02302
Pdf URL: https://arxiv.org/pdf/2408.02302
Copy Paste: [[2408.02302]] SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models(https://arxiv.org/abs/2408.02302)
Keywords: large language model
Abstract: Large language models (LLMs) have become powerful tools for advancing natural language processing applications in the financial industry. However, existing financial LLMs often face challenges such as hallucinations or superficial parameter training, resulting in suboptimal performance, particularly in financial computing and machine reading comprehension (MRC). To address these issues, we propose a novel large language model specifically designed for the Chinese financial domain, named SNFinLLM. SNFinLLM excels in domain-specific tasks such as answering questions, summarizing financial research reports, analyzing sentiment, and executing financial calculations. We then perform the supervised fine-tuning (SFT) to enhance the model's proficiency across various financial domains. Specifically, we gather extensive financial data and create a high-quality instruction dataset composed of news articles, professional papers, and research reports of finance domain. Utilizing both domain-specific and general datasets, we proceed with continuous pre-training on an established open-source base model, resulting in SNFinLLM-base. Following this, we engage in supervised fine-tuning (SFT) to bolster the model's capability across multiple financial tasks. Crucially, we employ a straightforward Direct Preference Optimization (DPO) method to better align the model with human preferences. Extensive experiments conducted on finance benchmarks and our evaluation dataset demonstrate that SNFinLLM markedly outperforms other state-of-the-art financial language models. For more details, check out our demo video here: this https URL.

Title: PROF: Protected Order Flow in a Profit-Seeking World

Authors: Kushal Babel, Nerla Jean-Louis, Yan Ji, Ujval Misra, Mahimna Kelkar, Kosala Yapa Mudiyanselage, Andrew Miller, Ari Juels
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02303
Pdf URL: https://arxiv.org/pdf/2408.02303
Copy Paste: [[2408.02303]] PROF: Protected Order Flow in a Profit-Seeking World(https://arxiv.org/abs/2408.02303)
Keywords: protect
Abstract: Users of decentralized finance (DeFi) applications face significant risks from adversarial actions that manipulate the order of transactions to extract value from users. Such actions -- an adversarial form of what is called maximal-extractable value (MEV) -- impact both individual outcomes and the stability of the DeFi ecosystem. MEV exploitation, moreover, is being institutionalized through an architectural paradigm known Proposer-Builder Separation (PBS). This work introduces a system called PROF (PRotected Order Flow) that is designed to limit harmful forms of MEV in existing PBS systems. PROF aims at this goal using two ideas. First, PROF imposes an ordering on a set ("bundle") of privately input transactions and enforces that ordering all the way through to block production -- preventing transaction-order manipulation. Second, PROF creates bundles whose inclusion is profitable to block producers, thereby ensuring that bundles see timely inclusion in blocks. PROF is backward-compatible, meaning that it works with existing and future PBS designs. PROF is also compatible with any desired algorithm for ordering transactions within a PROF bundle (e.g., first-come, first-serve, fee-based, etc.). It executes efficiently, i.e., with low latency, and requires no additional trust assumptions among PBS entities. We quantitatively and qualitatively analyze incentive structure of PROF, and its utility to users compared with existing solutions. We also report on inclusion likelihood of PROF transactions, and concrete latency numbers through our end-to-end implementation.

Title: Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization

Authors: Changtao Miao, Qi Chu, Tao Gong, Zhentao Tan, Zhenchao Jin, Wanyi Zhuang, Man Luo, Honggang Hu, Nenghai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02306
Pdf URL: https://arxiv.org/pdf/2408.02306
Copy Paste: [[2408.02306]] Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization(https://arxiv.org/abs/2408.02306)
Keywords: transformer
Abstract: With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. The MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the concept of the mixture of experts to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization and the proposed \textit{MoNFAP} achieves significant performance. The codes will be made available.

Title: On the Robustness of Malware Detectors to Adversarial Samples

Authors: Muhammad Salman, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Muhammad Ikram, Sidharth Kaushik, Mohamed Ali Kaafar
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02310
Pdf URL: https://arxiv.org/pdf/2408.02310
Copy Paste: [[2408.02310]] On the Robustness of Malware Detectors to Adversarial Samples(https://arxiv.org/abs/2408.02310)
Keywords: attack, robust
Abstract: Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial examples have also been studied in malware analysis. Unlike images, program binaries cannot be arbitrarily perturbed without rendering them non-functional. Due to the difficulty of crafting adversarial program binaries, there is no consensus on the transferability of adversarially perturbed programs to different detectors. In this work, we explore the robustness of malware detectors against adversarially perturbed malware. We investigate the transferability of adversarial attacks developed against one detector, against other machine learning-based malware detectors, and code similarity techniques, specifically, locality sensitive hashing-based detectors. Our analysis reveals that adversarial program binaries crafted for one detector are generally less effective against others. We also evaluate an ensemble of detectors and show that they can potentially mitigate the impact of adversarial program binaries. Finally, we demonstrate that substantial program changes made to evade detection may result in the transformation technique being identified, implying that the adversary must make minimal changes to the program binary.

Title: A Lean Transformer Model for Dynamic Malware Analysis and Detection

Authors: Tony Quertier, Benjamin Marais, Grégoire Barrué, Stéphane Morucci, Sévan Azé, Sébastien Salladin
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02313
Pdf URL: https://arxiv.org/pdf/2408.02313
Copy Paste: [[2408.02313]] A Lean Transformer Model for Dynamic Malware Analysis and Detection(https://arxiv.org/abs/2408.02313)
Keywords: secure, security, defense, attack, transformer, generative, large language model
Abstract: Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue. This is mainly due to the fact that many prevention solutions rely on signature-based detection methods that can easily be circumvented by hackers. Therefore, there is a recurrent need for behavior-based analysis where a suspicious file is ran in a secured environment and its traces are collected to reports for analysis. Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from these execution reports. Recently, Large Language Models and Generative AI have demonstrated impressive capabilities mainly in Natural Language Processing tasks and promising applications in the cybersecurity field for both attackers and defenders. In this paper, we design an Encoder-Only model, based on the Transformers architecture, to detect malicious files, digesting their API call sequences collected by an execution emulation solution. We are also limiting the size of the model architecture and the number of its parameters since it is often considered that Large Language Models may be overkill for specific tasks such as the one we are dealing with hereafter. In addition to achieving decent detection results, this approach has the advantage of reducing our carbon footprint by limiting training and inference times and facilitating technical operations with less hardware requirements. We also carry out some analysis of our results and highlight the limits and possible improvements when using Transformers to analyze malicious files.

Title: XDC Network Assessment: Decentralization, Scalability and Security

Authors: Mohuya Chakraborty, Atul Khekade
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.02318
Pdf URL: https://arxiv.org/pdf/2408.02318
Copy Paste: [[2408.02318]] XDC Network Assessment: Decentralization, Scalability and Security(https://arxiv.org/abs/2408.02318)
Keywords: security
Abstract: XinFin, in 2019, unveiled the XDC network, an enterprise-ready hybrid blockchain platform that is open-source and specializes in tokenization for real-world decentralized finance. Overseeing the XDC network is currently the XDC Foundation, a non-profit organization established to encourage the growth, enhancement, and adoption of the XDC Network through community-driven projects such as GitHub. This whitepaper discusses the real-time assessment of the XDC network's decentralization, scalability, and security aspects as well as the Nakamoto coefficient estimation that follows, which is a measure of a decentralized system's decentralization nature that quantifies the minimal number of nodes or entities needed to compromise the system. A high coefficient denotes greater decentralization, while a low number denotes increased disruption risk. The XDC network's real-time computation of the high Nakamoto coefficient demonstrates its highly decentralized character. The article also addresses the diversity of consensus and execution clients, the host distribution, the geo-distribution, and some of the outstanding issues and business considerations.

Title: A Sharp Convergence Theory for The Probability Flow ODEs of Diffusion Models

Authors: Gen Li, Yuting Wei, Yuejie Chi, Yuxin Chen
Subjects: cs.LG, eess.SP, math.NA, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2408.02320
Pdf URL: https://arxiv.org/pdf/2408.02320
Copy Paste: [[2408.02320]] A Sharp Convergence Theory for The Probability Flow ODEs of Diffusion Models(https://arxiv.org/abs/2408.02320)
Keywords: diffusion, generative
Abstract: Diffusion models, which convert noise into new data instances by learning to reverse a diffusion process, have become a cornerstone in contemporary generative modeling. In this work, we develop non-asymptotic convergence theory for a popular diffusion-based sampler (i.e., the probability flow ODE sampler) in discrete time, assuming access to $\ell_2$-accurate estimates of the (Stein) score functions. For distributions in $\mathbb{R}^d$, we prove that $d/\varepsilon$ iterations -- modulo some logarithmic and lower-order terms -- are sufficient to approximate the target distribution to within $\varepsilon$ total-variation distance. This is the first result establishing nearly linear dimension-dependency (in $d$) for the probability flow ODE sampler. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results also characterize how $\ell_2$ score estimation errors affect the quality of the data generation processes. In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach without the need of resorting to SDE and ODE toolboxes.

Title: From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection

Authors: Syafiq Al Atiiq, Christian Gehrmann, Kevin Dahlén, Karim Khalil
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2408.02329
Pdf URL: https://arxiv.org/pdf/2408.02329
Copy Paste: [[2408.02329]] From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection(https://arxiv.org/abs/2408.02329)
Keywords: large language model
Abstract: Vulnerability Detection (VD) using machine learning faces a significant challenge: the vast diversity of vulnerability types. Each Common Weakness Enumeration (CWE) represents a unique category of vulnerabilities with distinct characteristics, code semantics, and patterns. Treating all vulnerabilities as a single label with a binary classification approach may oversimplify the problem, as it fails to capture the nuances and context-specific to each CWE. As a result, a single binary classifier might merely rely on superficial text patterns rather than understanding the intricacies of each vulnerability type. Recent reports showed that even the state-of-the-art Large Language Model (LLM) with hundreds of billions of parameters struggles to generalize well to detect vulnerabilities. Our work investigates a different approach that leverages CWE-specific classifiers to address the heterogeneity of vulnerability types. We hypothesize that training separate classifiers for each CWE will enable the models to capture the unique characteristics and code semantics associated with each vulnerability category. To confirm this, we conduct an ablation study by training individual classifiers for each CWE and evaluating their performance independently. Our results demonstrate that CWE-specific classifiers outperform a single binary classifier trained on all vulnerabilities. Building upon this, we explore strategies to combine them into a unified vulnerability detection system using a multiclass approach. Even if the lack of large and high-quality datasets for vulnerability detection is still a major obstacle, our results show that multiclass detection can be a better path toward practical vulnerability detection in the future. All our models and code to produce our results are open-sourced.

Title: Infusing Environmental Captions for Long-Form Video Language Grounding

Authors: Hyogun Lee, Soyeon Hong, Mujeen Sung, Jinwoo Choi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02336
Pdf URL: https://arxiv.org/pdf/2408.02336
Copy Paste: [[2408.02336]] Infusing Environmental Captions for Long-Form Video Language Grounding(https://arxiv.org/abs/2408.02336)
Keywords: robust, large language model
Abstract: In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods are prone to fall into superficial cues learned from small-scale datasets, even when they are within irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experiences, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on a challenging EgoNLQ benchmark.

Title: Machine Learning Applications in Medical Prognostics: A Comprehensive Review

Authors: Michael Fascia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.02344
Pdf URL: https://arxiv.org/pdf/2408.02344
Copy Paste: [[2408.02344]] Machine Learning Applications in Medical Prognostics: A Comprehensive Review(https://arxiv.org/abs/2408.02344)
Keywords: privacy, robust, interpretability
Abstract: Machine learning (ML) has revolutionized medical prognostics by integrating advanced algorithms with clinical data to enhance disease prediction, risk assessment, and patient outcome forecasting. This comprehensive review critically examines the application of various ML techniques in medical prognostics, focusing on their efficacy, challenges, and future directions. The methodologies discussed include Random Forest (RF) for sepsis prediction, logistic regression for cardiovascular risk assessment, Convolutional Neural Networks (CNNs) for cancer detection, and Long Short-Term Memory (LSTM) networks for predicting clinical deterioration. RF models demonstrate robust performance in handling high-dimensional data and capturing non-linear relationships, making them particularly effective for sepsis prediction. Logistic regression remains valuable for its interpretability and ease of use in cardiovascular risk assessment. CNNs have shown exceptional accuracy in cancer detection, leveraging their ability to learn complex visual patterns from medical imaging. LSTM networks excel in analyzing temporal data, providing accurate predictions of clinical deterioration. The review highlights the strengths and limitations of each technique, the importance of model interpretability, and the challenges of data quality and privacy. Future research directions include the integration of multi-modal data sources, the application of transfer learning, and the development of continuous learning systems. These advancements aim to enhance the predictive power and clinical applicability of ML models, ultimately improving patient outcomes in healthcare settings.

Title: Earth System Data Cubes: Avenues for advancing Earth system research

Authors: David Montero, Guido Kraemer, Anca Anghelea, César Aybar, Gunnar Brandt, Gustau Camps-Valls, Felix Cremer, Ida Flik, Fabian Gans, Sarah Habershon, Chaonan Ji, Teja Kattenborn, Laura Martínez-Ferrer, Francesco Martinuzzi, Martin Reinhardt, Maximilian Söchting, Khalil Teber, Miguel D. Mahecha
Subjects: cs.CV, cs.DB
Abstract URL: https://arxiv.org/abs/2408.02348
Pdf URL: https://arxiv.org/pdf/2408.02348
Copy Paste: [[2408.02348]] Earth System Data Cubes: Avenues for advancing Earth system research(https://arxiv.org/abs/2408.02348)
Keywords: robust
Abstract: Recent advancements in Earth system science have been marked by the exponential increase in the availability of diverse, multivariate datasets characterised by moderate to high spatio-temporal resolutions. Earth System Data Cubes (ESDCs) have emerged as one suitable solution for transforming this flood of data into a simple yet robust data structure. ESDCs achieve this by organising data into an analysis-ready format aligned with a spatio-temporal grid, facilitating user-friendly analysis and diminishing the need for extensive technical data processing knowledge. Despite these significant benefits, the completion of the entire ESDC life cycle remains a challenging task. Obstacles are not only of a technical nature but also relate to domain-specific problems in Earth system research. There exist barriers to realising the full potential of data collections in light of novel cloud-based technologies, particularly in curating data tailored for specific application domains. These include transforming data to conform to a spatio-temporal grid with minimum distortions and managing complexities such as spatio-temporal autocorrelation issues. Addressing these challenges is pivotal for the effective application of Artificial Intelligence (AI) approaches. Furthermore, adhering to open science principles for data dissemination, reproducibility, visualisation, and reuse is crucial for fostering sustainable research. Overcoming these challenges offers a substantial opportunity to advance data-driven Earth system research, unlocking the full potential of an integrated, multidimensional view of Earth system processes. This is particularly true when such research is coupled with innovative research paradigms and technological progress.

Title: Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding

Authors: Renato Vukovic, David Arps, Carel van Niekerk, Benjamin Matthias Ruppik, Hsien-Chin Lin, Michael Heck, Milica Gašić
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02361
Pdf URL: https://arxiv.org/pdf/2408.02361
Copy Paste: [[2408.02361]] Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding(https://arxiv.org/abs/2408.02361)
Keywords: extraction, generative, large language model
Abstract: State-of-the-art task-oriented dialogue systems typically rely on task-specific ontologies for fulfilling user queries. The majority of task-oriented dialogue data, such as customer service recordings, comes without ontology and annotation. Such ontologies are normally built manually, limiting the application of specialised systems. Dialogue ontology construction is an approach for automating that process and typically consists of two steps: term extraction and relation extraction. In this work, we focus on relation extraction in a transfer learning set-up. To improve the generalisation, we propose an extension to the decoding mechanism of large language models. We adapt Chain-of-Thought (CoT) decoding, recently developed for reasoning problems, to generative relation extraction. Here, we generate multiple branches in the decoding space and select the relations based on a confidence threshold. By constraining the decoding to ontology terms and relations, we aim to decrease the risk of hallucination. We conduct extensive experimentation on two widely used datasets and find improvements in performance on target ontology for source fine-tuned and one-shot prompted large language models.

Title: The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

Authors: He Wang, Lei Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02369
Pdf URL: https://arxiv.org/pdf/2408.02369
Copy Paste: [[2408.02369]] The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024(https://arxiv.org/abs/2408.02369)
Keywords: transformer
Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multiscale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, introducing Enhanced ResNet3D visual frontend, E-Branchformer encoder, and Bi-directional Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing second place in the open track of the Single-Speaker Task and first place in the other three tracks.

Title: A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models

Authors: Vanni Zavarella, Juan Carlos Gamero-Salinas, Sergio Consoli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02377
Pdf URL: https://arxiv.org/pdf/2408.02377
Copy Paste: [[2408.02377]] A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models(https://arxiv.org/abs/2408.02377)
Keywords: extraction, transformer, large language model
Abstract: Knowledge graphs (KGs) have been successfully applied to the analysis of complex scientific and technological domains, with automatic KG generation methods typically building upon relation extraction models capturing fine-grained relations between domain entities in text. While these relations are fully applicable across scientific areas, existing models are trained on few domain-specific datasets such as SciERC and do not perform well on new target domains. In this paper, we experiment with leveraging in-context learning capabilities of Large Language Models to perform schema-constrained data annotation, collecting in-domain training instances for a Transformer-based relation extraction model deployed on titles and abstracts of research papers in the Architecture, Construction, Engineering and Operations (AECO) domain. By assessing the performance gain with respect to a baseline Deep Learning architecture trained on off-domain data, we show that by using a few-shot learning strategy with structured prompts and only minimal expert annotation the presented approach can potentially support domain adaptation of a science KG generation model.

Title: Cross Psuedo Supervision Framework for Sparsely Labelled Geo-spatial Images

Authors: Yash Dixit, Naman Srivastava, Joel D Joy, Rohan Olikara, Swarup E, Rakshit Ramesh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02382
Pdf URL: https://arxiv.org/pdf/2408.02382
Copy Paste: [[2408.02382]] Cross Psuedo Supervision Framework for Sparsely Labelled Geo-spatial Images(https://arxiv.org/abs/2408.02382)
Keywords: robust, segmentation
Abstract: Land Use Land Cover (LULC) mapping is essential for urban and resource planning and is one of the key elements in developing smart and sustainable cities. This study introduces a semi-supervised segmentation model for LULC prediction using high-resolution satellite images with a huge diversity in data distributions in different areas from the country of India. Our approach ensures a robust generalization across different types of buildings, roads, trees, and water bodies within these distinct areas. We propose a modified Cross Pseudo Supervision framework to train image segmentation models on sparsely labelled data. The proposed framework addresses the limitations of the popular "Cross Pseudo Supervision" technique for semi-supervised learning. Specifically, it tackles the challenges of training segmentation models on noisy satellite image data with sparse and inaccurate labels. This comprehensive approach enhances the accuracy and utility of LULC mapping for various urban planning applications.

Title: Strategic Federated Learning: Application to Smart Meter Data Clustering

Authors: Hassan Mohamad, Chao Zhang, Samson Lasaulce, Vineeth S Varma, Mérouane Debbah, Mounir Ghogho
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2408.02384
Pdf URL: https://arxiv.org/pdf/2408.02384
Copy Paste: [[2408.02384]] Strategic Federated Learning: Application to Smart Meter Data Clustering(https://arxiv.org/abs/2408.02384)
Keywords: federate
Abstract: Federated learning (FL) involves several clients that share with a fusion center (FC), the model each client has trained with its own data. Conventional FL, which can be interpreted as an estimation or distortion-based approach, ignores the final use of model information (MI) by the FC and the other clients. In this paper, we introduce a novel FL framework in which the FC uses an aggregate version of the MI to make decisions that affect the client's utility functions. Clients cannot choose the decisions and can only use the MI reported to the FC to maximize their utility. Depending on the alignment between the client and FC utilities, the client may have an individual interest in adding strategic noise to the model. This general framework is stated and specialized to the case of clustering, in which noisy cluster representative information is reported. This is applied to the problem of power consumption scheduling. In this context, utility non-alignment occurs, for instance, when the client wants to consume when the price of electricity is low, whereas the FC wants the consumption to occur when the total power is the lowest. This is illustrated with aggregated real data from Ausgrid \cite{ausgrid}. Our numerical analysis clearly shows that the client can increase his utility by adding noise to the model reported to the FC. Corresponding results and source codes can be downloaded from \cite{source-code}.

Title: CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration

Authors: Gongxin Yao, Yixin Xuan, Xinyang Li, Yu Pan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2408.02394
Pdf URL: https://arxiv.org/pdf/2408.02394
Copy Paste: [[2408.02394]] CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration(https://arxiv.org/abs/2408.02394)
Keywords: extraction, interpretability
Abstract: Image-to-point cloud registration aims to determine the relative camera pose of an RGB image with respect to a point cloud. It plays an important role in camera localization within pre-built LiDAR maps. Despite the modality gaps, most learning-based methods establish 2D-3D point correspondences in feature space without any feedback mechanism for iterative optimization, resulting in poor accuracy and interpretability. In this paper, we propose to reformulate the registration procedure as an iterative Markov decision process, allowing for incremental adjustments to the camera pose based on each intermediate state. To achieve this, we employ reinforcement learning to develop a cross-modal registration agent (CMR-Agent), and use imitation learning to initialize its registration policy for stability and quick-start of the training. According to the cross-modal observations, we propose a 2D-3D hybrid state representation that fully exploits the fine-grained features of RGB images while reducing the useless neutral states caused by the spatial truncation of camera frustum. Additionally, the overall framework is well-designed to efficiently reuse one-shot cross-modal embeddings, avoiding repetitive and time-consuming feature extraction. Extensive experiments on the KITTI-Odometry and NuScenes datasets demonstrate that CMR-Agent achieves competitive accuracy and efficiency in registration. Once the one-shot embeddings are completed, each iteration only takes a few milliseconds.

Title: Terracorder: Sense Long and Prosper

Authors: Josh Millar, Sarab Sethi, Hamed Haddadi, Anil Madhavapeddy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.02407
Pdf URL: https://arxiv.org/pdf/2408.02407
Copy Paste: [[2408.02407]] Terracorder: Sense Long and Prosper(https://arxiv.org/abs/2408.02407)
Keywords: robust
Abstract: In-situ sensing devices need to be deployed in remote environments for long periods of time; minimizing their power consumption is vital for maximising both their operational lifetime and coverage. We introduce Terracorder -- a versatile multi-sensor device -- and showcase its exceptionally low power consumption using an on-device reinforcement learning scheduler. We prototype a unique device setup for biodiversity monitoring and compare its battery life using our scheduler against a number of fixed schedules; the scheduler captures more than 80% of events at less than 50% of the number of activations of the best-performing fixed schedule. We then explore how a collaborative scheduler can maximise the useful operation of a network of devices, improving overall network power consumption and robustness.

Title: Multi-weather Cross-view Geo-localization Using Denoising Diffusion Models

Authors: Tongtong Feng, Qing Li, Xin Wang, Mingzi Wang, Guangyao Li, Wenwu Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02408
Pdf URL: https://arxiv.org/pdf/2408.02408
Copy Paste: [[2408.02408]] Multi-weather Cross-view Geo-localization Using Denoising Diffusion Models(https://arxiv.org/abs/2408.02408)
Keywords: extraction, diffusion
Abstract: Cross-view geo-localization in GNSS-denied environments aims to determine an unknown location by matching drone-view images with the correct geo-tagged satellite-view images from a large gallery. Recent research shows that learning discriminative image representations under specific weather conditions can significantly enhance performance. However, the frequent occurrence of unseen extreme weather conditions hinders progress. This paper introduces MCGF, a Multi-weather Cross-view Geo-localization Framework designed to dynamically adapt to unseen weather conditions. MCGF establishes a joint optimization between image restoration and geo-localization using denoising diffusion models. For image restoration, MCGF incorporates a shared encoder and a lightweight restoration module to help the backbone eliminate weather-specific information. For geo-localization, MCGF uses EVA-02 as a backbone for feature extraction, with cross-entropy loss for training and cosine distance for testing. Extensive experiments on University160k-WX demonstrate that MCGF achieves competitive results for geo-localization in varying weather conditions.

Title: Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

Authors: Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Haoyang Li
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2408.02416
Pdf URL: https://arxiv.org/pdf/2408.02416
Copy Paste: [[2408.02416]] Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models(https://arxiv.org/abs/2408.02416)
Keywords: defense, attack, extraction, large language model
Abstract: The drastic increase of large language models' (LLMs) parameters has led to a new research direction of fine-tuning-free downstream customization by prompts, i.e., task descriptions. While these prompt-based services (e.g. OpenAI's GPTs) play an important role in many businesses, there has emerged growing concerns about the prompt leakage, which undermines the intellectual properties of these services and causes downstream attacks. In this paper, we analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies. By exploring the scaling laws in prompt extraction, we analyze key attributes that influence prompt extraction, including model sizes, prompt lengths, as well as the types of prompts. Then we propose two hypotheses that explain how LLMs expose their prompts. The first is attributed to the perplexity, i.e. the familiarity of LLMs to texts, whereas the second is based on the straightforward token translation path in attention matrices. To defend against such threats, we investigate whether alignments can undermine the extraction of prompts. We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks, even under the most straightforward user attacks. Therefore, we put forward several defense strategies with the inspiration of our findings, which achieve 83.8\% and 71.0\% drop in the prompt extraction rate for Llama2-7B and GPT-3.5, respectively. Source code is avaliable at \url{this https URL}.

Title: Attenuation-adjusted deep learning of pore defects in 2D radiographs of additive manufacturing powders

Authors: Andreas Bjerregaard, David Schumacher, Jon Sporring
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2408.02427
Pdf URL: https://arxiv.org/pdf/2408.02427
Copy Paste: [[2408.02427]] Attenuation-adjusted deep learning of pore defects in 2D radiographs of additive manufacturing powders(https://arxiv.org/abs/2408.02427)
Keywords: segmentation
Abstract: The presence of gas pores in metal feedstock powder for additive manufacturing greatly affects the final AM product. Since current porosity analysis often involves lengthy X-ray computed tomography (XCT) scans with a full rotation around the sample, motivation exists to explore methods that allow for high throughput -- possibly enabling in-line porosity analysis during manufacturing. Through labelling pore pixels on single 2D radiographs of powders, this work seeks to simulate such future efficient setups. High segmentation accuracy is achieved by combining a model of X-ray attenuation through particles with a variant of the widely applied UNet architecture; notably, F1-score increases by $11.4\%$ compared to the baseline UNet. The proposed pore segmentation is enabled by: 1) pretraining on synthetic data, 2) making tight particle cutouts, and 3) subtracting an ideal particle without pores generated from a distance map inspired by Lambert-Beers law. This paper explores four image processing methods, where the fastest (yet still unoptimized) segments a particle in mean $0.014s$ time with F1-score $0.78$, and the most accurate in $0.291s$ with F1-score $0.87$. Due to their scalable nature, these strategies can be involved in making high throughput porosity analysis of metal feedstock powder for additive manufacturing.

Title: Long Input Benchmark for Russian Analysis

Authors: Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02439
Pdf URL: https://arxiv.org/pdf/2408.02439
Copy Paste: [[2408.02439]] Long Input Benchmark for Russian Analysis(https://arxiv.org/abs/2408.02439)
Keywords: large language model
Abstract: Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of long-context understanding. To address this need for the Russian language, we propose LIBRA (Long Input Benchmark for Russian Analysis), which comprises 21 adapted datasets to study the LLM's abilities to understand long texts thoroughly. The tests are divided into four complexity groups and allow the evaluation of models across various context lengths ranging from 4k up to 128k tokens. We provide the open-source datasets, codebase, and public leaderboard for LIBRA to guide forthcoming research.

Title: Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Authors: Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02442
Pdf URL: https://arxiv.org/pdf/2408.02442
Copy Paste: [[2408.02442]] Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models(https://arxiv.org/abs/2408.02442)
Keywords: large language model
Abstract: Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs' performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs' reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

Title: Enhancing Heterogeneous Knowledge Graph Completion with a Novel GAT-based Approach

Authors: Wanxu Wei, Yitong Song, Bin Yao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02456
Pdf URL: https://arxiv.org/pdf/2408.02456
Copy Paste: [[2408.02456]] Enhancing Heterogeneous Knowledge Graph Completion with a Novel GAT-based Approach(https://arxiv.org/abs/2408.02456)
Keywords: robust
Abstract: Knowledge graphs (KGs) play a vital role in enhancing search results and recommendation systems. With the rapid increase in the size of the KGs, they are becoming inaccuracy and incomplete. This problem can be solved by the knowledge graph completion methods, of which graph attention network (GAT)-based methods stand out since their superior performance. However, existing GAT-based knowledge graph completion methods often suffer from overfitting issues when dealing with heterogeneous knowledge graphs, primarily due to the unbalanced number of samples. Additionally, these methods demonstrate poor performance in predicting the tail (head) entity that shares the same relation and head (tail) entity with others. To solve these problems, we propose GATH, a novel GAT-based method designed for Heterogeneous KGs. GATH incorporates two separate attention network modules that work synergistically to predict the missing entities. We also introduce novel encoding and feature transformation approaches, enabling the robust performance of GATH in scenarios with imbalanced samples. Comprehensive experiments are conducted to evaluate the GATH's performance. Compared with the existing SOTA GAT-based model on Hits@10 and MRR metrics, our model improves performance by 5.2% and 5.2% on the FB15K-237 dataset, and by 4.5% and 14.6% on the WN18RR dataset, respectively.

Title: Fairness and Bias Mitigation in Computer Vision: A Survey

Authors: Sepehr Dehdashtian, Ruozhen He, Yi Li, Guha Balakrishnan, Nuno Vasconcelos, Vicente Ordonez, Vishnu Naresh Boddeti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02464
Pdf URL: https://arxiv.org/pdf/2408.02464
Copy Paste: [[2408.02464]] Fairness and Bias Mitigation in Computer Vision: A Survey(https://arxiv.org/abs/2408.02464)
Keywords: fair, generative
Abstract: Computer vision systems have witnessed rapid progress over the past two decades due to multiple advances in the field. As these systems are increasingly being deployed in high-stakes real-world applications, there is a dire need to ensure that they do not propagate or amplify any discriminatory tendencies in historical or human-curated data or inadvertently learn biases from spurious correlations. This paper presents a comprehensive survey on fairness that summarizes and sheds light on ongoing trends and successes in the context of computer vision. The topics we discuss include 1) The origin and technical definitions of fairness drawn from the wider fair machine learning literature and adjacent disciplines. 2) Work that sought to discover and analyze biases in computer vision systems. 3) A summary of methods proposed to mitigate bias in computer vision systems in recent years. 4) A comprehensive summary of resources and datasets produced by researchers to measure, analyze, and mitigate bias and enhance fairness. 5) Discussion of the field's success, continuing trends in the context of multimodal foundation and generative models, and gaps that still need to be addressed. The presented characterization should help researchers understand the importance of identifying and mitigating bias in computer vision and the state of the field and identify potential directions for future research.

Title: Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection

Authors: Ting Lei, Shaofeng Yin, Yuxin Peng, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02484
Pdf URL: https://arxiv.org/pdf/2408.02484
Copy Paste: [[2408.02484]] Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection(https://arxiv.org/abs/2408.02484)
Keywords: extraction
Abstract: Zero-shot Human-Object Interaction (HOI) detection has emerged as a frontier topic due to its capability to detect HOIs beyond a predefined set of categories. This task entails not only identifying the interactiveness of human-object pairs and localizing them but also recognizing both seen and unseen interaction categories. In this paper, we introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. This approach enhances the generalization of large foundation models, such as CLIP, when fine-tuned for HOI detection. Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction and generalizable interaction classification, respectively. Specifically, we integrate prior knowledge of different granularity into conditional vision prompts, including an input-conditioned instance prior and a global spatial pattern prior. The former encourages the image encoder to treat instances belonging to seen or potentially unseen HOI concepts equally while the latter provides representative plausible spatial configuration of the human and object under interaction. Besides, we employ language-aware prompt learning with a consistency constraint to preserve the knowledge of the large foundation model to enable better generalization in the text branch. Extensive experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings. The code and models are available at \url{this https URL}.

Title: UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Authors: Zhaowei Li, Wei Wang, YiQing Cai, Xu Qi, Pengyu Wang, Dong Zhang, Hang Song, Botian Jiang, Zhida Huang, Tao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02503
Pdf URL: https://arxiv.org/pdf/2408.02503
Copy Paste: [[2408.02503]] UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model(https://arxiv.org/abs/2408.02503)
Keywords: robust, large language model
Abstract: Significant advancements has recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and preforming reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and an 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at \url{this https URL}.

Title: Estimating Pore Location of PBF-LB/M Processes with Segmentation Models

Authors: Hans Aoyang Zhou, Jan Theunissen, Marco Kemmerling, Anas Abdelrazeq, Johannes Henrich Schleifenbaum, Robert H. Schmitt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02507
Pdf URL: https://arxiv.org/pdf/2408.02507
Copy Paste: [[2408.02507]] Estimating Pore Location of PBF-LB/M Processes with Segmentation Models(https://arxiv.org/abs/2408.02507)
Keywords: segmentation
Abstract: Reliably manufacturing defect free products is still an open challenge for Laser Powder Bed Fusion processes. Particularly, pores that occur frequently have a negative impact on mechanical properties like fatigue performance. Therefore, an accurate localisation of pores is mandatory for quality assurance, but requires time-consuming post-processing steps like computer tomography scans. Although existing solutions using in-situ monitoring data can detect pore occurrence within a layer, they are limited in their localisation precision. Therefore, we propose a pore localisation approach that estimates their position within a single layer using a Gaussian kernel density estimation. This allows segmentation models to learn the correlation between in-situ monitoring data and the derived probability distribution of pore occurrence. Within our experiments, we compare the prediction performance of different segmentation models depending on machine parameter configuration and geometry features. From our results, we conclude that our approach allows a precise localisation of pores that requires minimal data preprocessing. Our research extends the literature by providing a foundation for more precise pore detection systems.

Title: Practical Attacks against Black-box Code Completion Engines

Authors: Slobodan Jenko, Jingxuan He, Niels Mündler, Mark Vero, Martin Vechev
Subjects: cs.CR, cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2408.02509
Pdf URL: https://arxiv.org/pdf/2408.02509
Copy Paste: [[2408.02509]] Practical Attacks against Black-box Code Completion Engines(https://arxiv.org/abs/2408.02509)
Keywords: security, attack, large language model
Abstract: Modern code completion engines, powered by large language models, have demonstrated impressive capabilities to generate functionally correct code based on surrounding context. As these tools are extensively used by millions of developers, it is crucial to investigate their security implications. In this work, we present INSEC, a novel attack that directs code completion engines towards generating vulnerable code. In line with most commercial completion engines, such as GitHub Copilot, INSEC assumes only black-box query access to the targeted engine, without requiring any knowledge of the engine's internals. Our attack works by inserting a malicious attack string as a short comment in the completion input. To derive the attack string, we design a series of specialized initialization schemes and an optimization procedure for further refinement. We demonstrate the strength of INSEC not only on state-of-the-art open-source models but also on black-box commercial services such as the OpenAI API and GitHub Copilot. On a comprehensive set of security-critical test cases covering 16 CWEs across 5 programming languages, INSEC significantly increases the likelihood of the considered completion engines in generating unsafe code by >50% in absolute, while maintaining the ability in producing functionally correct code. At the same time, our attack has low resource requirements, and can be developed for a cost of well under ten USD on commodity hardware.

Title: Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Authors: Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.02544
Pdf URL: https://arxiv.org/pdf/2408.02544
Copy Paste: [[2408.02544]] Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions(https://arxiv.org/abs/2408.02544)
Keywords: large language model
Abstract: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general setting is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using our simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness (i.e., action accuracy) of multimodal agents, our findings indicate that these agents are prone to environmental distractions, resulting in unfaithful behaviors. Furthermore, we switch to the adversarial perspective and implement environment injection, demonstrating that such unfaithfulness can be exploited, leading to unexpected risks.

Title: RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Authors: Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, Peter Izsak
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02545
Pdf URL: https://arxiv.org/pdf/2408.02545
Copy Paste: [[2408.02545]] RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation(https://arxiv.org/abs/2408.02545)
Keywords: generative, large language model
Abstract: Implementing Retrieval-Augmented Generation (RAG) systems is inherently complex, requiring deep understanding of data, use cases, and intricate design decisions. Additionally, evaluating these systems presents significant challenges, necessitating assessment of both retrieval accuracy and generative quality through a multi-faceted approach. We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases. RAG Foundry integrates data creation, training, inference and evaluation into a single workflow, facilitating the creation of data-augmented datasets for training and evaluating large language models in RAG settings. This integration enables rapid prototyping and experimentation with various RAG techniques, allowing users to easily generate datasets and train RAG models using internal or specialized knowledge sources. We demonstrate the framework effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations, showcasing consistent improvements across three knowledge-intensive datasets. Code is released as open-source in this https URL.

Title: MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization

Authors: Yiwen Chen, Yikai Wang, Yihao Luo, Zhengyi Wang, Zilong Chen, Jun Zhu, Chi Zhang, Guosheng Lin
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2408.02555
Pdf URL: https://arxiv.org/pdf/2408.02555
Copy Paste: [[2408.02555]] MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization(https://arxiv.org/abs/2408.02555)
Keywords: transformer
Abstract: We introduce MeshAnything V2, an autoregressive transformer that generates Artist-Created Meshes (AM) aligned to given shapes. It can be integrated with various 3D asset production pipelines to achieve high-quality, highly controllable AM generation. MeshAnything V2 surpasses previous methods in both efficiency and performance using models of the same size. These improvements are due to our newly proposed mesh tokenization method: Adjacent Mesh Tokenization (AMT). Different from previous methods that represent each face with three vertices, AMT uses a single vertex whenever possible. Compared to previous methods, AMT requires about half the token sequence length to represent the same mesh in average. Furthermore, the token sequences from AMT are more compact and well-structured, fundamentally benefiting AM generation. Our extensive experiments show that AMT significantly improves the efficiency and performance of AM generation. Project Page: this https URL

Title: Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information

Authors: Yauwai Yim, Chunkit Chan, Tianyu Shi, Zheye Deng, Wei Fan, Tianshi Zheng, Yangqiu Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02559
Pdf URL: https://arxiv.org/pdf/2408.02559
Copy Paste: [[2408.02559]] Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information(https://arxiv.org/abs/2408.02559)
Keywords: large language model
Abstract: Large language models (LLMs) have shown success in handling simple games with imperfect information and enabling multi-agent coordination, but their ability to facilitate practical collaboration against other agents in complex, imperfect information environments, especially in a non-English environment, still needs to be explored. This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games requiring agent collaboration under imperfect information, comparing their performance to established baselines using other types of agents. We propose a Theory of Mind (ToM) planning technique that allows LLM agents to adapt their strategy against various adversaries using only game rules, current state, and historical context as input. An external tool was incorporated to mitigate the challenge of dynamic and extensive action spaces in this card game. Our results show that although a performance gap exists between current LLMs and state-of-the-art reinforcement learning (RL) models, LLMs demonstrate ToM capabilities in this game setting. It consistently improves their performance against opposing agents, suggesting their ability to understand the actions of allies and adversaries and establish collaboration with allies. To encourage further research and understanding, we have made our codebase openly accessible.

Title: Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs

Authors: Ananya Pandey, Dinesh Kumar Vishwakarma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02571
Pdf URL: https://arxiv.org/pdf/2408.02571
Copy Paste: [[2408.02571]] Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs(https://arxiv.org/abs/2408.02571)
Keywords: robust
Abstract: The emoticons are symbolic representations that generally accompany the textual content to visually enhance or summarize the true intention of a written message. Although widely utilized in the realm of social media, the core semantics of these emoticons have not been extensively explored based on multiple modalities. Incorporating textual and visual information within a single message develops an advanced way of conveying information. Hence, this research aims to analyze the relationship among sentences, visuals, and emoticons. For an orderly exposition, this paper initially provides a detailed examination of the various techniques for extracting multimodal features, emphasizing the pros and cons of each method. Through conducting a comprehensive examination of several multimodal algorithms, with specific emphasis on the fusion approaches, we have proposed a novel contrastive learning based multimodal architecture. The proposed model employs the joint training of dual-branch encoder along with the contrastive learning to accurately map text and images into a common latent space. Our key finding is that by integrating the principle of contrastive learning with that of the other two branches yields superior results. The experimental results demonstrate that our suggested methodology surpasses existing multimodal approaches in terms of accuracy and robustness. The proposed model attained an accuracy of 91% and an MCC-score of 90% while assessing emoticons using the Multimodal-Twitter Emoticon dataset acquired from Twitter. We provide evidence that deep features acquired by contrastive learning are more efficient, suggesting that the proposed fusion technique also possesses strong generalisation capabilities for recognising emoticons across several modes.

Title: Operational range bounding of spectroscopy models with anomaly detection

Authors: Luís F. Simões, Pierluigi Casale, Marília Felismino, Kai Hou Yip, Ingo P. Waldmann, Giovanna Tinetti, Theresa Lueftinger
Subjects: cs.LG, astro-ph.IM
Abstract URL: https://arxiv.org/abs/2408.02581
Pdf URL: https://arxiv.org/pdf/2408.02581
Copy Paste: [[2408.02581]] Operational range bounding of spectroscopy models with anomaly detection(https://arxiv.org/abs/2408.02581)
Keywords: extraction, explainability
Abstract: Safe operation of machine learning models requires architectures that explicitly delimit their operational ranges. We evaluate the ability of anomaly detection algorithms to provide indicators correlated with degraded model performance. By placing acceptance thresholds over such indicators, hard boundaries are formed that define the model's coverage. As a use case, we consider the extraction of exoplanetary spectra from transit light curves, specifically within the context of ESA's upcoming Ariel mission. Isolation Forests are shown to effectively identify contexts where prediction models are likely to fail. Coverage/error trade-offs are evaluated under conditions of data and concept drift. The best performance is seen when Isolation Forests model projections of the prediction model's explainability SHAP values.

Title: Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Authors: Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Pawan Goyal, Niloy Ganguly, Prasenjit Dey, Ravi Kokku
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2408.02584
Pdf URL: https://arxiv.org/pdf/2408.02584
Copy Paste: [[2408.02584]] Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization(https://arxiv.org/abs/2408.02584)
Keywords: extraction, large language model
Abstract: The ever-increasing volume of digital information necessitates efficient methods for users to extract key insights from lengthy documents. Aspect-based summarization offers a targeted approach, generating summaries focused on specific aspects within a document. Despite advancements in aspect-based summarization research, there is a continuous quest for improved model performance. Given that large language models (LLMs) have demonstrated the potential to revolutionize diverse tasks within natural language processing, particularly in the problem of summarization, this paper explores the potential of fine-tuning LLMs for the aspect-based summarization task. We evaluate the impact of fine-tuning open-source foundation LLMs, including Llama2, Mistral, Gemma and Aya, on a publicly available domain-specific aspect based summary dataset. We hypothesize that this approach will enable these models to effectively identify and extract aspect-related information, leading to superior quality aspect-based summaries compared to the state-of-the-art. We establish a comprehensive evaluation framework to compare the performance of fine-tuned LLMs against competing aspect-based summarization methods and vanilla counterparts of the fine-tuned LLMs. Our work contributes to the field of aspect-based summarization by demonstrating the efficacy of fine-tuning LLMs for generating high-quality aspect-based summaries. Furthermore, it opens doors for further exploration of using LLMs for targeted information extraction tasks across various NLP domains.

Title: Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

Authors: Sajal Aggarwal, Ananya Pandey, Dinesh Kumar Vishwakarma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02595
Pdf URL: https://arxiv.org/pdf/2408.02595
Copy Paste: [[2408.02595]] Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection(https://arxiv.org/abs/2408.02595)
Keywords: robust, extraction
Abstract: Sarcasm is a type of irony, characterized by an inherent mismatch between the literal interpretation and the intended connotation. Though sarcasm detection in text has been extensively studied, there are situations in which textual input alone might be insufficient to perceive sarcasm. The inclusion of additional contextual cues, such as images, is essential to recognize sarcasm in social media data effectively. This study presents a novel framework for multimodal sarcasm detection that can process input triplets. Two components of these triplets comprise the input text and its associated image, as provided in the datasets. Additionally, a supplementary modality is introduced in the form of descriptive image captions. The motivation behind incorporating this visual semantic representation is to more accurately capture the discrepancies between the textual and visual content, which are fundamental to the sarcasm detection task. The primary contributions of this study are: (1) a robust textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) an additional modality in the form of image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.

Title: Progressively Selective Label Enhancement for Language Model Alignment

Authors: Biao Liu, Ning Xu, Xin Geng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02599
Pdf URL: https://arxiv.org/pdf/2408.02599
Copy Paste: [[2408.02599]] Progressively Selective Label Enhancement for Language Model Alignment(https://arxiv.org/abs/2408.02599)
Keywords: large language model
Abstract: Large Language Models have demonstrated impressive capabilities in various language tasks but may produce content that misaligns with human expectations, raising ethical and legal concerns. Therefore, it is important to explore the limitations and implement restrictions on the models to ensure safety and compliance, with Reinforcement Learning from Human Feedback (RLHF) being the primary method. Due to challenges in stability and scalability with the RLHF stages, researchers are exploring alternative methods to achieve effects comparable to those of RLHF. However, these methods often depend on large high-quality datasets and inefficiently utilize generated data. To deal with this problem, we propose PSLE, i.e., Progressively Selective Label Enhancement for Language Model Alignment, a framework that fully utilizes all generated data by guiding the model with principles to align outputs with human expectations. Using a dynamically updated threshold, our approach ensures efficient data utilization by incorporating all generated responses and weighting them based on their corresponding reward scores. Experimental results on multiple datasets demonstrate the effectiveness of PSLE compared to existing language model alignment methods.

Title: LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

Authors: Yunxiang Fu, Chaoqi Chen, Yizhou Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02615
Pdf URL: https://arxiv.org/pdf/2408.02615
Copy Paste: [[2408.02615]] LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba(https://arxiv.org/abs/2408.02615)
Keywords: diffusion, transformer, generative
Abstract: Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among input tokens. However, their quadratic complexity poses significant computational challenges for long-sequence inputs. Conversely, a recent state space model called Mamba offers linear complexity by compressing a filtered global context into a hidden state. Despite its efficiency, compression inevitably leads to information loss of fine-grained local dependencies among tokens, which are crucial for effective visual generative modeling. Motivated by these observations, we introduce Local Attentional Mamba (LaMamba) blocks that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity. Leveraging the efficient U-Net architecture, our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution, all while utilizing substantially fewer GFLOPs and a comparable number of parameters. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62\% GFLOPs compared to DiT-XL/2, while achieving superior performance with comparable or fewer parameters.

Title: Language Model Can Listen While Speaking

Authors: Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Subjects: cs.CL, cs.AI, cs.HC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2408.02622
Pdf URL: https://arxiv.org/pdf/2408.02622
Copy Paste: [[2408.02622]] Language Model Can Listen While Speaking(https://arxiv.org/abs/2408.02622)
Keywords: robust
Abstract: Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.

Title: SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Authors: Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.02632
Pdf URL: https://arxiv.org/pdf/2408.02632
Copy Paste: [[2408.02632]] SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models(https://arxiv.org/abs/2408.02632)
Keywords: security, attack, robust, large language model
Abstract: As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the $\mathbf{S}\text{elf-}\mathbf{E}\text{volving }\mathbf{A}\text{dversarial }\mathbf{S}\text{afety }\mathbf{(SEAS)}$ optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to improve robustness and safety. This framework reduces reliance on manual testing and significantly enhances the security capabilities of LLMs. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and after three iterations, the Target model achieves a security level comparable to GPT-4, while the Red Team model shows a marked increase in attack success rate (ASR) against advanced models.

Title: Interactive 3D Medical Image Segmentation with SAM 2

Authors: Chuyun Shen, Wenhao Li, Yuhang Shi, Xiangfeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02635
Pdf URL: https://arxiv.org/pdf/2408.02635
Copy Paste: [[2408.02635]] Interactive 3D Medical Image Segmentation with SAM 2(https://arxiv.org/abs/2408.02635)
Keywords: robust, segmentation
Abstract: Interactive medical image segmentation (IMIS) has shown significant potential in enhancing segmentation accuracy by integrating iterative feedback from medical professionals. However, the limited availability of enough 3D medical data restricts the generalization and robustness of most IMIS methods. The Segment Anything Model (SAM), though effective for 2D images, requires expensive semi-auto slice-by-slice annotations for 3D medical images. In this paper, we explore the zero-shot capabilities of SAM 2, the next-generation Meta SAM model trained on videos, for 3D medical image segmentation. By treating sequential 2D slices of 3D images as video frames, SAM 2 can fully automatically propagate annotations from a single frame to the entire 3D volume. We propose a practical pipeline for using SAM 2 in 3D medical image segmentation and present key findings highlighting its efficiency and potential for further optimization. Concretely, numerical experiments on the BraTS2020 and the medical segmentation decathlon datasets demonstrate that SAM 2 still has a gap with supervised methods but can narrow the gap in specific settings and organ types, significantly reducing the annotation burden on medical professionals. Our code will be open-sourced and available at this https URL.

Title: Command-line Obfuscation Detection using Small Language Models

Authors: Vojtech Outrata, Michael Adam Polak, Martin Kopp
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02637
Pdf URL: https://arxiv.org/pdf/2408.02637
Copy Paste: [[2408.02637]] Command-line Obfuscation Detection using Small Language Models(https://arxiv.org/abs/2408.02637)
Keywords: security, transformer
Abstract: To avoid detection, adversaries often use command-line obfuscation. There are numerous techniques of the command-line obfuscation, all designed to alter the command-line syntax without affecting its original functionality. This variability forces most security solutions to create an exhaustive enumeration of signatures for even a single pattern. In contrast to using signatures, we have implemented a scalable NLP-based detection method that leverages a custom-trained, small transformer language model that can be applied to any source of execution logs. The evaluation on top of real-world telemetry demonstrates that our approach yields high-precision detections even on high-volume telemetry from a diverse set of environments spanning from universities and businesses to healthcare or finance. The practical value is demonstrated in a case study of real-world samples detected by our model. We show the model's superiority to signatures on established malware known to employ obfuscation and showcase previously unseen obfuscated samples detected by our model.

Title: Detection of Compromised Functions in a Serverless Cloud Environment

Authors: Danielle Lavi, Oleg Brodt, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.02641
Pdf URL: https://arxiv.org/pdf/2408.02641
Copy Paste: [[2408.02641]] Detection of Compromised Functions in a Serverless Cloud Environment(https://arxiv.org/abs/2408.02641)
Keywords: security, defense, attack
Abstract: Serverless computing is an emerging cloud paradigm with serverless functions at its core. While serverless environments enable software developers to focus on developing applications without the need to actively manage the underlying runtime infrastructure, they open the door to a wide variety of security threats that can be challenging to mitigate with existing methods. Existing security solutions do not apply to all serverless architectures, since they require significant modifications to the serverless infrastructure or rely on third-party services for the collection of more detailed data. In this paper, we present an extendable serverless security threat detection model that leverages cloud providers' native monitoring tools to detect anomalous behavior in serverless applications. Our model aims to detect compromised serverless functions by identifying post-exploitation abnormal behavior related to different types of attacks on serverless functions, and therefore, it is a last line of defense. Our approach is not tied to any specific serverless application, is agnostic to the type of threats, and is adaptable through model adjustments. To evaluate our model's performance, we developed a serverless cybersecurity testbed in an AWS cloud environment, which includes two different serverless applications and simulates a variety of attack scenarios that cover the main security threats faced by serverless functions. Our evaluation demonstrates our model's ability to detect all implemented attacks while maintaining a negligible false alarm rate.

Title: Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?

Authors: Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2408.02651
Pdf URL: https://arxiv.org/pdf/2408.02651
Copy Paste: [[2408.02651]] Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?(https://arxiv.org/abs/2408.02651)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks, but their safety and morality remain contentious due to their training on internet text corpora. To address these concerns, alignment techniques have been developed to improve the public usability and safety of LLMs. Yet, the potential for generating harmful content through these models seems to persist. This paper explores the concept of jailbreaking LLMs-reversing their alignment through adversarial triggers. Previous methods, such as soft embedding prompts, manually crafted prompts, and gradient-based automatic prompts, have had limited success on black-box models due to their requirements for model access and for producing a low variety of manually crafted prompts, making them susceptible to being blocked. This paper introduces a novel approach using reinforcement learning to optimize adversarial triggers, requiring only inference API access to the target model and a small surrogate model. Our method, which leverages a BERTScore-based reward function, enhances the transferability and effectiveness of adversarial triggers on new black-box models. We demonstrate that this approach improves the performance of adversarial triggers on a previously untested language model.

Title: On Using Quasirandom Sequences in Machine Learning for Model Weight Initialization

Authors: Andriy Miranskyy, Adam Sorrenti, Viral Thakar
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2408.02654
Pdf URL: https://arxiv.org/pdf/2408.02654
Copy Paste: [[2408.02654]] On Using Quasirandom Sequences in Machine Learning for Model Weight Initialization(https://arxiv.org/abs/2408.02654)
Keywords: transformer
Abstract: The effectiveness of training neural networks directly impacts computational costs, resource allocation, and model development timelines in machine learning applications. An optimizer's ability to train the model adequately (in terms of trained model performance) depends on the model's initial weights. Model weight initialization schemes use pseudorandom number generators (PRNGs) as a source of randomness. We investigate whether substituting PRNGs for low-discrepancy quasirandom number generators (QRNGs) -- namely Sobol' sequences -- as a source of randomness for initializers can improve model performance. We examine Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Transformer architectures trained on MNIST, CIFAR-10, and IMDB datasets using SGD and Adam optimizers. Our analysis uses ten initialization schemes: Glorot, He, Lecun (both Uniform and Normal); Orthogonal, Random Normal, Truncated Normal, and Random Uniform. Models with weights set using PRNG- and QRNG-based initializers are compared pairwise for each combination of dataset, architecture, optimizer, and initialization scheme. Our findings indicate that QRNG-based neural network initializers either reach a higher accuracy or achieve the same accuracy more quickly than PRNG-based initializers in 60% of the 120 experiments conducted. Thus, using QRNG-based initializers instead of PRNG-based initializers can speed up and improve model training.

Title: Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

Authors: Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.02657
Pdf URL: https://arxiv.org/pdf/2408.02657
Copy Paste: [[2408.02657]] Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining(https://arxiv.org/abs/2408.02657)
Keywords: diffusion, transformer, generative, segmentation
Abstract: We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Ominiponent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.