2024-08-21

Title: A Comprehensive Survey on Diffusion Models and Their Applications

Authors: Md Manjurul Ahsan, Shivakumar Raman, Yingtao Liu, Zahed Siddique
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10207
Pdf URL: https://arxiv.org/pdf/2408.10207
Copy Paste: [[2408.10207]] A Comprehensive Survey on Diffusion Models and Their Applications(https://arxiv.org/abs/2408.10207)
Keywords: diffusion
Abstract: Diffusion Models are probabilistic models that create realistic samples by simulating the diffusion process, gradually adding and removing noise from data. These models have gained popularity in domains such as image processing, speech synthesis, and natural language processing due to their ability to produce high-quality samples. As Diffusion Models are being adopted in various domains, existing literature reviews that often focus on specific areas like computer vision or medical imaging may not serve a broader audience across multiple fields. Therefore, this review presents a comprehensive overview of Diffusion Models, covering their theoretical foundations and algorithmic innovations. We highlight their applications in diverse areas such as media quality, authenticity, synthesis, image transformation, healthcare, and more. By consolidating current knowledge and identifying emerging trends, this review aims to facilitate a deeper understanding and broader adoption of Diffusion Models and provide guidelines for future researchers and practitioners across diverse disciplines.
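
The "gradually adding noise" half of the process described above can be illustrated with a minimal sketch. It assumes the common DDPM-style closed form for the forward process with a linear variance schedule; the schedule endpoints, step count, and function names are illustrative choices, not details taken from the survey.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return per-step noise variances on a linear grid (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) up to step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]                            # toy "data" vector
x_early = noise_sample(x0, 10, alpha_bars, rng)   # still close to x0
x_late = noise_sample(x0, 999, alpha_bars, rng)   # almost pure noise
```

With these assumed defaults the final `alpha_bars` entry is far below 1e-3, so `x_late` retains almost none of the original signal. A trained denoiser would then learn to reverse these steps, which is the "removing noise" half.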

Title: A Survey on Symbolic Knowledge Distillation of Large Language Models

Authors: Kamal Acharya, Alvaro Velasquez, Houbing Herbert Song
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10210
Pdf URL: https://arxiv.org/pdf/2408.10210
Copy Paste: [[2408.10210]] A Survey on Symbolic Knowledge Distillation of Large Language Models(https://arxiv.org/abs/2408.10210)
Keywords: interpretability, transformer, generative, large language model
Abstract: This survey paper delves into the emerging and critical area of symbolic knowledge distillation in Large Language Models (LLMs). As LLMs like Generative Pre-trained Transformer-3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT) continue to expand in scale and complexity, the challenge of effectively harnessing their extensive knowledge becomes paramount. This survey concentrates on the process of distilling the intricate, often implicit knowledge contained within these models into a more symbolic, explicit form. This transformation is crucial for enhancing the interpretability, efficiency, and applicability of LLMs. We categorize the existing research based on methodologies and applications, focusing on how symbolic knowledge distillation can be used to improve the transparency and functionality of smaller, more efficient Artificial Intelligence (AI) models. The survey discusses the core challenges, including maintaining the depth of knowledge in a comprehensible format, and explores the various approaches and techniques that have been developed in this field. We identify gaps in current research and potential opportunities for future advancements. This survey aims to provide a comprehensive overview of symbolic knowledge distillation in LLMs, spotlighting its significance in the progression towards more accessible and efficient AI systems.

Title: VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features

Authors: Ananya Pandey, Dinesh Kumar Vishwakarma
Subjects: cs.CV, cs.AI, cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2408.10246
Pdf URL: https://arxiv.org/pdf/2408.10246
Copy Paste: [[2408.10246]] VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features(https://arxiv.org/abs/2408.10246)
Keywords: extraction
Abstract: Various linguistic and non-linguistic clues, such as excessive emphasis on a word, a shift in the tone of voice, or an awkward expression, frequently convey sarcasm. The computer vision problem of sarcasm recognition in conversation aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue. Previously, sarcasm recognition has focused mainly on text. Still, it is critical to consider all textual information, audio stream, facial expression, and body position for reliable sarcasm identification. Hence, we propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data and an attentional tokenizer based strategy to extract the most critical context-specific information from the textual data. The following is a list of the key contributions that our experimentation has made in response to performing the task of Multi-modal Sarcasm Recognition: an attentional tokenizer branch to get beneficial features from the glossary content provided by the subtitles; a visual branch for acquiring the most prominent features from the video frames; an utterance-level feature extraction from acoustic content and a multi-headed attention based feature fusion branch to blend features obtained from multiple modalities. Extensive testing on one of the benchmark video datasets, MUStARD, yielded an accuracy of 79.86% for speaker dependent and 76.94% for speaker independent configuration, demonstrating that our approach is superior to the existing methods. We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset MUStARD++.

Title: NeRF-US: Removing Ultrasound Imaging Artifacts from Neural Radiance Fields in the Wild

Authors: Rishit Dagli, Atsuhiro Hibi, Rahul G. Krishnan, Pascal N. Tyrrell
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10258
Pdf URL: https://arxiv.org/pdf/2408.10258
Copy Paste: [[2408.10258]] NeRF-US: Removing Ultrasound Imaging Artifacts from Neural Radiance Fields in the Wild(https://arxiv.org/abs/2408.10258)
Keywords: diffusion
Abstract: Current methods for performing 3D reconstruction and novel view synthesis (NVS) in ultrasound imaging data often face severe artifacts when training NeRF-based approaches. The artifacts produced by current approaches differ from NeRF floaters in general scenes because of the unique nature of ultrasound capture. Furthermore, existing models fail to produce reasonable 3D reconstructions when ultrasound data is captured or obtained casually in uncontrolled environments, which is common in clinical settings. Consequently, existing reconstruction and NVS methods struggle to handle ultrasound motion, fail to capture intricate details, and cannot model transparent and reflective surfaces. In this work, we introduced NeRF-US, which incorporates 3D-geometry guidance for border probability and scattering density into NeRF training, while also utilizing ultrasound-specific rendering over traditional volume rendering. These 3D priors are learned through a diffusion model. Through experiments conducted on our new "Ultrasound in the Wild" dataset, we observed accurate, clinically plausible, artifact-free reconstructions.

Title: Contrastive Learning on Medical Intents for Sequential Prescription Recommendation

Authors: Arya Hadizadeh Moghaddam, Mohsen Nayebi Kerdabadi, Mei Liu, Zijun Yao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10259
Pdf URL: https://arxiv.org/pdf/2408.10259
Copy Paste: [[2408.10259]] Contrastive Learning on Medical Intents for Sequential Prescription Recommendation(https://arxiv.org/abs/2408.10259)
Keywords: transformer
Abstract: Recent advancements in sequential modeling applied to Electronic Health Records (EHR) have greatly influenced prescription recommender systems. While the recent literature on drug recommendation has shown promising performance, the study of discovering a diversity of coexisting temporal relationships at the level of medical codes over consecutive visits remains less explored. The goal of this study can be motivated from two perspectives. First, there is a need to develop a sophisticated sequential model capable of disentangling the complex relationships across sequential visits. Second, it is crucial to establish multiple and diverse health profiles for the same patient to ensure a comprehensive consideration of different medical intents in drug recommendation. To achieve this goal, we introduce Attentive Recommendation with Contrasted Intents (ARCI), a multi-level transformer-based method designed to capture the different but coexisting temporal paths across a shared sequence of visits. Specifically, we propose a novel intent-aware method with contrastive learning, that links specialized medical intents of the patients to the transformer heads for extracting distinct temporal paths associated with different health profiles. We conducted experiments on two real-world datasets for the prescription recommendation task using both ranking and classification metrics. Our results demonstrate that ARCI has outperformed the state-of-the-art prescription recommendation methods and is capable of providing interpretable insights for healthcare practitioners.

Title: Relational Graph Convolutional Networks Do Not Learn Sound Rules

Authors: Matthew Morris, David J. Tena Cucala, Bernardo Cuenca Grau, Ian Horrocks
Subjects: cs.LG, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2408.10261
Pdf URL: https://arxiv.org/pdf/2408.10261
Copy Paste: [[2408.10261]] Relational Graph Convolutional Networks Do Not Learn Sound Rules(https://arxiv.org/abs/2408.10261)
Keywords: explainability
Abstract: Graph neural networks (GNNs) are frequently used to predict missing facts in knowledge graphs (KGs). Motivated by the lack of explainability for the outputs of these models, recent work has aimed to explain their predictions using Datalog, a widely used logic-based formalism. However, such work has been restricted to certain subclasses of GNNs. In this paper, we consider one of the most popular GNN architectures for KGs, R-GCN, and we provide two methods to extract rules that explain its predictions and are sound, in the sense that each fact derived by the rules is also predicted by the GNN, for any input dataset. Furthermore, we provide a method that can verify that certain classes of Datalog rules are not sound for the R-GCN. In our experiments, we train R-GCNs on KG completion benchmarks, and we are able to verify that no Datalog rule is sound for these models, even though the models often obtain high to near-perfect accuracy. This raises some concerns about the ability of R-GCN models to generalise and about the explainability of their predictions. We further provide two variations to the training paradigm of R-GCN that encourage it to learn sound rules and find a trade-off between model accuracy and the number of learned sound rules.

Title: Kolmogorov Arnold Networks in Fraud Detection: Bridging the Gap Between Theory and Practice

Authors: Yang Lu, Felix Zhan
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2408.10263
Pdf URL: https://arxiv.org/pdf/2408.10263
Copy Paste: [[2408.10263]] Kolmogorov Arnold Networks in Fraud Detection: Bridging the Gap Between Theory and Practice(https://arxiv.org/abs/2408.10263)
Keywords: robust
Abstract: Kolmogorov Arnold Networks (KAN) are highly efficient in inference and can handle complex patterns once trained, making them desirable for production environments and ensuring a fast service experience in the finance and electronic shopping industries. However, we found that KAN, in general, is not suitable for fraud detection problems. We also discovered a quick method to determine whether a problem is solvable by KAN: if the data can be effectively separated using spline interpolation with varying intervals after applying Principal Component Analysis (PCA) to reduce the data dimensions to two, KAN can outperform most machine learning algorithms. Otherwise, it indicates KAN may not solve the problem effectively compared to other machine learning algorithms. We also propose a heuristic approach for selecting the appropriate hyperparameters for KAN to significantly accelerate training time compared to grid search hyperparameter tuning, which usually takes a month for a comprehensive grid search. Specifically, the width parameter should generally follow a pyramid structure, allowing efficient spline mixing, and k should be fixed at 15, with the grid number fixed at 5. This streamlined approach minimizes the number of evaluations required, significantly speeding up the hyperparameter tuning process while still achieving robust performance metrics.

Title: Diffusion Model for Planning: A Systematic Literature Review

Authors: Toshihide Ubukata, Jialong Li, Kenji Tei
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2408.10266
Pdf URL: https://arxiv.org/pdf/2408.10266
Copy Paste: [[2408.10266]] Diffusion Model for Planning: A Systematic Literature Review(https://arxiv.org/abs/2408.10266)
Keywords: robust, diffusion, generative
Abstract: Diffusion models, which leverage stochastic processes to capture complex data distributions effectively, have shown their performance as generative models, achieving notable success in image-related tasks through iterative denoising processes. Recently, diffusion models have been further applied and show their strong abilities in planning tasks, leading to a significant growth in related publications since 2023. To help researchers better understand the field and promote the development of the field, we conduct a systematic literature review of recent advancements in the application of diffusion models for planning. Specifically, this paper categorizes and discusses the current literature from the following perspectives: (i) relevant datasets and benchmarks used for evaluating diffusion model-based planning; (ii) fundamental studies that address aspects such as sampling efficiency; (iii) skill-centric and condition-guided planning for enhancing adaptability; (iv) safety and uncertainty managing mechanism for enhancing safety and robustness; and (v) domain-specific application such as autonomous driving. Finally, given the above literature review, we further discuss the challenges and future directions in this field.

Title: Towards Efficient Machine Learning Method for IoT DDoS Attack Detection

Authors: P Modi
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10267
Pdf URL: https://arxiv.org/pdf/2408.10267
Copy Paste: [[2408.10267]] Towards Efficient Machine Learning Method for IoT DDoS Attack Detection(https://arxiv.org/abs/2408.10267)
Keywords: security, protect, attack
Abstract: With the rise in the number of IoT devices and their users, security in IoT has become a big concern to ensure the protection from harmful security attacks. In recent years, different variants of DDoS attacks have been on the rise in IoT devices. Failure to detect DDoS attacks at the right time can result in financial and reputational loss for victim organizations. These attacks conducted with IoT devices can cause a significant downtime of applications running on the Internet. Although researchers have developed and utilized specialized models using artificial intelligence techniques, these models do not provide the best accuracy as there is always scope for improvement until 100% accuracy is attained. We propose a hybrid feature selection algorithm that selects only the most useful features and passes those features into an XGBoost model, the results of which are explained using feature importances. Our model attains an accuracy of 99.993% on the CIC IDS 2017 dataset and a recall of 97.64% on the CIC IoT 2023 dataset. Overall, this research would help researchers and implementers in the field of detecting IoT DDoS attacks by providing a more accurate and comparable model.

Title: OpenCity: Open Spatio-Temporal Foundation Models for Traffic Prediction

Authors: Zhonghang Li, Long Xia, Lei Shi, Yong Xu, Dawei Yin, Chao Huang
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2408.10269
Pdf URL: https://arxiv.org/pdf/2408.10269
Copy Paste: [[2408.10269]] OpenCity: Open Spatio-Temporal Foundation Models for Traffic Prediction(https://arxiv.org/abs/2408.10269)
Keywords: transformer
Abstract: Accurate traffic forecasting is crucial for effective urban planning and transportation management, enabling efficient resource allocation and enhanced travel experiences. However, existing models often face limitations in generalization, struggling with zero-shot prediction on unseen regions and cities, as well as diminished long-term accuracy. This is primarily due to the inherent challenges in handling the spatial and temporal heterogeneity of traffic data, coupled with the significant distribution shift across time and space. In this work, we aim to unlock new possibilities for building versatile, resilient and adaptive spatio-temporal foundation models for traffic prediction. To achieve this goal, we introduce a novel foundation model, named OpenCity, that can effectively capture and normalize the underlying spatio-temporal patterns from diverse data characteristics, facilitating zero-shot generalization across diverse urban environments. OpenCity integrates the Transformer architecture with graph neural networks to model the complex spatio-temporal dependencies in traffic data. By pre-training OpenCity on large-scale, heterogeneous traffic datasets, we enable the model to learn rich, generalizable representations that can be seamlessly applied to a wide range of traffic forecasting scenarios. Experimental results demonstrate that OpenCity exhibits exceptional zero-shot predictive performance. Moreover, OpenCity showcases promising scaling laws, suggesting the potential for developing a truly one-for-all traffic prediction solution that can adapt to new urban contexts with minimal overhead. We made our proposed OpenCity model open-source and it is available at the following link: this https URL.

Title: SEAL: Systematic Error Analysis for Value ALignment

Authors: Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2408.10270
Pdf URL: https://arxiv.org/pdf/2408.10270
Copy Paste: [[2408.10270]] SEAL: Systematic Error Analysis for Value ALignment(https://arxiv.org/abs/2408.10270)
Keywords: robust
Abstract: Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
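
The definition of alignment resistance given above (the proportion of the preference dataset where the RM fails to match the human preference) can be sketched as a simple ratio. The helper name, the toy reward function, and the choice to count ties as failures are assumptions for illustration, not details from the paper.

```python
def alignment_resistance(pairs, reward_fn):
    """Fraction of (chosen, rejected) pairs where the reward model does not
    score the human-chosen response strictly higher.

    Counting ties as failures is an assumption of this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    failures = sum(
        1 for chosen, rejected in pairs
        if reward_fn(chosen) <= reward_fn(rejected)
    )
    return failures / len(pairs)


def toy_reward(text):
    """Hypothetical stand-in reward model: longer responses score higher."""
    return len(text)


# Each tuple is (human-chosen response, human-rejected response).
toy_pairs = [
    ("a detailed answer", "meh"),
    ("ok", "a long but unhelpful reply"),
    ("fine answer", "no"),
    ("yes", "nope"),
]
rate = alignment_resistance(toy_pairs, toy_reward)  # 2 of 4 pairs fail
```

In practice `reward_fn` would wrap a real reward model's scalar score. The toy length-based scorer is only there to keep the example self-contained.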

Title: FedKBP: Federated dose prediction framework for knowledge-based planning in radiation therapy

Authors: Jingyun Chen, Martin King, Yading Yuan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10275
Pdf URL: https://arxiv.org/pdf/2408.10275
Copy Paste: [[2408.10275]] FedKBP: Federated dose prediction framework for knowledge-based planning in radiation therapy(https://arxiv.org/abs/2408.10275)
Keywords: privacy, federate
Abstract: Dose prediction plays a key role in knowledge-based planning (KBP) by automatically generating patient-specific dose distribution. Recent advances in deep learning-based dose prediction methods necessitate collaboration among data contributors for improved performance. Federated learning (FL) has emerged as a solution, enabling medical centers to jointly train deep-learning models without compromising patient data privacy. We developed the FedKBP framework to evaluate the performances of centralized, federated, and individual (i.e. separated) training of dose prediction model on the 340 plans from OpenKBP dataset. To simulate FL and individual training, we divided the data into 8 training sites. To evaluate the effect of inter-site data variation on model training, we implemented two types of case distributions: 1) Independent and identically distributed (IID), where the training and validating cases were evenly divided among the 8 sites, and 2) non-IID, where some sites have more cases than others. The results show FL consistently outperforms individual training on both model optimization speed and out-of-sample testing scores, highlighting the advantage of FL over individual training. Under IID data division, FL shows comparable performance to centralized training, underscoring FL as a promising alternative to traditional pooled-data training. Under non-IID division, larger sites outperformed smaller sites by up to 19% on testing scores, confirming the need for collaboration among data owners to achieve better prediction accuracy. Meanwhile, non-IID FL showed reduced performance as compared to IID FL, posing the need for more sophisticated FL methods beyond mere model averaging to handle data variation among participating sites.
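
The "mere model averaging" baseline mentioned above can be sketched as a sample-size-weighted average of per-site parameters, in the style of the standard FedAvg aggregation step. Whether FedKBP weights sites by case count is not stated in the abstract, so the weighting, the function name, and the toy numbers are assumptions.

```python
def federated_average(site_weights, site_sizes):
    """Average per-site parameter vectors, weighting each site by its case count."""
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("site sizes must sum to a positive number")
    num_params = len(site_weights[0])
    averaged = [0.0] * num_params
    for weights, size in zip(site_weights, site_sizes):
        if len(weights) != num_params:
            raise ValueError("all sites must share the same parameter shape")
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Three hypothetical sites, each holding a two-parameter local model.
site_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

iid = federated_average(site_weights, [10, 10, 10])      # equal-sized sites
non_iid = federated_average(site_weights, [80, 10, 10])  # one dominant site
```

With equal site sizes the result is the plain mean of the site models. With the skewed split the global model is pulled toward the largest site, which is one simple way to see why non-IID case distributions can call for more than plain averaging.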

Title: FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models

Authors: Xiaochen Wang, Jiaqi Wang, Houping Xiao, Jinghui Chen, Fenglong Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10276
Pdf URL: https://arxiv.org/pdf/2408.10276
Copy Paste: [[2408.10276]] FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models(https://arxiv.org/abs/2408.10276)
Keywords: privacy, federate
Abstract: Foundation models have demonstrated remarkable capabilities in handling diverse modalities and tasks, outperforming conventional artificial intelligence (AI) approaches that are highly task-specific and modality-reliant. In the medical domain, however, the development of comprehensive foundation models is constrained by limited access to diverse modalities and stringent privacy regulations. To address these constraints, this study introduces a novel knowledge injection approach, FedKIM, designed to scale the medical foundation model within a federated learning framework. FedKIM leverages lightweight local models to extract healthcare knowledge from private data and integrates this knowledge into a centralized foundation model using a designed adaptive Multitask Multimodal Mixture Of Experts (M3OE) module. This method not only preserves privacy but also enhances the model's ability to handle complex medical tasks involving multiple modalities. Our extensive experiments across twelve tasks in seven modalities demonstrate the effectiveness of FedKIM in various settings, highlighting its potential to scale medical foundation models without direct access to sensitive data.

Title: Increasing transformer token length with a Maximum Entropy Principle Method

Authors: R. I. Cukier
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10277
Pdf URL: https://arxiv.org/pdf/2408.10277
Copy Paste: [[2408.10277]] Increasing transformer token length with a Maximum Entropy Principle Method(https://arxiv.org/abs/2408.10277)
Keywords: transformer
Abstract: Transformers suffer from the computational overhead of their quadratic dependence on the length of sequences processed. We present three methods, all adding an intermediate step between training and inference/generation, which extend the autoregressive length of transformers. All rely on a Maximum Entropy Principle (MEP) whereby entropy is maximized in the presence of suitable constraints, accounted for by use of Lagrange Multipliers. These constraint methods extend the autoregressive character from T to 2T tokens in a linear-with-T fashion. There is overhead associated with this added step, but they should still be faster than the standard methods.
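
For orientation, the generic Maximum Entropy Principle with Lagrange multipliers has the following textbook form. The constraint functions $f_k$ and target values $F_k$ are placeholders. The abstract does not specify the paper's actual constraints, so this is a sketch of the general technique only.

```latex
\begin{aligned}
&\max_{p}\; H[p] = -\sum_i p_i \ln p_i
\quad \text{s.t.} \quad \sum_i p_i = 1, \qquad \sum_i p_i f_k(i) = F_k,\; k = 1,\dots,K, \\[4pt]
&\mathcal{L} = -\sum_i p_i \ln p_i
  - \lambda_0 \Big( \sum_i p_i - 1 \Big)
  - \sum_{k=1}^{K} \lambda_k \Big( \sum_i p_i f_k(i) - F_k \Big), \\[4pt]
&\frac{\partial \mathcal{L}}{\partial p_i}
  = -\ln p_i - 1 - \lambda_0 - \sum_{k=1}^{K} \lambda_k f_k(i) = 0
\;\;\Longrightarrow\;\;
p_i = \frac{1}{Z} \exp\!\Big( -\sum_{k=1}^{K} \lambda_k f_k(i) \Big), \\[4pt]
&Z = \sum_j \exp\!\Big( -\sum_{k=1}^{K} \lambda_k f_k(j) \Big).
\end{aligned}
```

The multipliers $\lambda_k$ are then chosen so that the constraints $\sum_i p_i f_k(i) = F_k$ hold, while $\lambda_0$ is absorbed into the normalizer $Z$.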

Title: NoRA: Nested Low-Rank Adaptation for Efficient Fine-Tuning Large Models

Authors: Cheng Lin, Lujun Li, Dezhi Li, Jie Zou, Wenhan Luo, Wei Xue, Yike Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10280
Pdf URL: https://arxiv.org/pdf/2408.10280
Copy Paste: [[2408.10280]] NoRA: Nested Low-Rank Adaptation for Efficient Fine-Tuning Large Models(https://arxiv.org/abs/2408.10280)
Keywords: large language model
Abstract: In this paper, we introduce Nested Low-Rank Adaptation (NoRA), a novel approach to parameter-efficient fine-tuning that extends the capabilities of Low-Rank Adaptation (LoRA) techniques. Vanilla LoRA overlooks pre-trained weight inheritance and still requires fine-tuning numerous parameters. To address these issues, our NoRA adopts a dual-layer nested structure with Singular Value Decomposition (SVD), effectively leveraging original matrix knowledge while reducing tunable parameters. Specifically, NoRA freezes the outer LoRA weights and utilizes an inner LoRA design, providing enhanced control over model optimization. This approach allows the model to more precisely adapt to specific tasks while maintaining a compact parameter space. Evaluations on tasks including commonsense reasoning with large language models, fine-tuning vision-language models, and subject-driven generation demonstrate NoRA's superiority over LoRA and its variants. Notably, NoRA reduces fine-tuning parameters, training time, and memory usage by 4%, 22.5%, and 20.7%, respectively, compared to LoRA on LLaMA-3 8B, while achieving 2.2% higher performance. Code will be released upon acceptance.

Title: AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

Authors: Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10284
Pdf URL: https://arxiv.org/pdf/2408.10284
Copy Paste: [[2408.10284]] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference(https://arxiv.org/abs/2408.10284)
Keywords: large language model
Abstract: Mixture-of-Experts (MoE) models are designed to enhance the efficiency of large language models (LLMs) without proportionally increasing the computational demands. However, their deployment on edge devices still faces significant challenges due to high on-demand loading overheads from managing sparsely activated experts. This paper introduces AdapMoE, an algorithm-system co-design framework for efficient MoE inference. AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads. We observe the heterogeneity of experts loading across layers and tokens, based on which we propose a sensitivity-based strategy to adjust the number of activated experts dynamically. Meanwhile, we also integrate advanced prefetching and cache management techniques to further reduce the loading latency. Through comprehensive evaluations on various platforms, we demonstrate AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35x speedup without accuracy degradation. Code is available at: this https URL.
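
The idea of adjusting the number of activated experts per token can be illustrated with a deliberately simplified sketch. It is not AdapMoE's sensitivity-based strategy, whose details the abstract does not give. It swaps in a plain gate-confidence threshold as a stand-in, and the function names and the `skip_threshold` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_top2(router_logits, skip_threshold=0.8):
    """Return [(expert_index, weight), ...] with one or two experts.

    Stand-in rule (an assumption): if the top expert already holds at least
    `skip_threshold` of the top-2 gate mass, skip the second expert so that
    it never has to be loaded for this token.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    pair_mass = probs[first] + probs[second]
    if probs[first] / pair_mass >= skip_threshold:
        return [(first, 1.0)]
    return [
        (first, probs[first] / pair_mass),
        (second, probs[second] / pair_mass),
    ]


confident = adaptive_top2([4.0, 0.5, 0.1, -1.0])  # one dominant expert
uncertain = adaptive_top2([1.0, 0.9, 0.1, -1.0])  # two close experts
```

Here `confident` activates a single expert while `uncertain` keeps both. On an edge device each skipped expert is one fewer on-demand load, which is the overhead the paper targets.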

Title: BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

Authors: Yifei Yang, Runhan Shi, Zuchao Li, Shu Jiang, Bao-Liang Lu, Yang Yang, Hai Zhao
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2408.10285
Pdf URL: https://arxiv.org/pdf/2408.10285
Copy Paste: [[2408.10285]] BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction(https://arxiv.org/abs/2408.10285)
Keywords: large language model
Abstract: Retrosynthesis analysis is pivotal yet challenging in drug discovery and organic chemistry. Despite the proliferation of computational tools over the past decade, AI-based systems often fall short in generalizing across diverse reaction types and exploring alternative synthetic pathways. This paper presents BatGPT-Chem, a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Integrating chemical tasks via a unified framework of natural language and SMILES notation, this approach synthesizes extensive instructional data from an expansive chemical database. Employing both autoregressive and bidirectional training techniques across over one hundred million instances, BatGPT-Chem captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions and exhibiting strong zero-shot capabilities. Superior to existing AI methods, our model demonstrates significant advancements in generating effective strategies for complex molecules, as validated by stringent benchmark tests. BatGPT-Chem not only boosts the efficiency and creativity of retrosynthetic analysis but also establishes a new standard for computational tools in synthetic design. This development empowers chemists to adeptly address the synthesis of novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science. We release our trial platform at this https URL.

Title: GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching

Authors: Xiao Han, Zijian Zhang, Xiangyu Zhao, Guojiang Shen, Xiangjie Kong, Xuetao Wei, Liqiang Nie, Jieping Ye
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10286
Pdf URL: https://arxiv.org/pdf/2408.10286
Copy Paste: [[2408.10286]] GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching(https://arxiv.org/abs/2408.10286)
Keywords: secure
Abstract: As urban residents demand higher travel quality, vehicle dispatch has become a critical component of online ride-hailing services. However, current vehicle dispatch systems struggle to navigate the complexities of urban traffic dynamics, including unpredictable traffic conditions, diverse driver behaviors, and fluctuating supply and demand patterns. These challenges have resulted in travel difficulties for passengers in certain areas, while many drivers in other areas are unable to secure orders, leading to a decline in the overall quality of urban transportation services. To address these issues, this paper introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs to capture hierarchical traffic states, and learns a dynamic reward function that accounts for individual driving behaviors. The framework further integrates a GPT model trained with a custom loss function to enable high-precision predictions and optimize dispatching policies in real-world scenarios. Experiments conducted on two real-world datasets demonstrate that GARLIC effectively aligns with driver behaviors while reducing the empty load rate of vehicles.

Title: Leveraging Superfluous Information in Contrastive Representation Learning

Authors: Xuechu Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10292
Pdf URL: https://arxiv.org/pdf/2408.10292
Copy Paste: [[2408.10292]] Leveraging Superfluous Information in Contrastive Representation Learning(https://arxiv.org/abs/2408.10292)
Keywords: robust, segmentation
Abstract: Contrastive representation learning, which aims to learn the shared information between different views of unlabeled data by maximizing the mutual information between them, has shown its powerful competence in self-supervised learning for downstream tasks. However, recent works have demonstrated that more estimated mutual information does not guarantee better performance in different downstream tasks. Such works inspire us to conjecture that the learned representations not only maintain task-relevant information from unlabeled data but also carry task-irrelevant information which is superfluous for downstream tasks, thus leading to performance degeneration. In this paper we show that superfluous information does exist during the conventional contrastive learning framework, and further design a new objective, namely SuperInfo, to learn robust representations by a linear combination of both predictive and superfluous information. Besides, we notice that it is feasible to tune the coefficients of introduced losses to discard task-irrelevant information, while keeping partial non-shared task-relevant information according to our SuperInfo loss. We demonstrate that learning with our loss can often outperform the traditional contrastive learning approaches on image classification, object detection and instance segmentation tasks with significant improvements.

Title: On the Identifiability of Sparse ICA without Assuming Non-Gaussianity

Authors: Ignavier Ng, Yujia Zheng, Xinshuai Dong, Kun Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2408.10353
Pdf URL: https://arxiv.org/pdf/2408.10353
Copy Paste: [[2408.10353]] On the Identifiability of Sparse ICA without Assuming Non-Gaussianity(https://arxiv.org/abs/2408.10353)
Keywords: generative
Abstract: Independent component analysis (ICA) is a fundamental statistical tool used to reveal hidden generative processes from observed data. However, traditional ICA approaches struggle with the rotational invariance inherent in Gaussian distributions, often necessitating the assumption of non-Gaussianity in the underlying sources. This may limit their applicability in broader contexts. To accommodate Gaussian sources, we develop an identifiability theory that relies on second-order statistics without imposing further preconditions on the distribution of sources, by introducing novel assumptions on the connective structure from sources to observed variables. Different from recent work that focuses on potentially restrictive connective structures, our proposed assumption of structural variability is both considerably less restrictive and provably necessary. Furthermore, we propose two estimation methods based on second-order statistics and sparsity constraint. Experimental results are provided to validate our identifiability theory and estimation methods.

Title: Diversity and stylization of the contemporary user-generated visual arts in the complexity-entropy plane

Authors: Seunghwan Kim, Byunghwee Lee, Wonjae Lee
Subjects: cs.CV, physics.data-an, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2408.10356
Pdf URL: https://arxiv.org/pdf/2408.10356
Copy Paste: [[2408.10356]] Diversity and stylization of the contemporary user-generated visual arts in the complexity-entropy plane(https://arxiv.org/abs/2408.10356)
Keywords: extraction
Abstract: The advent of computational and numerical methods in recent times has provided new avenues for analyzing art historiographical narratives and tracing the evolution of art styles therein. Here, we investigate an evolutionary process underpinning the emergence and stylization of contemporary user-generated visual art styles using the complexity-entropy (C-H) plane, which quantifies local structures in paintings. Informatizing 149,780 images curated in DeviantArt and Behance platforms from 2010 to 2020, we analyze the relationship between local information of the C-H space and multi-level image features generated by a deep neural network and a feature extraction algorithm. The results reveal significant statistical relationships between the C-H information of visual artistic styles and the dissimilarities of the multi-level image features over time within groups of artworks. By disclosing a particular C-H region where the diversity of image representations is noticeably manifested, our analyses reveal an empirical condition of emerging styles that are both novel in the C-H plane and characterized by greater stylistic diversity. Our research shows that visual art analyses combined with physics-inspired methodologies and machine learning, can provide macroscopic insights into quantitatively mapping relevant characteristics of an evolutionary process underpinning the creative stylization of uncharted visual arts of given groups and time.
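As a rough illustration of the entropy axis of a complexity-entropy plane, the sketch below computes a normalized permutation entropy over sliding 2x2 pixel windows of a grayscale image. The window size, the tie-breaking rule, and the function name `permutation_entropy_2x2` are assumptions made for illustration; the abstract does not specify the paper's exact procedure, and the complexity axis is not computed here.

```python
import math
from collections import Counter

def permutation_entropy_2x2(image):
    """Normalized Shannon entropy of ordinal patterns in sliding 2x2 windows.

    `image` is a list of equal-length rows of numbers (grayscale intensities).
    Returns a value in [0, 1]: 0 when every window has the same ordinal
    pattern, 1 when all 24 possible patterns are equally frequent.
    """
    counts = Counter()
    for r in range(len(image) - 1):
        for c in range(len(image[0]) - 1):
            window = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            # Ordinal pattern: window indices sorted by pixel value
            # (ties keep their index order because sorted() is stable).
            pattern = tuple(sorted(range(4), key=lambda i: window[i]))
            counts[pattern] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image must be at least 2x2")
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Normalize by the maximum entropy over the 4! = 24 possible patterns.
    return entropy / math.log(math.factorial(4))
```

A smooth gradient image produces a single ordinal pattern and therefore an entropy of zero, while textured or noisy images spread probability over more patterns and score higher.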

Title: Beyond Relevant Documents: A Knowledge-Intensive Approach for Query-Focused Summarization using Large Language Models

Authors: Weijia Zhang, Jia-Hong Huang, Svitlana Vakulenko, Yumo Xu, Thilina Rajapakse, Evangelos Kanoulas
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.10357
Pdf URL: https://arxiv.org/pdf/2408.10357
Copy Paste: [[2408.10357]] Beyond Relevant Documents: A Knowledge-Intensive Approach for Query-Focused Summarization using Large Language Models(https://arxiv.org/abs/2408.10357)
Keywords: large language model
Abstract: Query-focused summarization (QFS) is a fundamental task in natural language processing with broad applications, including search engines and report generation. However, traditional approaches assume the availability of relevant documents, which may not always hold in practical scenarios, especially in highly specialized topics. To address this limitation, we propose a novel knowledge-intensive approach that reframes QFS as a knowledge-intensive task setup. This approach comprises two main components: a retrieval module and a summarization controller. The retrieval module efficiently retrieves potentially relevant documents from a large-scale knowledge corpus based on the given textual query, eliminating the dependence on pre-existing document sets. The summarization controller seamlessly integrates a powerful large language model (LLM)-based summarizer with a carefully tailored prompt, ensuring the generated summary is comprehensive and relevant to the query. To assess the effectiveness of our approach, we create a new dataset, along with human-annotated relevance labels, to facilitate comprehensive evaluation covering both retrieval and summarization performance. Extensive experiments demonstrate the superior performance of our approach, particularly its ability to generate accurate summaries without relying on the availability of relevant documents initially. This underscores our method's versatility and practical applicability across diverse query scenarios.

Title: HaSPeR: An Image Repository for Hand Shadow Puppet Recognition

Authors: Syed Rifat Raiyan, Zibran Zarif Amio, Sabbir Ahmed
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10360
Pdf URL: https://arxiv.org/pdf/2408.10360
Copy Paste: [[2408.10360]] HaSPeR: An Image Repository for Hand Shadow Puppet Recognition(https://arxiv.org/abs/2408.10360)
Keywords: explainability, transformer
Abstract: Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of theatrical art and storytelling where hand shadows are projected onto flat surfaces to create illusions of living creatures. The skilled performers create these silhouettes by hand positioning, finger movements, and dexterous gestures to resemble shadows of animals and objects. Due to the lack of practitioners and a seismic shift in people's entertainment standards, this art form is on the verge of extinction. To facilitate its preservation and proliferate it to a wider audience, we introduce HaSPeR, a novel dataset consisting of 8,340 images of hand shadow puppets across 11 classes extracted from both professional and amateur hand shadow puppeteer clips. We provide a detailed statistical analysis of the dataset and employ a range of pretrained image classification models to establish baselines. Our findings show a substantial performance superiority of traditional convolutional models over attention-based transformer architectures. We also find that lightweight models, such as MobileNetV2, suited for mobile applications and embedded devices, perform comparatively well. We surmise that such low-latency architectures can be useful in developing ombromanie teaching tools, and we create a prototype application to explore this conjecture. Keeping the best-performing model InceptionV3 under the limelight, we conduct comprehensive feature-spatial, explainability, and error analyses to gain insights into its decision-making process. To the best of our knowledge, this is the first documented dataset and research endeavor to preserve this dying art for future generations, with computer vision approaches. Our code and data are publicly available.

Title: Security Risks Due to Data Persistence in Cloud FPGA Platforms

Authors: Zhehang Zhang, Bharadwaj Madabhushi, Sandip Kundu, Russell Tessier
Subjects: cs.CR, cs.AR, cs.DC
Abstract URL: https://arxiv.org/abs/2408.10374
Pdf URL: https://arxiv.org/pdf/2408.10374
Copy Paste: [[2408.10374]] Security Risks Due to Data Persistence in Cloud FPGA Platforms(https://arxiv.org/abs/2408.10374)
Keywords: security
Abstract: The integration of Field Programmable Gate Arrays (FPGAs) into cloud computing systems has become commonplace. As the operating systems used to manage these systems evolve, special consideration must be given to DRAM devices accessible by FPGAs. These devices may hold sensitive data that can become inadvertently exposed to adversaries following user logout. Although addressed in some cloud FPGA environments, automatic DRAM clearing after process termination is not automatically included in popular FPGA runtime environments nor in most proposed cloud FPGA hypervisors. In this paper, we examine DRAM data persistence in AMD/Xilinx Alveo U280 nodes that are part of the Open Cloud Testbed (OCT). Our results indicate that DDR4 DRAM is not automatically cleared following user logout from an allocated node and subsequent node users can easily obtain recognizable data from the DRAM following node reallocation over 17 minutes later. This issue is particularly relevant for systems which support FPGA multi-tenancy.

Title: Value Alignment from Unstructured Text

Authors: Inkit Padhi, Karthikeyan Natesan Ramamurthy, Prasanna Sattigeri, Manish Nagireddy, Pierre Dognin, Kush R. Varshney
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10392
Pdf URL: https://arxiv.org/pdf/2408.10392
Copy Paste: [[2408.10392]] Value Alignment from Unstructured Text(https://arxiv.org/abs/2408.10392)
Keywords: large language model
Abstract: Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages the use of scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use-cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches, as quantified through the use of automatic metrics and win rates.

Title: Evaluating Image-Based Face and Eye Tracking with Event Cameras

Authors: Khadija Iddrisu, Waseem Shariff, Noel E. O'Connor, Joseph Lemley, Suzanne Little
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10395
Pdf URL: https://arxiv.org/pdf/2408.10395
Copy Paste: [[2408.10395]] Evaluating Image-Based Face and Eye Tracking with Event Cameras(https://arxiv.org/abs/2408.10395)
Keywords: robust
Abstract: Event Cameras, also known as Neuromorphic sensors, capture changes in local light intensity at the pixel level, producing asynchronously generated data termed "events". This distinct data format mitigates common issues observed in conventional cameras, like under-sampling when capturing fast-moving objects, thereby preserving critical information that might otherwise be lost. However, leveraging this data often necessitates the development of specialized, handcrafted event representations that can integrate seamlessly with conventional Convolutional Neural Networks (CNNs), considering the unique attributes of event data. In this study, we evaluate event-based Face and Eye tracking. The core objective of our study is to showcase the viability of integrating conventional algorithms with event-based data, transformed into a frame format while preserving the unique benefits of event cameras. To validate our approach, we constructed a frame-based event dataset by simulating events between RGB frames derived from the publicly accessible Helen Dataset. We assess its utility for face and eye detection tasks through the application of GR-YOLO -- a pioneering technique derived from YOLOv3. This evaluation includes a comparative analysis with results derived from training the dataset with YOLOv8. Subsequently, the trained models were tested on real event streams from various iterations of Prophesee's event cameras and further evaluated on the Faces in Event Stream (FES) benchmark dataset. The models trained on our dataset show good prediction performance across all the datasets obtained for validation, with a best mean Average Precision score of 0.91. Additionally, the trained models demonstrated robust performance on real event camera data under varying light conditions.

Title: Parallel Processing of Point Cloud Ground Segmentation for Mechanical and Solid-State LiDARs

Authors: Xiao Zhang, Zhanhong Huang, Garcia Gonzalez Antony, Witek Jachimczyk, Xinming Huang
Subjects: cs.CV, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2408.10404
Pdf URL: https://arxiv.org/pdf/2408.10404
Copy Paste: [[2408.10404]] Parallel Processing of Point Cloud Ground Segmentation for Mechanical and Solid-State LiDARs(https://arxiv.org/abs/2408.10404)
Keywords: robust, segmentation
Abstract: In this study, we introduce a novel parallel processing framework for real-time point cloud ground segmentation on FPGA platforms, aimed at adapting LiDAR algorithms to the evolving landscape from mechanical to solid-state LiDAR (SSL) technologies. Focusing on the ground segmentation task, we explore parallel processing techniques on existing approaches and adapt them to real-world SSL data handling. We validated frame-segmentation based parallel processing methods using point-based, voxel-based, and range-image-based ground segmentation approaches on the SemanticKITTI dataset based on mechanical LiDAR. The results revealed the superior performance and robustness of the range-image method, especially in its resilience to slicing. Further, utilizing a custom dataset from our self-built Camera-SSLSS equipment, we examined regular SSL data frames and validated the effectiveness of our parallel approach for SSL sensors. Additionally, our pioneering implementation of range-image ground segmentation on FPGA for SSL sensors demonstrated significant processing speed improvements and resource efficiency, achieving processing rates up to 50.3 times faster than conventional CPU setups. These findings underscore the potential of parallel processing strategies to significantly enhance LiDAR technologies for advanced perception tasks in autonomous systems. Post-publication, both the data and the code will be made available on GitHub.

Title: CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Authors: Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10433
Pdf URL: https://arxiv.org/pdf/2408.10433
Copy Paste: [[2408.10433]] CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs(https://arxiv.org/abs/2408.10433)
Keywords: robust
Abstract: Despite recent successes, LVLMs or Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LlaVA-1.5, in all cases observing significant improvements in terms of hallucination reduction over baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
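The ranking-then-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the `build_preference_pair` function, and the `min_margin` threshold are hypothetical, and the similarity scores are assumed to have been computed beforehand by a CLIP-style image-text model.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    clip_score: float  # assumed precomputed image-text similarity

def build_preference_pair(candidates, min_margin=0.05):
    """Return (chosen, rejected) texts, or None if the pair is filtered out.

    Candidates are ranked by similarity; the top one is treated as the
    positive and the bottom one as the negative. The margin check is a
    hypothetical stand-in for the rule-based filtering mentioned in the
    abstract, whose actual rules are not given there.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.clip_score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.clip_score - worst.clip_score < min_margin:
        return None  # scores too close to give a reliable preference signal
    return best.text, worst.text
```

Pairs produced this way could then be fed to any DPO-style trainer as chosen and rejected responses for the same image and prompt.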

Title: Understanding Generative AI Content with Embedding Models

Authors: Max Vargas, Reilly Cannon, Andrew Engel, Anand D. Sarwate, Tony Chiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10437
Pdf URL: https://arxiv.org/pdf/2408.10437
Copy Paste: [[2408.10437]] Understanding Generative AI Content with Embedding Models(https://arxiv.org/abs/2408.10437)
Keywords: generative
Abstract: The construction of high-quality numerical features is critical to any quantitative data analysis. Feature engineering has been historically addressed by carefully hand-crafting data representations based on domain expertise. This work views the internal representations of modern deep neural networks (DNNs), called embeddings, as an automated form of traditional feature engineering. For trained DNNs, we show that these embeddings can reveal interpretable, high-level concepts in unstructured sample data. We use these embeddings in natural language and computer vision tasks to uncover both inherent heterogeneity in the underlying data and human-understandable explanations for it. In particular, we find empirical evidence that there is inherent separability between real data and that generated from AI models.

Title: Private Means and the Curious Incident of the Free Lunch

Authors: Jack Fitzsimons, James Honaker, Michael Shoemate, Vikrant Singhal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.10438
Pdf URL: https://arxiv.org/pdf/2408.10438
Copy Paste: [[2408.10438]] Private Means and the Curious Incident of the Free Lunch(https://arxiv.org/abs/2408.10438)
Keywords: privacy
Abstract: We show that the most well-known and fundamental building blocks of DP implementations -- sum, mean, count (and many other linear queries) -- can be released with substantially reduced noise for the same privacy guarantee. We achieve this by projecting individual data with worst-case sensitivity $R$ onto a simplex where all data now has a constant norm $R$. In this simplex, additional "free" queries can be run that are already covered by the privacy-loss of the original budgeted query, and which algebraically give additional estimates of counts or sums, and can be combined for lower final noise.

Title: Goldfish: Monolingual Language Models for 350 Languages

Authors: Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10441
Pdf URL: https://arxiv.org/pdf/2408.10441
Copy Paste: [[2408.10441]] Goldfish: Monolingual Language Models for 350 Languages(https://arxiv.org/abs/2408.10441)
Keywords: transformer
Abstract: For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. However, using FLORES perplexity as a metric, we find that these models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B). To facilitate research that focuses on low-resource languages, we pre-train and release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages. The Goldfish reach lower FLORES perplexities than BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages, despite each Goldfish model being over 10x smaller. However, the Goldfish significantly underperform larger multilingual models on reasoning benchmarks, suggesting that for low-resource languages, multilinguality primarily improves general reasoning abilities rather than basic text generation. We release models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) of text data where available. The Goldfish models are available as baselines, fine-tuning sources, or augmentations to existing models in low-resource NLP research, and they are further useful for crosslinguistic studies requiring maximally comparable models across languages.

Title: Federated Learning of Large ASR Models in the Real World

Authors: Yonghui Xiao, Yuxin Ding, Changwan Ryu, Petr Zadrazil, Francoise Beaufays
Subjects: cs.LG, cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2408.10443
Pdf URL: https://arxiv.org/pdf/2408.10443
Copy Paste: [[2408.10443]] Federated Learning of Large ASR Models in the Real World(https://arxiv.org/abs/2408.10443)
Keywords: privacy, federate
Abstract: Federated learning (FL) has shown promising results on training machine learning models with privacy preservation. However, for large models with over 100 million parameters, the training resource requirement becomes an obstacle for FL because common devices do not have enough memory and computation power to finish the FL tasks. Although efficient training methods have been proposed, it remains a challenge to train large models such as Conformer-based ASR. This paper presents a systematic solution to train the full-size ASR models of 130M parameters with FL. To our knowledge, this is the first real-world FL application of the Conformer model, which is also the largest model ever trained with FL so far. This is also the first paper showing that FL can improve the ASR model quality with a set of proposed methods to refine the quality of data and labels of clients. We demonstrate both the training efficiency and the model quality improvement in real-world experiments.

Title: The Brittleness of AI-Generated Image Watermarking Techniques: Examining Their Robustness Against Visual Paraphrasing Attacks

Authors: Niyar R Barman, Krish Sharma, Ashhar Aziz, Shashwat Bajpai, Shwetangshu Biswas, Vasu Sharma, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10446
Pdf URL: https://arxiv.org/pdf/2408.10446
Copy Paste: [[2408.10446]] The Brittleness of AI-Generated Image Watermarking Techniques: Examining Their Robustness Against Visual Paraphrasing Attacks(https://arxiv.org/abs/2408.10446)
Keywords: attack, robust, watermark, diffusion
Abstract: The rapid advancement of text-to-image generation systems, exemplified by models like Stable Diffusion, Midjourney, Imagen, and DALL-E, has heightened concerns about their potential misuse. In response, companies like Meta and Google have intensified their efforts to implement watermarking techniques on AI-generated images to curb the circulation of potentially misleading visuals. However, in this paper, we argue that current image watermarking methods are fragile and susceptible to being circumvented through visual paraphrase attacks. The proposed visual paraphraser operates in two steps. First, it generates a caption for the given image using KOSMOS-2, one of the latest state-of-the-art image captioning systems. Second, it passes both the original image and the generated caption to an image-to-image diffusion system. During the denoising step of the diffusion pipeline, the system generates a visually similar image that is guided by the text caption. The resulting image is a visual paraphrase and is free of any watermarks. Our empirical findings demonstrate that visual paraphrase attacks can effectively remove watermarks from images. This paper provides a critical assessment, empirically revealing the vulnerability of existing watermarking techniques to visual paraphrase attacks. While we do not propose solutions to this issue, this paper serves as a call to action for the scientific community to prioritize the development of more robust watermarking techniques. Our first-of-its-kind visual paraphrase dataset and accompanying code are publicly available.

Title: Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

Authors: Liu He, Yizhi Song, Hejun Huang, Daniel Aliaga, Xin Zhou
Subjects: cs.CV, cs.GR, cs.MM
Abstract URL: https://arxiv.org/abs/2408.10453
Pdf URL: https://arxiv.org/pdf/2408.10453
Copy Paste: [[2408.10453]] Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation(https://arxiv.org/abs/2408.10453)
Keywords: diffusion, large language model
Abstract: Text-to-video generation has been dominated by end-to-end diffusion-based or autoregressive models. On one hand, those novel models provide plausible versatility, but they are criticized for physical correctness, shading and illumination, camera motion, and temporal consistency. On the other hand, the film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos and animations address the aforementioned shortcomings, but producing them is extremely tedious and requires tight collaboration between movie makers and 3D rendering experts. In this paper, we introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video that best aligns with the given description. Based on film making inspiration and augmented with Blender-based movie making knowledge, the Director agent decomposes the input text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts based on customized function composing and API calling. Then, the Reviewer agent, augmented with knowledge of video reviewing, character motion coordinates, and intermediate screenshots, uses its compositional reasoning ability to provide feedback to the Programmer agent. The Programmer agent iteratively improves the scripts to yield the best overall video outcome. Our generated videos show better quality than commercial video generation models in 5 metrics on video quality and instruction-following performance. Moreover, our framework outperforms other approaches in a comprehensive user study on quality, consistency, and rationality.

Title: Differentially Private Stochastic Gradient Descent with Fixed-Size Minibatches: Tighter RDP Guarantees with or without Replacement

Authors: Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia, Jason Pacheco
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2408.10456
Pdf URL: https://arxiv.org/pdf/2408.10456
Copy Paste: [[2408.10456]] Differentially Private Stochastic Gradient Descent with Fixed-Size Minibatches: Tighter RDP Guarantees with or without Replacement(https://arxiv.org/abs/2408.10456)
Keywords: privacy, federate
Abstract: Differentially private stochastic gradient descent (DP-SGD) has been instrumental in privately training deep learning models by providing a framework to control and track the privacy loss incurred during training. At the core of this computation lies a subsampling method that uses a privacy amplification lemma to enhance the privacy guarantees provided by the additive noise. Fixed size subsampling is appealing for its constant memory usage, unlike the variable sized minibatches in Poisson subsampling. It is also of interest in addressing class imbalance and federated learning. However, the current computable guarantees for fixed-size subsampling are not tight and do not consider both add/remove and replace-one adjacency relationships. We present a new and holistic Rényi differential privacy (RDP) accountant for DP-SGD with fixed-size subsampling without replacement (FSwoR) and with replacement (FSwR). For FSwoR we consider both add/remove and replace-one adjacency. Our FSwoR results improve on the best current computable bound by a factor of $4$. We also show for the first time that the widely-used Poisson subsampling and FSwoR with replace-one adjacency have the same privacy to leading order in the sampling probability. Accordingly, our work suggests that FSwoR is often preferable to Poisson subsampling due to constant memory usage. Our FSwR accountant includes explicit non-asymptotic upper and lower bounds and, to the authors' knowledge, is the first such analysis of fixed-size RDP with replacement for DP-SGD. We analytically and empirically compare fixed size and Poisson subsampling, and show that DP-SGD gradients in a fixed-size subsampling regime exhibit lower variance in practice in addition to memory usage benefits.
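To make the memory argument concrete, the following sketch contrasts the three minibatch sampling schemes named in the abstract. It only illustrates how batch sizes behave; it does not implement any privacy accounting, and the function names and the toy values of `n` and `batch_size` are assumptions chosen for illustration.

```python
import random

def fixed_size_without_replacement(n, batch_size, rng):
    """FSwoR-style draw: exactly `batch_size` distinct indices from range(n)."""
    return rng.sample(range(n), batch_size)

def fixed_size_with_replacement(n, batch_size, rng):
    """FSwR-style draw: exactly `batch_size` indices, repeats allowed."""
    return [rng.randrange(n) for _ in range(batch_size)]

def poisson_subsample(n, rate, rng):
    """Poisson subsampling: each index is kept independently with probability `rate`."""
    return [i for i in range(n) if rng.random() < rate]

rng = random.Random(0)
n, batch_size = 1000, 50

# Record the batch size produced by each scheme over 20 draws.
fswor_sizes = [len(fixed_size_without_replacement(n, batch_size, rng)) for _ in range(20)]
fswr_sizes = [len(fixed_size_with_replacement(n, batch_size, rng)) for _ in range(20)]
poisson_sizes = [len(poisson_subsample(n, batch_size / n, rng)) for _ in range(20)]
```

The two fixed-size schemes always return 50 indices, so per-step memory is constant, whereas the Poisson batch size fluctuates around its expected value of 50.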

Title: Parkinson's Disease Classification via EEG: All You Need is a Single Convolutional Layer

Authors: Md Fahim Anjum
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2408.10457
Pdf URL: https://arxiv.org/pdf/2408.10457
Copy Paste: [[2408.10457]] Parkinson's Disease Classification via EEG: All You Need is a Single Convolutional Layer(https://arxiv.org/abs/2408.10457)
Keywords: interpretability
Abstract: In this work, we introduce LightCNN, a minimalist Convolutional Neural Network (CNN) architecture designed for Parkinson's disease (PD) classification using EEG data. LightCNN's strength lies in its simplicity, utilizing just a single convolutional layer. Embracing Leonardo da Vinci's principle that "simplicity is the ultimate sophistication," LightCNN demonstrates that complexity is not required to achieve outstanding results. We benchmarked LightCNN against several state-of-the-art deep learning models known for their effectiveness in EEG-based PD classification. Remarkably, LightCNN outperformed all these complex architectures, with a 2.3% improvement in recall, a 4.6% increase in precision, a 0.1% edge in AUC, a 4% boost in F1-score, and a 3.3% higher accuracy compared to the closest competitor. Furthermore, LightCNN identifies known pathological brain rhythms associated with PD and effectively captures clinically relevant neurophysiological changes in EEG. Its simplicity and interpretability make it ideal for deployment in resource-constrained environments, such as mobile or embedded systems for EEG analysis. In conclusion, LightCNN represents a significant step forward in efficient EEG-based PD classification, demonstrating that a well-designed, lightweight model can achieve superior performance over more complex architectures. This work underscores the potential for minimalist models to meet the needs of modern healthcare applications, particularly where resources are limited.

Title: Learning Multimodal Latent Space with EBM Prior and MCMC Inference

Authors: Shiyu Yuan, Carlo Lipizzi, Tian Han
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2408.10467
Pdf URL: https://arxiv.org/pdf/2408.10467
Copy Paste: [[2408.10467]] Learning Multimodal Latent Space with EBM Prior and MCMC Inference(https://arxiv.org/abs/2408.10467)
Keywords: generative
Abstract: Multimodal generative models are crucial for various applications. We propose an approach that combines an expressive energy-based model (EBM) prior with Markov Chain Monte Carlo (MCMC) inference in the latent space for multimodal generation. The EBM prior acts as an informative guide, while MCMC inference, specifically through short-run Langevin dynamics, brings the posterior distribution closer to its true form. This method not only provides an expressive prior to better capture the complexity of multimodality but also improves the learning of shared latent variables for more coherent generation across modalities. Our proposed method is supported by empirical experiments, underscoring the effectiveness of our EBM prior with MCMC inference in enhancing cross-modal and joint generative tasks in multimodal contexts.
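
Note: as a rough illustration of the short-run Langevin dynamics mentioned above, the sketch below runs a fixed, small number of noisy gradient steps on a one-dimensional toy density. The Gaussian target, step size, and step count are assumptions made for the example; the paper's actual latent posterior and EBM prior are not reproduced here.

```python
import random


def short_run_langevin(z0, grad_log_density, step_size, n_steps, rng):
    """Run n_steps of the update z <- z + (s^2 / 2) * grad log p(z) + s * noise."""
    z = z0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z + 0.5 * step_size ** 2 * grad_log_density(z) + step_size * noise
    return z


def grad_log_gaussian(z, mean=2.0):
    """Gradient of log N(z; mean, 1), a stand-in for the true latent posterior."""
    return -(z - mean)


rng = random.Random(0)
# Start every chain far from the target and take only a few dozen steps.
samples = [
    short_run_langevin(-5.0, grad_log_gaussian, step_size=0.4, n_steps=60, rng=rng)
    for _ in range(2000)
]
sample_mean = sum(samples) / len(samples)
print("mean of short-run samples:", sample_mean)
```

Even with a short chain, the samples drift from the poor initialization toward the target mean, which is the sense in which short-run MCMC moves an approximate posterior closer to the true one.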

Title: Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions

Authors: Jinxin Liu, Zao Yang
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2408.10468
Pdf URL: https://arxiv.org/pdf/2408.10468
Copy Paste: [[2408.10468]] Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions(https://arxiv.org/abs/2408.10468)
Keywords: privacy, robust, large language model
Abstract: The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples with large gradient norm tokens, overshadowing the actual most influential samples even if their influences are well estimated. To address this issue, we propose Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtained ground truth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA IFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on real-world pretraining data CLUECorpus2020, demonstrating strong robustness regardless of prompt and response lengths.

Title: LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

Authors: Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2408.10469
Pdf URL: https://arxiv.org/pdf/2408.10469
Copy Paste: [[2408.10469]] LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS(https://arxiv.org/abs/2408.10469)
Keywords: segmentation
Abstract: Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and reappearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the testing phase of LSVOS challenge VOS track, ranking third overall.

Title: Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

Authors: Guanchen Li, Xiandong Zhao, Lian Liu, Zeping Li, Dong Li, Lu Tian, Jie He, Ashish Sirasao, Emad Barsoum
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10473
Pdf URL: https://arxiv.org/pdf/2408.10473
Copy Paste: [[2408.10473]] Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism(https://arxiv.org/abs/2408.10473)
Keywords: robust
Abstract: Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data; however, these approaches often lead to an unavoidable reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective. We outline the pruning process in three steps. Initially, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model featuring a pruning-friendly weight distribution by reactivating pruned connections with sparse regularization. Finally, we perform a second pruning round, yielding a superior pruned model compared to the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks for OPT-125M with 2:4 sparsity.
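
Note: the 2:4 sparsity pattern named above means that at most two of every four consecutive weights are non-zero. The sketch below enforces that pattern with plain magnitude pruning, which is a simpler stand-in and not the SDS, SparseGPT, or Wanda criterion. The function name and the sample weights are hypothetical.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4 for 2:4 sparsity")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
print(prune_2_of_4(row))
```

In a Sparse-Dense-Sparse style pipeline, a pattern-enforcing step like this would be applied twice, with the densification and regularization stage described in the abstract in between.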

Title: PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting

Authors: Yongbo Yu, Weizhong Yu, Feiping Nie, Xuelong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10483
Pdf URL: https://arxiv.org/pdf/2408.10483
Copy Paste: [[2408.10483]] PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting(https://arxiv.org/abs/2408.10483)
Keywords: robust, transformer
Abstract: The self-attention mechanism in Transformer architecture, invariant to sequence order, necessitates positional embeddings to encode temporal order in time series prediction. We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences, particularly when employing longer lookback windows. To address this, we introduce an innovative approach that combines Pyramid RNN embeddings (PRE) for univariate time series with the Transformer's capability to model multivariate dependencies. PRE, utilizing pyramidal one-dimensional convolutional layers, constructs multiscale convolutional features that preserve temporal order. Additionally, RNNs, layered atop these features, learn multiscale time series representations sensitive to sequence order. This integration into Transformer models with attention mechanisms results in significant performance enhancements. We present the PRformer, a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets. This performance highlights the effectiveness of our approach in leveraging longer lookback windows and underscores the critical role of robust temporal representations in maximizing Transformer's potential for prediction tasks. Code is available at this repository: this https URL.

Title: MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Authors: Xiao Wang, Chao wang, Shiao Wang, Xixi Wang, Zhicheng Zhao, Lin Zhu, Bo Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10487
Pdf URL: https://arxiv.org/pdf/2408.10487
Copy Paste: [[2408.10487]] MambaEVT: Event Stream based Visual Object Tracking using State Space Model(https://arxiv.org/abs/2408.10487)
Keywords: extraction, transformer
Abstract: Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released on this https URL

Title: Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

Authors: Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang
Subjects: cs.CV, cs.AI, cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2408.10488
Pdf URL: https://arxiv.org/pdf/2408.10488
Copy Paste: [[2408.10488]] Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm(https://arxiv.org/abs/2408.10488)
Keywords: privacy, protect, fair
Abstract: Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Unlike traditional SLT based on visible light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well. Additionally, due to their sparsity in space, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model's ability to integrate temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released on this https URL

Title: QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention

Authors: Yihang Wang, Xu Huang, Bowen Tian, Yixing Fan, Jiafeng Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10497
Pdf URL: https://arxiv.org/pdf/2408.10497
Copy Paste: [[2408.10497]] QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention(https://arxiv.org/abs/2408.10497)
Keywords: generative
Abstract: Generative LLMs have achieved significant success in various industrial tasks and can effectively adapt to vertical domains and downstream tasks through ICL. However, with tasks becoming increasingly complex, the context length required by ICL is also getting longer, and two significant issues arise: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem. Recently, compressing prompts by removing tokens according to some metric obtained from some causal language models, such as llama-7b, has emerged as an effective approach to mitigate these issues. However, the metrics used by prior methods, such as self-information or PPL, do not fully align with the objective of distinguishing the most important tokens when conditioning on the query. In this work, we introduce information bottleneck theory to carefully examine the properties required by the metric. Inspired by this, we use cross-attention in encoder-decoder architecture as a new metric. Our simple method leads to significantly better performance in smaller models with lower latency. We evaluate our method on four datasets: DROP, CoQA, SQuAD, and Quoref. The experimental results show that, while maintaining the same performance, our compression rate can improve by nearly 25% over previous SOTA. Remarkably, in experiments where 25% of the tokens are removed, our model's EM score for answers sometimes even exceeds that of the control group using uncompressed text as context.
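
Note: the general recipe of metric-based prompt compression described above can be sketched as keeping the highest-scoring tokens in their original order. The per-token scores below are made-up numbers standing in for whatever metric is used (self-information, PPL, or the cross-attention weights the paper proposes); the function name and example are hypothetical.

```python
def compress_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by score, then restore document order among the survivors.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


tokens = ["the", "capital", "of", "France", "is", "Paris", "indeed", "."]
scores = [0.10, 0.80, 0.20, 0.90, 0.30, 0.95, 0.05, 0.15]  # assumed importance
print(compress_tokens(tokens, scores, keep_ratio=0.75))
```

A keep ratio of 0.75 corresponds to the setting in the abstract where 25% of the tokens are removed.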

Title: Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers

Authors: Thanh Thi Nguyen, Campbell Wilson, Janis Dalins
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10503
Pdf URL: https://arxiv.org/pdf/2408.10503
Copy Paste: [[2408.10503]] Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers(https://arxiv.org/abs/2408.10503)
Keywords: explainability, transformer
Abstract: Assessing the forensic value of hand images involves the use of unique features and patterns present in an individual's hand. The human hand has distinct characteristics, such as the pattern of veins, fingerprints, and the geometry of the hand itself. This paper investigates the use of vision transformers (ViTs) for classification of hand images. We use explainability tools to explore the internal representations of ViTs and assess their impact on the model outputs. Utilizing the internal understanding of ViTs, we introduce distillation methods that allow a student model to adaptively extract knowledge from a teacher model while learning on data of a different domain to prevent catastrophic forgetting. Two publicly available hand image datasets are used to conduct a series of experiments to evaluate performance of the ViTs and our proposed adaptive distillation methods. The experimental results demonstrate that ViT models significantly outperform traditional machine learning methods and the internal states of ViTs are useful for explaining the model outputs in the classification task. By averting catastrophic forgetting, our distillation methods achieve excellent performance on data from both source and target domains, particularly when these two domains exhibit significant dissimilarity. The proposed approaches therefore can be developed and implemented effectively for real-world applications such as access control, identity verification, and authentication systems.

Title: Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups

Authors: Zhiyang Qi, Michimasa Inaba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10516
Pdf URL: https://arxiv.org/pdf/2408.10516
Copy Paste: [[2408.10516]] Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups(https://arxiv.org/abs/2408.10516)
Keywords: large language model
Abstract: This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors, particularly minors, in scenarios where data are scarce. We propose a novel data augmentation framework to enhance SDS performance for user groups with limited resources. Our approach leverages a large language model (LLM) to extract speaker styles and a pre-trained language model (PLM) to simulate dialogue act history. This method generates enriched and personalized dialogue data, facilitating improved interactions with unique user demographics. Extensive experiments validate the efficacy of our methodology, highlighting its potential to foster the development of more adaptive and inclusive dialogue systems.

Title: Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba

Authors: Wall Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10517
Pdf URL: https://arxiv.org/pdf/2408.10517
Copy Paste: [[2408.10517]] Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba(https://arxiv.org/abs/2408.10517)
Keywords: transformer
Abstract: Return-Conditioned Transformer Decision Models (RCTDM) have demonstrated the potential to enhance transformer performance in offline reinforcement learning by replacing rewards in the input sequence with returns-to-go. However, to achieve the goal of learning an optimal policy from offline datasets composed of limited suboptimal trajectories, RCTDM required alternative methods. One prominent approach, trajectory stitching, was designed to enable the network to combine multiple trajectories to find the optimal path. To implement this using only transformers without auxiliary networks, it was necessary to shorten the input sequence length to better capture the Markov property in reinforcement learning. This, however, introduced a trade-off, as it reduced the accuracy of action inference. Our study introduces a model named Decision MetaMamba to resolve these challenges. DMM employs an input token mixer to extract patterns from short sequences and uses a State Space Model (SSM) to selectively combine information from relatively distant sequences. Inspired by Metaformer, this structure was developed by transforming Mamba's input layer into various multi-modal layers. Fortunately, with the advent of Mamba, implemented using parallel selective scanning, we achieved a high-performance sequence model capable of replacing transformers. Based on these innovations, DMM demonstrated excellent performance across various datasets in offline RL, confirming that models using SSM can improve performance by domain-specific alterations of the input layer. Additionally, it maintained its performance even in lightweight models with fewer parameters. These results suggest that decision models based on SSM can pave the way for improved outcomes in future developments.
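
Note: the returns-to-go that replace rewards in the input sequence are the sums of rewards from each timestep to the end of the trajectory. A minimal sketch follows; the optional discount argument and the sample rewards are assumptions added for illustration.

```python
def returns_to_go(rewards, discount=1.0):
    """Return, for each timestep t, the (optionally discounted) sum of rewards from t on."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last timestep backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        rtg[t] = running
    return rtg


print(returns_to_go([1.0, 0.0, 2.0, 3.0]))  # [6.0, 5.0, 5.0, 3.0]
```

Conditioning on these values lets the model be prompted with a desired return at inference time instead of observing past rewards.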

Title: EdgeNAT: Transformer for Efficient Edge Detection

Authors: Jinghuai Jie, Yan Guo, Guixing Wu, Junmin Wu, Baojian Hua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10527
Pdf URL: https://arxiv.org/pdf/2408.10527
Copy Paste: [[2408.10527]] EdgeNAT: Transformer for Efficient Edge Detection(https://arxiv.org/abs/2408.10527)
Keywords: extraction, transformer
Abstract: Transformers, renowned for their powerful feature extraction capabilities, have played an increasingly prominent role in various vision tasks. In particular, recent advances have introduced transformers with hierarchical structures, such as Dilated Neighborhood Attention Transformer (DiNAT), which demonstrate an outstanding ability to efficiently capture both global and local features. However, transformers' application in edge detection has not been fully exploited. In this paper, we propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder, capable of extracting object boundaries and meaningful edges both accurately and efficiently. On the one hand, EdgeNAT captures global contextual information and detailed local cues with DiNAT; on the other hand, it enhances feature representation with a novel SCAF-MLA decoder by utilizing both inter-spatial and inter-channel relationships of feature maps. Extensive experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images. Notably, on the widely used BSDS500 dataset, our L model achieves impressive performance, with ODS F-measure and OIS F-measure of 86.0% and 87.6% for multi-scale input, and 84.9% and 86.3% for single-scale input, surpassing the current state-of-the-art EDTER by 1.2%, 1.1%, 1.7%, and 1.6%, respectively. Moreover, as for throughput, our approach runs at 20.87 FPS on RTX 4090 GPU with single-scale input. The code for our method will be released soon.

Title: FAGStyle: Feature Augmentation on Geodesic Surface for Zero-shot Text-guided Diffusion Image Style Transfer

Authors: Yuexing Han, Liheng Ruan, Bing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10533
Pdf URL: https://arxiv.org/pdf/2408.10533
Copy Paste: [[2408.10533]] FAGStyle: Feature Augmentation on Geodesic Surface for Zero-shot Text-guided Diffusion Image Style Transfer(https://arxiv.org/abs/2408.10533)
Keywords: diffusion
Abstract: The goal of image style transfer is to render an image guided by a style reference while maintaining the original content. Existing image-guided methods rely on specific style reference images, restricting their wider application and potentially compromising result quality. As a flexible alternative, text-guided methods allow users to describe the desired style using text prompts. Despite their versatility, these methods often struggle with maintaining style consistency, reflecting the described style accurately, and preserving the content of the target image. To address these challenges, we introduce FAGStyle, a zero-shot text-guided diffusion image style transfer method. Our approach enhances inter-patch information interaction by incorporating the Sliding Window Crop technique and Feature Augmentation on Geodesic Surface into our style control loss. Furthermore, we integrate a Pre-Shape self-correlation consistency loss to ensure content consistency. FAGStyle demonstrates superior performance over existing methods, consistently achieving stylization that retains the semantic content of the source image. Experimental results confirm the efficacy of FAGStyle across a diverse range of source contents and styles, both imagined and common.

Title: Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation

Authors: Jiawei Han, Kaiqi Liu, Wei Li, Guangzhi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10537
Pdf URL: https://arxiv.org/pdf/2408.10537
Copy Paste: [[2408.10537]] Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation(https://arxiv.org/abs/2408.10537)
Keywords: segmentation
Abstract: Point cloud semantic segmentation can significantly enhance the perception of an intelligent agent. Nevertheless, the discriminative capability of the segmentation network is influenced by the quantity of samples available for different categories. To mitigate the cognitive bias induced by class imbalance, this paper introduces a novel method, namely subspace prototype guidance (SPG), to guide the training of the segmentation network. Specifically, the point cloud is initially separated into independent point sets by category to provide initial conditions for the generation of feature subspaces. The auxiliary branch, which consists of an encoder and a projection head, maps these point sets into separate feature subspaces. Subsequently, the feature prototypes, which are extracted from the current separate subspaces and then combined with prototypes of historical subspaces, guide the feature space of the main branch to enhance the discriminability of features of minority categories. The prototypes derived from the feature space of the main branch are also employed to guide the training of the auxiliary branch, forming a supervisory loop to maintain consistent convergence of the entire network. The experiments conducted on the large public benchmarks (i.e. S3DIS, ScanNet v2, ScanNet200, Toronto-3D) and collected real-world data illustrate that the proposed method significantly improves the segmentation performance and surpasses the state-of-the-art method. The code is available at this https URL.

Title: The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution

Authors: Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10541
Pdf URL: https://arxiv.org/pdf/2408.10541
Copy Paste: [[2408.10541]] The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution(https://arxiv.org/abs/2408.10541)
Keywords: transformer, segmentation
Abstract: Referring Video Object Segmentation is an emerging multi-modal task that aims to segment objects in the video given a natural language expression. In this work, we build two instance-centric models and fuse predicted results from frame-level and instance-level. First, we introduce instance mask into the DETR-based model for query initialization to achieve temporal enhancement and employ SAM for spatial refinement. Secondly, we build an instance retrieval model conducting binary instance mask classification whether the instance is referred. Finally, we fuse predicted results and our method achieved a score of 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing the final ranking of 3rd place in the 6-th LSVOS Challenge RVOS Track.

Title: Diff-PCC: Diffusion-based Neural Compression for 3D Point Clouds

Authors: Kai Liu, Kang You, Pan Gao
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2408.10543
Pdf URL: https://arxiv.org/pdf/2408.10543
Copy Paste: [[2408.10543]] Diff-PCC: Diffusion-based Neural Compression for 3D Point Clouds(https://arxiv.org/abs/2408.10543)
Keywords: diffusion, generative
Abstract: Stable diffusion networks have emerged as a groundbreaking development for their ability to produce realistic and detailed visual content. This characteristic renders them ideal decoders, capable of producing high-quality and aesthetically pleasing reconstructions. In this paper, we introduce the first diffusion-based point cloud compression method, dubbed Diff-PCC, to leverage the expressive power of the diffusion model for generative and aesthetically superior decoding. Different from the conventional autoencoder fashion, a dual-space latent representation is devised in this paper, in which a compressor composed of two independent encoding backbones is considered to extract expressive shape latents from distinct latent spaces. At the decoding side, a diffusion-based generator is devised to produce high-quality reconstructions by considering the shape latents as guidance to stochastically denoise the noisy point clouds. Experiments demonstrate that the proposed Diff-PCC achieves state-of-the-art compression performance (e.g., 7.711 dB BD-PSNR gains against the latest G-PCC standard at ultra-low bitrate) while attaining superior subjective quality. Source code will be made publicly available.

Title: Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

Authors: Yucheng Ruan, Xiang Lan, Jingying Ma, Yizhi Dong, Kai He, Mengling Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10548
Pdf URL: https://arxiv.org/pdf/2408.10548
Copy Paste: [[2408.10548]] Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution(https://arxiv.org/abs/2408.10548)
Keywords: robust, transformer, large language model
Abstract: Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques including widely-adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional Pre-training/Pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. GitHub page associated with this survey is available at: this https URL.

Title: Target-Prompt Online Graph Collaborative Learning for Temporal QoS Prediction

Authors: Shengxiang Hu, Guobing Zou, Song Yang, Shiyi Lin, Bofeng Zhang, Yixin Chen
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2408.10555
Pdf URL: https://arxiv.org/pdf/2408.10555
Copy Paste: [[2408.10555]] Target-Prompt Online Graph Collaborative Learning for Temporal QoS Prediction(https://arxiv.org/abs/2408.10555)
Keywords: extraction, transformer
Abstract: In service-oriented architecture, accurately predicting the Quality of Service (QoS) is vital for maintaining reliability and enhancing user satisfaction. However, current methods often neglect high-order latent collaborative relationships and fail to dynamically adjust feature learning for specific user-service invocations, which are critical for precise feature extraction. Moreover, relying on RNNs to capture QoS evolution limits the ability to detect long-term trends due to challenges in managing long-range dependencies. To address these issues, we propose the Target-Prompt Online Graph Collaborative Learning (TOGCL) framework for temporal QoS prediction. It leverages a dynamic user-service invocation graph to comprehensively model historical interactions. Building on this graph, it develops a target-prompt graph attention network to extract online deep latent features of users and services at each time slice, considering implicit target-neighboring collaborative relationships and historical QoS values. Additionally, a multi-layer Transformer encoder is employed to uncover temporal feature evolution patterns, enhancing temporal QoS prediction. Extensive experiments on the WS-DREAM dataset demonstrate that TOGCL significantly outperforms state-of-the-art methods across multiple metrics, achieving improvements of up to 38.80%. These results underscore the effectiveness of TOGCL for temporal QoS prediction.

Title: Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation

Authors: Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10557
Pdf URL: https://arxiv.org/pdf/2408.10557
Copy Paste: [[2408.10557]] Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation(https://arxiv.org/abs/2408.10557)
Keywords: robust
Abstract: Speech modeling methods learn one embedding for a fixed segment of speech, typically between 10 and 25 ms. The information present in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). These two are orthogonal in nature, which causes the optimization algorithm to find a sub-optimal solution if it is forced to optimize both together. This leads to sub-optimal performance in one or all downstream tasks, as shown by previous studies. Current self-supervised learning (SSL) methods such as HuBERT are very good at modeling the content information present in speech. Data augmentation improves the performance on tasks which require effective modeling of other information, but this leads to a divided capacity of the model. In this work, we conduct a preliminary study to understand the importance of modeling other information using separate learnable parameters. We propose a modified version of HuBERT, termed Other HuBERT (O-HuBERT), to test our hypothesis. Our findings are twofold: first, the O-HuBERT method is able to utilize all layers to build complex features to encode other information; second, a robust data augmentation strategy is essential for learning the information required by tasks that depend on other information and for achieving state-of-the-art (SOTA) performance on the SUPERB benchmark with a similarly sized model (100 million parameters) and pre-training data (960 hours).

Title: Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models

Authors: Cong Wan, Yuhang He, Xiang Song, Yihong Gong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10571
Pdf URL: https://arxiv.org/pdf/2408.10571
Copy Paste: [[2408.10571]] Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models(https://arxiv.org/abs/2408.10571)
Keywords: privacy, protect, defense, attack, diffusion
Abstract: Diffusion models have revolutionized customized text-to-image generation, allowing for efficient synthesis of photos from personal data with textual descriptions. However, these advancements bring forth risks including privacy breaches and unauthorized replication of artworks. Previous research primarily centers on using prompt-specific methods to generate adversarial examples to protect personal images, yet the effectiveness of existing methods is hindered by constrained adaptability to different prompts. In this paper, we introduce a Prompt-Agnostic Adversarial Perturbation (PAP) method for customized diffusion models. PAP first models the prompt distribution using a Laplace Approximation, and then produces prompt-agnostic perturbations by maximizing a disturbance expectation based on the modeled distribution. This approach effectively tackles the prompt-agnostic attacks, leading to improved defense stability. Extensive experiments in face privacy and artistic style protection demonstrate the superior generalization of our method in comparison to existing techniques.

Title: Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Authors: Junhao Chen, Bowen Wang, Zhouqiang Jiang, Yuta Nakashima
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10573
Pdf URL: https://arxiv.org/pdf/2408.10573
Copy Paste: [[2408.10573]] Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter(https://arxiv.org/abs/2408.10573)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated significant capabilities, particularly in the domain of question answering (QA). However, their effectiveness in QA is often undermined by the vagueness of user questions. To address this issue, we introduce single-round instance-level prompt optimization, referred to as question rewriter. By enhancing the intelligibility of human questions for black-box LLMs, our question rewriter improves the quality of generated answers. The rewriter is optimized using direct preference optimization based on feedback collected from automatic criteria for evaluating generated answers; therefore, its training does not require costly human annotations. The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. This paper provides a practical framework for training question rewriters and sets a precedent for future explorations in prompt optimization within LFQA tasks. Code is available at this https URL.
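
As a rough illustration of the direct preference optimization objective mentioned in the abstract above, the sketch below computes the standard DPO loss for a single preference pair. It is not the authors' training code: the function name, argument names, and the `beta` default are hypothetical, and the log-probabilities are assumed to be already summed over the tokens of each candidate rewrite.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under a frozen reference model (ref_logp_*).
    All names and the beta default are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
improved = dpo_loss(-3.0, -7.0, -5.0, -7.0)
```

In the setting the abstract describes, the chosen and rejected samples would be rewritten questions ranked by the automatic answer-quality criteria, which is why no human annotation is needed.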

Title: Multi-view Hand Reconstruction with a Point-Embedded Transformer

Authors: Lixin Yang, Licheng Zhong, Pengxiang Zhu, Xinyu Zhan, Junxiao Kong, Jian Xu, Cewu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10581
Pdf URL: https://arxiv.org/pdf/2408.10581
Copy Paste: [[2408.10581]] Multi-view Hand Reconstruction with a Point-Embedded Transformer(https://arxiv.org/abs/2408.10581)
Keywords: transformer
Abstract: This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. The advances of the POEM model consist of two main aspects. First, concerning the modeling of the problem, we propose embedding a static basis point within the multi-view stereo space. A point represents a natural form of 3D information and serves as an ideal medium for fusing features across different views, given its varied projections across these views. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encompass the hand. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and employ randomization in the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands. The model and source code are available at this https URL.

Title: An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs

Authors: Eui Jun Hwang, Sukmin Cho, Junmyeong Lee, Jong C. Park
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2408.10593
Pdf URL: https://arxiv.org/pdf/2408.10593
Copy Paste: [[2408.10593]] An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs(https://arxiv.org/abs/2408.10593)
Keywords: large language model
Abstract: Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses. Recently, Large Language Models (LLMs) have shown remarkable translation performance in gloss-free methods by harnessing their powerful natural language generation capabilities. However, these methods often rely on domain-specific fine-tuning of visual encoders to achieve optimal results. By contrast, this paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language. With this in mind, we introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework. The core idea of SpaMo is simple yet effective. We first extract spatial and motion features using off-the-shelf visual encoders and then input these features into an LLM with a language prompt. Additionally, we employ a visual-text alignment process as a warm-up before the SLT supervision. Our experiments demonstrate that SpaMo achieves state-of-the-art performance on two popular datasets, PHOENIX14T and How2Sign.

Title: MV-MOS: Multi-View Feature Fusion for 3D Moving Object Segmentation

Authors: Jintao Cheng, Xingming Chen, Jinxin Liang, Xiaoyu Tang, Xieyuanli Chen, Dachuan Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10602
Pdf URL: https://arxiv.org/pdf/2408.10602
Copy Paste: [[2408.10602]] MV-MOS: Multi-View Feature Fusion for 3D Moving Object Segmentation(https://arxiv.org/abs/2408.10602)
Keywords: segmentation
Abstract: Effectively summarizing dense 3D point cloud data and extracting motion information of moving objects (moving object segmentation, MOS) is crucial to autonomous driving and robotics applications. How to effectively utilize motion and semantic features and avoid information loss during 3D-to-2D projection is still a key challenge. In this paper, we propose a novel multi-view MOS model (MV-MOS) by fusing motion-semantic features from different 2D representations of point clouds. To effectively exploit complementary information, the motion branches of the proposed model combine motion features from both bird's eye view (BEV) and range view (RV) representations. In addition, a semantic branch is introduced to provide supplementary semantic features of moving objects. Finally, a Mamba module is utilized to fuse the semantic features with motion features and provide effective guidance for the motion branches. We validated the effectiveness of the proposed multi-branch fusion MOS framework via comprehensive experiments, and our proposed model outperforms existing state-of-the-art models on the SemanticKITTI benchmark.

Title: MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Authors: Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10605
Pdf URL: https://arxiv.org/pdf/2408.10605
Copy Paste: [[2408.10605]] MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration(https://arxiv.org/abs/2408.10605)
Keywords: diffusion
Abstract: Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, MUSES addresses this challenging task by developing a progressive workflow with three key components: (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, and (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of complex 3D spatial relationships of multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world.

Title: Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Authors: Yongxin Deng (1), Xihe Qiu (1), Xiaoyu Tan (2), Jing Pan (3), Chen Jue (1), Zhijun Fang (4), Yinghui Xu (5), Wei Chu (2), Yuan Qi (5) ((1) Shanghai University of Engineering Science, (2) INF Technology (Shanghai) Co., Ltd., (3) Monash University, (4) Donghua University, (5) Fudan University)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10608
Pdf URL: https://arxiv.org/pdf/2408.10608
Copy Paste: [[2408.10608]] Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory(https://arxiv.org/abs/2408.10608)
Keywords: attack, large language model
Abstract: Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model's weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs relevant knowledge triples and expunges bias information from LLMs using model editing techniques. Through extensive experimentation, we have confirmed the presence of the implicit bias problem in LLMs and demonstrated the effectiveness of our BTBR approach.

Title: PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

Authors: Yan Wu, Esther Wershof, Sebastian M Schmon, Marcel Nassar, Błażej Osiński, Ridvan Eksi, Kun Zhang, Thore Graepel
Subjects: cs.LG, q-bio.GN, stat.ML
Abstract URL: https://arxiv.org/abs/2408.10609
Pdf URL: https://arxiv.org/pdf/2408.10609
Copy Paste: [[2408.10609]] PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis(https://arxiv.org/abs/2408.10609)
Keywords: robust, fair
Abstract: We present a comprehensive framework for predicting the effects of perturbations in single cells, designed to standardize benchmarking in this rapidly evolving field. Our framework, PerturBench, includes a user-friendly platform, diverse datasets, metrics for fair model comparison, and detailed performance analysis. Extensive evaluations of published and baseline models reveal limitations like mode or posterior collapse, and underscore the importance of rank metrics that assess the ordering of perturbations alongside traditional measures like RMSE. Our findings show that simple models can outperform more complex approaches. This benchmarking exercise sets new standards for model evaluation, supports robust model development, and advances the potential of these models to use high-throughput and high-content genetic and chemical screens for disease target discovery.

Title: Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information

Authors: Ming Jiang, Tingting Huang, Biao Guo, Yao Lu, Feng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10615
Pdf URL: https://arxiv.org/pdf/2408.10615
Copy Paste: [[2408.10615]] Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information(https://arxiv.org/abs/2408.10615)
Keywords: robust, large language model
Abstract: In recent years, large language models (LLMs) have garnered significant attention due to their superior performance in complex reasoning tasks. However, recent studies show that their reasoning capabilities may diminish markedly when problem descriptions contain irrelevant information, even with the use of advanced prompting techniques. To further investigate this issue, a dataset of primary school mathematics problems containing irrelevant information, named GSMIR, was constructed. Testing prominent LLMs and prompting techniques on this dataset revealed that while LLMs can identify irrelevant information, they do not effectively mitigate the interference it causes once identified. A novel automatic construction method, ATF, which enhances the ability of LLMs to identify and self-mitigate the influence of irrelevant information, is proposed to address this shortcoming. This method operates in two steps: first, analysis of irrelevant information, followed by its filtering. The ATF method, as demonstrated by experimental results, significantly improves the reasoning performance of LLMs and prompting techniques, even in the presence of irrelevant information on the GSMIR dataset.

Title: Novel Change Detection Framework in Remote Sensing Imagery Using Diffusion Models and Structural Similarity Index (SSIM)

Authors: Andrew Kiruluta, Eric Lundy, Andreas Lemos
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2408.10619
Pdf URL: https://arxiv.org/pdf/2408.10619
Copy Paste: [[2408.10619]] Novel Change Detection Framework in Remote Sensing Imagery Using Diffusion Models and Structural Similarity Index (SSIM)(https://arxiv.org/abs/2408.10619)
Keywords: robust, diffusion, generative
Abstract: Change detection is a crucial task in remote sensing, enabling the monitoring of environmental changes, urban growth, and disaster impact. Conventional change detection techniques, such as image differencing and ratioing, often struggle with noise and fail to capture complex variations in imagery. Recent advancements in machine learning, particularly generative models like diffusion models, offer new opportunities for enhancing change detection accuracy. In this paper, we propose a novel change detection framework that combines the strengths of Stable Diffusion models with the Structural Similarity Index (SSIM) to create robust and interpretable change maps. Our approach, named Diffusion Based Change Detector, is evaluated on both synthetic and real-world remote sensing datasets and compared with state-of-the-art methods. The results demonstrate that our method significantly outperforms traditional differencing techniques and recent deep learning-based methods, particularly in scenarios with complex changes and noise.
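
To make the SSIM half of the approach above concrete, here is a minimal sketch of the Structural Similarity Index computed over a single window, with `1 - SSIM` used as a change score. It does not reproduce the paper's pipeline (no diffusion model is involved), and real implementations usually slide a local window over the image rather than using one global window. The function names and the flat-list patch representation are assumptions made for illustration; the constants 0.01 and 0.03 are the conventional SSIM defaults.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized lists of pixel intensities."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    # Stabilizing constants from the usual SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def change_score(before, after):
    """Higher values indicate more structural change between two patches."""
    return 1.0 - global_ssim(before, after)


# Hypothetical 2x2 patches flattened to lists (8-bit intensities).
patch_before = [10.0, 200.0, 30.0, 180.0]
patch_same = list(patch_before)
patch_changed = [200.0, 10.0, 180.0, 30.0]

unchanged = change_score(patch_before, patch_same)
changed = change_score(patch_before, patch_changed)
```

Because SSIM compares local means, variances, and covariance rather than raw pixel differences, a score like this is less sensitive to uniform brightness shifts than plain image differencing.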

Title: TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles

Authors: Tong Wang, Xiaochao Qu, Ting Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10623
Pdf URL: https://arxiv.org/pdf/2408.10623
Copy Paste: [[2408.10623]] TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles(https://arxiv.org/abs/2408.10623)
Keywords: diffusion, generative
Abstract: Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original. Given an image, a target area, and target text, the task produces an output image with the target text in the selected area, replacing the original. This task has been studied extensively, with initial success using Generative Adversarial Networks (GANs) to balance text fidelity and style similarity. However, GAN-based methods struggled with complex backgrounds or text styles. Recent works leverage diffusion models, showing improved results, yet still face challenges, especially with non-Latin languages like CJK characters (Chinese, Japanese, Korean) that have complex glyphs, often producing inaccurate or unrecognizable characters. To address these issues, we present TextMastero, a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs). TextMastero introduces two key modules: a glyph conditioning module for fine-grained content control in generating accurate texts, and a latent guidance module for providing comprehensive style information to ensure similarity before and after editing. Both qualitative and quantitative experiments demonstrate that our method surpasses all known existing works in text fidelity and style similarity.

Title: WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Authors: Yonggan Wu, Ling-Chao Meng, Yuan Zichao, Sixian Chan, Hong-Qiang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10624
Pdf URL: https://arxiv.org/pdf/2408.10624
Copy Paste: [[2408.10624]] WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2408.10624)
Keywords: extraction
Abstract: For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining. They often focus solely on mining singular dimensions like spatial or channel, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to the low computational complexity design, separate MIIM can be positioned in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.

Title: Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?

Authors: Chen Liang, Qiang Guo, Xiaochao Qu, Luoqi Liu, Ting Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10627
Pdf URL: https://arxiv.org/pdf/2408.10627
Copy Paste: [[2408.10627]] Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?(https://arxiv.org/abs/2408.10627)
Keywords: segmentation
Abstract: Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. This leads to inconsistent segmentation results across frames. To address these issues, we propose a training strategy, Masked Video Consistency (MVC), which enhances spatial and temporal feature aggregation. MVC randomly masks image patches, compelling the network to predict the entire semantic segmentation, thus improving contextual information integration. Additionally, we introduce Object Masked Attention (OMA) to optimize the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling capabilities. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.

Title: Finding the DeepDream for Time Series: Activation Maximization for Univariate Time Series

Authors: Udo Schlegel, Daniel A. Keim, Tobias Sutter
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10628
Pdf URL: https://arxiv.org/pdf/2408.10628
Copy Paste: [[2408.10628]] Finding the DeepDream for Time Series: Activation Maximization for Univariate Time Series(https://arxiv.org/abs/2408.10628)
Keywords: interpretability
Abstract: Understanding how models process and interpret time series data remains a significant challenge in deep learning to enable applicability in safety-critical areas such as healthcare. In this paper, we introduce Sequence Dreaming, a technique that adapts Activation Maximization to analyze sequential information, aiming to enhance the interpretability of neural networks operating on univariate time series. By leveraging this method, we visualize the temporal dynamics and patterns most influential in model decision-making processes. To counteract the generation of unrealistic or excessively noisy sequences, we enhance Sequence Dreaming with a range of regularization techniques, including exponential smoothing. This approach ensures the production of sequences that more accurately reflect the critical features identified by the neural network. Our approach is tested on a time series classification dataset encompassing applications in predictive maintenance. The results show that our proposed Sequence Dreaming approach demonstrates targeted activation maximization for different use cases so that either centered class or border activation maximization can be generated. The results underscore the versatility of Sequence Dreaming in uncovering salient temporal features learned by neural networks, thereby advancing model transparency and trustworthiness in decision-critical domains.
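
The exponential smoothing named above as one of the regularizers can be sketched as follows. This shows only simple exponential smoothing applied to a sequence; how the paper combines it with the activation-maximization update is not stated in the abstract, so the function name and the idea of applying it to a candidate sequence are assumptions.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in series:
        if not smoothed:
            # The first smoothed value is initialized to the first observation.
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


# A noisy, oscillating candidate sequence becomes visibly calmer.
noisy = [0.0, 10.0, 0.0, 10.0]
calm = exponential_smoothing(noisy, alpha=0.5)  # [0.0, 5.0, 2.5, 6.25]
```

Smaller values of `alpha` give heavier smoothing, which suppresses the high-frequency noise that unconstrained activation maximization tends to produce.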

Title: LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Authors: Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2408.10631
Pdf URL: https://arxiv.org/pdf/2408.10631
Copy Paste: [[2408.10631]] LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models(https://arxiv.org/abs/2408.10631)
Keywords: large language model
Abstract: Large language models (LLMs) have grown significantly in scale, leading to a critical need for efficient model pruning techniques. Existing post-training pruning techniques primarily focus on measuring weight importance on converged dense models to determine salient weights to retain. However, they often overlook the changes in weight importance during the pruning process, which can lead to performance degradation in the pruned models. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, ensuring global performance optimization. Inspired by the recent discovery of prominent outliers in LLMs, LLM-Barber introduces an innovative pruning metric that identifies weight importance using weights multiplied by gradients. Our experiments show that LLM-Barber can efficiently prune models like LLaMA and OPT families with 7B to 13B parameters on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at this https URL.
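
The pruning metric described above, weights multiplied by gradients, can be illustrated on a toy weight vector. This is not the LLM-Barber algorithm (there is no block-aware error optimization or mask rebuilding here); taking the absolute value of the product and pruning the lowest-scoring fraction are assumptions made for the sketch, and all names are hypothetical.

```python
def importance_scores(weights, grads):
    """Score each weight by |weight * gradient| (absolute value is an assumption)."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def sparsity_mask(weights, grads, sparsity):
    """Return a 0/1 mask that prunes the lowest-scoring fraction of weights."""
    scores = importance_scores(weights, grads)
    n_prune = int(len(scores) * sparsity)
    # Indices sorted from least to most important.
    order = sorted(range(len(scores)), key=scores.__getitem__)
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


# Toy values: the smallest-magnitude weight (0.1) has a large gradient,
# so a weight-times-gradient score keeps it while magnitude pruning would not.
weights = [0.5, -2.0, 0.1, 1.0]
grads = [0.1, 0.01, 3.0, -0.5]
mask = sparsity_mask(weights, grads, sparsity=0.5)  # [0, 0, 1, 1]
```

The example shows why such a metric can disagree with magnitude-only pruning: a large weight with a near-zero gradient (here `-2.0`) is scored as unimportant.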

Title: Interactive Counterfactual Generation for Univariate Time Series

Authors: Udo Schlegel, Julius Rauscher, Daniel A. Keim
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2408.10633
Pdf URL: https://arxiv.org/pdf/2408.10633
Copy Paste: [[2408.10633]] Interactive Counterfactual Generation for Univariate Time Series(https://arxiv.org/abs/2408.10633)
Keywords: interpretability
Abstract: We propose an interactive methodology for generating counterfactual explanations for univariate time series data in classification tasks by leveraging 2D projections and decision boundary maps to tackle interpretability challenges. Our approach aims to enhance the transparency and understanding of deep learning models' decision processes. The application simplifies the time series data analysis by enabling users to interactively manipulate projected data points, providing intuitive insights through inverse projection techniques. By abstracting user interactions with the projected data points rather than the raw time series data, our method facilitates an intuitive generation of counterfactual explanations. This approach allows for a more straightforward exploration of univariate time series data, enabling users to manipulate data points to comprehend potential outcomes of hypothetical scenarios. We validate this method using the ECG5000 benchmark dataset, demonstrating significant improvements in interpretability and user understanding of time series classification. The results indicate a promising direction for enhancing explainable AI, with potential applications in various domains requiring transparent and interpretable deep learning models. Future work will explore the scalability of this method to multivariate time series data and its integration with other interpretability techniques.

Title: Industry Perception of Security Challenges with Identity Access Management Solutions

Authors: Abhishek Pratap Singh, Ievgeniia Kuzminykh, Bogdan Ghita
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.10634
Pdf URL: https://arxiv.org/pdf/2408.10634
Copy Paste: [[2408.10634]] Industry Perception of Security Challenges with Identity Access Management Solutions(https://arxiv.org/abs/2408.10634)
Keywords: secure, security
Abstract: Identity Access Management (IAM) is an area posing significant challenges, particularly in the context of remote connectivity and distributed or cloud-based systems. A wide range of technical solutions have been proposed by prior research, but the integration of these solutions in the commercial sector represents a step that significantly hampers their acceptance. The study aims to outline the current perception and security issues associated with IAM solutions from the perspective of the beneficiaries. The analysis relies on a series of interviews with 45 cyber security professionals from different organisations all over the world. As the results showed, cloud IAM solutions and on-premises IAM solutions are affected by different issues. The main challenges for cloud-based IAM solutions were default configurations, poor management of non-human identities such as service accounts, poor certificate management, poor API configuration, and limited log analysis. In contrast, the challenges for on-premises solutions were multi-factor authentication, insecure default configurations, lack of the skillsets required to manage an IAM solution securely, poor password policies, unpatched vulnerabilities, and compromise of single sign-on leading to compromise of multiple entities. The study also determined that, regardless of the evolving functionality of cloud-based IAM solutions, 41% of respondents believe that on-premises solutions are more secure than cloud-based ones. As pointed out by the respondents, cloud IAM may potentially expose organisations to a wider range of vulnerabilities due to the complexity of the underlying solutions, challenges with managing permissions, and compliance with dynamic IAM policies.

Title: Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs

Authors: Maxim Ifergan, Leshem Choshen, Roee Aharoni, Idan Szpektor, Omri Abend
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10646
Pdf URL: https://arxiv.org/pdf/2408.10646
Copy Paste: [[2408.10646]] Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs(https://arxiv.org/abs/2408.10646)
Keywords: robust
Abstract: The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model's ability to answer a query consistently across languages, and the ability to "store" answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could increase by up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.

Title: Privacy-preserving Universal Adversarial Defense for Black-box Models

Authors: Qiao Li, Cong Wu, Jing Chen, Zijun Zhang, Kun He, Ruiying Du, Xinxin Wang, Qingchuang Zhao, Yang Liu
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2408.10647
Pdf URL: https://arxiv.org/pdf/2408.10647
Copy Paste: [[2408.10647]] Privacy-preserving Universal Adversarial Defense for Black-box Models(https://arxiv.org/abs/2408.10647)
Keywords: privacy, defense, attack, robust, membership infer
Abstract: Deep neural networks (DNNs) are increasingly used in critical applications such as identity authentication and autonomous driving, where robustness against adversarial attacks is crucial. These attacks can exploit minor perturbations to cause significant prediction errors, making it essential to enhance the resilience of DNNs. Traditional defense methods often rely on access to detailed model information, which raises privacy concerns, as model owners may be reluctant to share such data. In contrast, existing black-box defense methods fail to offer a universal defense against various types of adversarial attacks. To address these challenges, we introduce DUCD, a universal black-box defense method that does not require access to the target model's parameters or architecture. Our approach involves distilling the target model by querying it with data, creating a white-box surrogate while preserving data privacy. We further enhance this surrogate model using a certified defense based on randomized smoothing and optimized noise selection, enabling robust defense against a broad range of adversarial attacks. Comparative evaluations between the certified defenses of the surrogate and target models demonstrate the effectiveness of our approach. Experiments on multiple image classification datasets show that DUCD not only outperforms existing black-box defenses but also matches the accuracy of white-box defenses, all while enhancing data privacy and reducing the success rate of membership inference attacks.
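
Illustrative note (not from the paper): the abstract above mentions a certified defense based on randomized smoothing. The sketch below shows only the generic majority-vote prediction step of randomized smoothing, assuming a hypothetical `toy_classifier` and an arbitrary noise level `sigma`. It does not reproduce DUCD's distillation, noise optimization, or certification procedure.

```python
import random
from collections import Counter


def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=200, seed=0):
    """Majority vote of a classifier over Gaussian-perturbed copies of x.

    base_classifier, sigma, n_samples and seed are illustrative assumptions.
    Returns the most frequent label and the fraction of votes it received.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Add independent Gaussian noise to every input coordinate.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def toy_classifier(x):
    """Hypothetical stand-in for a surrogate model: sign of the feature sum."""
    return int(sum(x) > 0.0)


if __name__ == "__main__":
    label, vote_share = smoothed_predict(toy_classifier, [1.0, 1.0])
    print(label, vote_share)
```

A high vote share for the top label is what certification procedures typically build on; the bound itself is omitted here.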

Title: Smart Contract Coordinated Privacy Preserving Crowd-Sensing Campaigns

Authors: Luca Bedogni, Stefano Ferretti
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2408.10648
Pdf URL: https://arxiv.org/pdf/2408.10648
Copy Paste: [[2408.10648]] Smart Contract Coordinated Privacy Preserving Crowd-Sensing Campaigns(https://arxiv.org/abs/2408.10648)
Keywords: secure, security, privacy
Abstract: Crowd-sensing has emerged as a powerful data retrieval model, enabling diverse applications by leveraging active user participation. However, data availability and privacy concerns pose significant challenges. Traditional methods like data encryption and anonymization, while essential, may not fully address these issues. For instance, in sparsely populated areas, anonymized data can still be traced back to individual users. Additionally, the volume of data generated by users can reveal their identities. To develop credible crowd-sensing systems, data must be anonymized, aggregated and separated into uniformly sized chunks. Furthermore, decentralizing the data management process, rather than relying on a single server, can enhance security and trust. This paper proposes a system utilizing smart contracts and blockchain technologies to manage crowd-sensing campaigns. The smart contract handles user subscriptions, data encryption, and decentralized storage, creating a secure data marketplace. Incentive policies within the smart contract encourage user participation and data diversity. Simulation results confirm the system's viability, highlighting the importance of user participation for data credibility and the impact of geographical data scarcity on rewards. This approach aims to balance data origin and reduce cheating risks.

Title: Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Authors: Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10652
Pdf URL: https://arxiv.org/pdf/2408.10652
Copy Paste: [[2408.10652]] Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant(https://arxiv.org/abs/2408.10652)
Keywords: segmentation
Abstract: Most recent 3D instance segmentation methods are open vocabulary, offering a greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available.

Title: UIE-UnFold: Deep Unfolding Network with Color Priors and Vision Transformer for Underwater Image Enhancement

Authors: Yingtie Lei, Jia Yu, Yihang Dong, Changwei Gong, Ziyang Zhou, Chi-Man Pun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10653
Pdf URL: https://arxiv.org/pdf/2408.10653
Copy Paste: [[2408.10653]] UIE-UnFold: Deep Unfolding Network with Color Priors and Vision Transformer for Underwater Image Enhancement(https://arxiv.org/abs/2408.10653)
Keywords: transformer
Abstract: Underwater image enhancement (UIE) plays a crucial role in various marine applications, but it remains challenging due to the complex underwater environment. Current learning-based approaches frequently lack explicit incorporation of prior knowledge about the physical processes involved in underwater image formation, resulting in limited optimization despite their impressive enhancement results. This paper proposes a novel deep unfolding network (DUN) for UIE that integrates color priors and inter-stage feature transformation to improve enhancement performance. The proposed DUN model combines the iterative optimization and reliability of model-based methods with the flexibility and representational power of deep learning, offering a more explainable and stable solution compared to existing learning-based UIE approaches. The proposed model consists of three key components: a Color Prior Guidance Block (CPGB) that establishes a mapping between color channels of degraded and original images, a Nonlinear Activation Gradient Descent Module (NAGDM) that simulates the underwater image degradation process, and an Inter Stage Feature Transformer (ISF-Former) that facilitates feature exchange between different network stages. By explicitly incorporating color priors and modeling the physical characteristics of underwater image formation, the proposed DUN model achieves more accurate and reliable enhancement results. Extensive experiments on multiple underwater image datasets demonstrate the superiority of the proposed model over state-of-the-art methods in both quantitative and qualitative evaluations. The proposed DUN-based approach offers a promising solution for UIE, enabling more accurate and reliable scientific analysis in marine research. The code is available at this https URL.

Title: ETGuard: Malicious Encrypted Traffic Detection in Blockchain-based Power Grid Systems

Authors: Peng Zhou, Yongdong Liu, Lixun Ma, Weiye Zhang, Haohan Tan, Zhenguang Liu, Butian Huang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10657
Pdf URL: https://arxiv.org/pdf/2408.10657
Copy Paste: [[2408.10657]] ETGuard: Malicious Encrypted Traffic Detection in Blockchain-based Power Grid Systems(https://arxiv.org/abs/2408.10657)
Keywords: attack
Abstract: The escalating prevalence of encryption protocols has led to a concomitant surge in the number of malicious attacks that hide in encrypted traffic. Power grid systems, as fundamental infrastructure, are becoming prime targets for such attacks. Conventional methods for detecting malicious encrypted packets typically use a static pre-trained model. We observe that these methods are not well-suited for blockchain-based power grid systems. More critically, they fall short in dynamic environments where new types of encrypted attacks continuously emerge. Motivated by this, in this paper we try to tackle these challenges from two aspects: (1) We present a novel framework that is able to automatically detect malicious encrypted traffic in blockchain-based power grid systems and incrementally learn from new malicious traffic. (2) We mathematically derive incremental learning losses to resist the forgetting of old attack patterns while ensuring the model is capable of handling new encrypted attack patterns. Empirically, our method achieves state-of-the-art performance on three different benchmark datasets. We also constructed the first malicious encrypted traffic dataset for blockchain-based power grid scenario. Our code and dataset are available at this https URL, hoping to inspire future research.

Title: REInstruct: Building Instruction Data from Unlabeled Corpus

Authors: Shu Chen, Xinyan Guan, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10663
Pdf URL: https://arxiv.org/pdf/2408.10663
Copy Paste: [[2408.10663]] REInstruct: Building Instruction Data from Unlabeled Corpus(https://arxiv.org/abs/2408.10663)
Keywords: robust, large language model
Abstract: Manually annotating instruction data for large language models is difficult, costly, and hard to scale. Meanwhile, current automatic annotation methods typically rely on distilling synthetic data from proprietary LLMs, which not only limits the upper bound of the quality of the instruction data but also raises potential copyright issues. In this paper, we propose REInstruct, a simple and scalable method to automatically build instruction data from an unlabeled corpus without heavy reliance on proprietary LLMs and human annotation. Specifically, REInstruct first selects a subset of unlabeled texts that potentially contain well-structured helpful and insightful content and then generates instructions for these texts. To generate accurate and relevant responses for effective and robust training, REInstruct further proposes a rewriting-based approach to improve the quality of the generated instruction data. By training Llama-7b on a combination of 3k seed data and 32k synthetic data from REInstruct, the fine-tuned model achieves a 65.41% win rate on the AlpacaEval leaderboard against text-davinci-003, outperforming other open-source, non-distilled instruction data construction methods. The code is publicly available at this https URL.

Title: Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions

Authors: Mirko Nardi, Lorenzo Valerio, Andrea Passarella
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10664
Pdf URL: https://arxiv.org/pdf/2408.10664
Copy Paste: [[2408.10664]] Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions(https://arxiv.org/abs/2408.10664)
Keywords: privacy, robust, federate
Abstract: Federated Learning (FL) is a pivotal approach in decentralized machine learning, especially when data privacy is crucial and direct data sharing is impractical. While FL is typically associated with supervised learning, its potential in unsupervised scenarios is underexplored. This paper introduces a novel unsupervised federated learning methodology designed to identify the complete set of categories (global K) across multiple clients within label-free, non-uniform data distributions, a process known as Federated Clustering. Our approach, Federated Cluster-Wise Refinement (FedCRef), involves clients that collaboratively train models on clusters with similar data distributions. Initially, clients with diverse local data distributions (local K) train models on their clusters to generate compressed data representations. These local models are then shared across the network, enabling clients to compare them through reconstruction error analysis, leading to the formation of federated groups. Within these groups, clients collaboratively train a shared model representing each data distribution, while continuously refining their local clusters to enhance data association accuracy. This iterative process allows our system to identify all potential data distributions across the network and develop robust representation models for each. To validate our approach, we compare it with traditional centralized methods, establishing a performance baseline and showcasing the advantages of our distributed solution. We also conduct experiments on the EMNIST and KMNIST datasets, demonstrating FedCRef's ability to refine and align cluster models with actual data distributions, significantly improving data representation precision in unsupervised federated settings.

Title: Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Authors: Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10668
Pdf URL: https://arxiv.org/pdf/2408.10668
Copy Paste: [[2408.10668]] Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation(https://arxiv.org/abs/2408.10668)
Keywords: secure, attack, large language model
Abstract: Large Language Models (LLMs) are implicit troublemakers. While they provide valuable insights and assist in problem-solving, they can also potentially serve as a resource for malicious activities. Implementing safety alignment could mitigate the risk of LLMs generating harmful responses. We argue that: even when an LLM appears to successfully block harmful queries, there may still be hidden vulnerabilities that could act as ticking time bombs. To identify these underlying weaknesses, we propose to use a cost value model as both a detector and an attacker. Trained on external or self-generated harmful datasets, the cost value model could successfully influence the original safe LLM to output toxic content in decoding process. For instance, LLaMA-2-chat 7B outputs 39.18% concrete toxic content, along with only 22.16% refusals without any harmful suffixes. These potential weaknesses can then be exploited via prompt optimization such as soft prompts on images. We name this decoding strategy: Jailbreak Value Decoding (JVD), emphasizing that seemingly secure LLMs may not be as safe as we initially believe. They could be used to gather harmful data or launch covert attacks.

Title: Tensor tree learns hidden relational structures in data to construct generative models

Authors: Kenji Harada, Tsuyoshi Okubo, Naoki Kawashima
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2408.10669
Pdf URL: https://arxiv.org/pdf/2408.10669
Copy Paste: [[2408.10669]] Tensor tree learns hidden relational structures in data to construct generative models(https://arxiv.org/abs/2408.10669)
Keywords: generative
Abstract: Based on the tensor tree network with the Born machine framework, we propose a general method for constructing a generative model by expressing the target distribution function as the quantum wave function amplitude represented by a tensor tree. The key idea is dynamically optimizing the tree structure that minimizes the bond mutual information. The proposed method offers enhanced performance and uncovers hidden relational structures in the target data. We illustrate potential practical applications with four examples: (i) random patterns, (ii) QMNIST hand-written digits, (iii) Bayesian networks, and (iv) the stock price fluctuation pattern in S&P500. In (i) and (ii), strongly correlated variables were concentrated near the center of the network; in (iii), the causality pattern was identified; and, in (iv), a structure corresponding to the eleven sectors emerged.

Title: A Noncontact Technique for Wave Measurement Based on Thermal Stereography and Deep Learning

Authors: Deyu Li, Longfei Xiao, Handi Wei, Yan Li, Binghua Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.10670
Pdf URL: https://arxiv.org/pdf/2408.10670
Copy Paste: [[2408.10670]] A Noncontact Technique for Wave Measurement Based on Thermal Stereography and Deep Learning(https://arxiv.org/abs/2408.10670)
Keywords: generative
Abstract: The accurate measurement of the wave field and its spatiotemporal evolution is essential in many hydrodynamic experiments and engineering applications. The binocular stereo imaging technique has been widely used to measure waves. However, the optical properties of indoor water surfaces, including transparency, specular reflection, and texture absence, pose challenges for image processing and stereo reconstruction. This study proposed a novel technique that combined thermal stereography and deep learning to achieve fully noncontact wave measurements. The optical imaging properties of water in the long-wave infrared spectrum were found to be suitable for stereo matching, effectively avoiding the issues in the visible-light spectrum. After capturing wave images using thermal stereo cameras, a reconstruction strategy involving deep learning techniques was proposed to improve stereo matching performance. A generative approach was employed to synthesize a dataset with ground-truth disparity from unannotated infrared images. This dataset was then fed to a pretrained stereo neural network for fine-tuning to achieve domain adaptation. Wave flume experiments were conducted to validate the feasibility and accuracy of the proposed technique. The final reconstruction results indicated great agreement and high accuracy with a mean bias of less than 2.1% compared with the measurements obtained using wave probes, suggesting that the novel technique effectively measures the spatiotemporal distribution of wave surface in hydrodynamic experiments.

Title: Neural Exploratory Landscape Analysis

Authors: Zeyuan Ma, Jiacheng Chen, Hongshu Guo, Yue-Jiao Gong
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2408.10672
Pdf URL: https://arxiv.org/pdf/2408.10672
Copy Paste: [[2408.10672]] Neural Exploratory Landscape Analysis(https://arxiv.org/abs/2408.10672)
Keywords: robust
Abstract: Recent research in Meta-Black-Box Optimization (MetaBBO) has shown that meta-trained neural networks can effectively guide the design of black-box optimizers, significantly reducing the need for expert tuning and delivering robust performance across complex problem distributions. Despite their success, a paradox remains: MetaBBO still relies on human-crafted Exploratory Landscape Analysis features to inform the meta-level agent about the low-level optimization progress. To address the gap, this paper proposes Neural Exploratory Landscape Analysis (NeurELA), a novel framework that dynamically profiles landscape features through a two-stage, attention-based neural network, executed in an entirely end-to-end fashion. NeurELA is pre-trained over a variety of MetaBBO algorithms using a multi-task neuroevolution strategy. Extensive experiments show that NeurELA achieves consistently superior performance when integrated into different and even unseen MetaBBO tasks and can be efficiently fine-tuned for further performance boost. This advancement marks a pivotal step in making MetaBBO algorithms more autonomous and broadly applicable.

Title: Iterative Window Mean Filter: Thwarting Diffusion-based Adversarial Purification

Authors: Hanrui Wang, Ruoxi Sun, Cunjian Chen, Minhui Xue, Lay-Ki Soon, Shuo Wang, Zhe Jin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.10673
Pdf URL: https://arxiv.org/pdf/2408.10673
Copy Paste: [[2408.10673]] Iterative Window Mean Filter: Thwarting Diffusion-based Adversarial Purification(https://arxiv.org/abs/2408.10673)
Keywords: security, defense, attack, diffusion
Abstract: Face authentication systems have brought significant convenience and advanced developments, yet they have become unreliable due to their sensitivity to inconspicuous perturbations, such as adversarial attacks. Existing defenses often exhibit weaknesses when facing various attack algorithms and adaptive attacks or compromise accuracy for enhanced security. To address these challenges, we have developed a novel and highly efficient non-deep-learning-based image filter called the Iterative Window Mean Filter (IWMF) and proposed a new framework for adversarial purification, named IWMF-Diff, which integrates IWMF and denoising diffusion models. These methods can function as pre-processing modules to eliminate adversarial perturbations without necessitating further modifications or retraining of the target system. We demonstrate that our proposed methodologies fulfill four critical requirements: preserved accuracy, improved security, generalizability to various threats in different settings, and better resistance to adaptive attacks. This performance surpasses that of the state-of-the-art adversarial purification method, DiffPure.

Title: Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Authors: Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10682
Pdf URL: https://arxiv.org/pdf/2408.10682
Copy Paste: [[2408.10682]] Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models(https://arxiv.org/abs/2408.10682)
Keywords: defense, attack, robust, large language model
Abstract: LLMs have achieved success in many fields but are still troubled by problematic content in the training corpora. LLM unlearning aims at reducing its influence and avoiding undesirable behaviours. However, existing unlearning methods remain vulnerable to adversarial queries, and the unlearned knowledge resurfaces after manually designed attack queries. As part of a red-team effort to proactively assess the vulnerabilities of unlearned models, we design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack these models and evaluate their robustness. It optimizes adversarial suffixes to reintroduce the unlearned knowledge in various scenarios. We find that unlearned knowledge can be recovered in 55.2% of the questions, even without revealing the unlearned model's parameters. In response to this vulnerability, we propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process. It formulates the unlearning process as a min-max optimization problem and resolves it through two stages: an attack stage, where perturbation vectors are trained and added to the latent space of LLMs to recover the unlearned knowledge, and a defense stage, where previously trained perturbation vectors are used to enhance the unlearned model's robustness. With our LAU framework, we obtain two robust unlearning methods, AdvGA and AdvNPO. We conduct extensive experiments across multiple unlearning benchmarks and various models, and demonstrate that they improve the unlearning effectiveness by over 53.5%, cause less than an 11.6% reduction in neighboring knowledge, and have almost no impact on the model's general capabilities.

Title: Unconditional Truthfulness: Learning Conditional Dependency for Uncertainty Quantification of Large Language Models

Authors: Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing, Alexander Panchenko, Preslav Nakov, Timothy Baldwin, Maxim Panov, Artem Shelmanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10692
Pdf URL: https://arxiv.org/pdf/2408.10692
Copy Paste: [[2408.10692]] Unconditional Truthfulness: Learning Conditional Dependency for Uncertainty Quantification of Large Language Models(https://arxiv.org/abs/2408.10692)
Keywords: large language model
Abstract: Uncertainty quantification (UQ) is a promising approach to detecting Large Language Model (LLM) hallucinations and low quality output. In this work, we address one of the challenges of UQ in generation tasks that arises from the conditional dependency between the generation steps of an LLM. We propose to learn this dependency from data. We train a regression model whose target variable is the gap between the conditional and the unconditional generation confidence. During LLM inference, we use this learned conditional dependency model to modulate the uncertainty of the current generation step based on the uncertainty of the previous step. Our experimental evaluation on nine datasets and three LLMs shows that the proposed method is highly effective for uncertainty quantification, achieving substantial improvements over rivaling approaches.

Title: MsMemoryGAN: A Multi-scale Memory GAN for Palm-vein Adversarial Purification

Authors: Huafeng Qin, Yuming Fu, Huiyan Zhang, Mounim A. El-Yacoubi, Xinbo Gao, Qun Song, Jun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10694
Pdf URL: https://arxiv.org/pdf/2408.10694
Copy Paste: [[2408.10694]] MsMemoryGAN: A Multi-scale Memory GAN for Palm-vein Adversarial Purification(https://arxiv.org/abs/2408.10694)
Keywords: defense, attack
Abstract: Deep neural networks have recently achieved promising performance in the vein recognition task and have shown an increasing application trend. However, they are prone to adversarial perturbation attacks, which add imperceptible perturbations to the input and result in incorrect recognition. To address this issue, we propose a novel defense model named MsMemoryGAN, which aims to filter the perturbations from adversarial samples before recognition. First, we design a multi-scale autoencoder to achieve high-quality reconstruction and two memory modules to learn the detailed patterns of normal samples at different scales. Second, we investigate a learnable metric in the memory module to retrieve the most relevant memory items to reconstruct the input image. Finally, the perceptual loss is combined with the pixel loss to further enhance the quality of the reconstructed image. During the training phase, the MsMemoryGAN learns to reconstruct the input by merely using fewer prototypical elements of the normal patterns recorded in the memory. At the testing stage, given an adversarial sample, the MsMemoryGAN retrieves its most relevant normal patterns in memory for the reconstruction. Perturbations in the adversarial sample are usually not reconstructed well, which purifies the input of adversarial perturbations. We have conducted extensive experiments on two public vein datasets under different adversarial attack methods to evaluate the performance of the proposed approach. The experimental results show that our approach removes a wide variety of adversarial perturbations, allowing vein classifiers to achieve the highest recognition accuracy.

Title: AnyGraph: Graph Foundation Model in the Wild

Authors: Lianghao Xia, Chao Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10700
Pdf URL: https://arxiv.org/pdf/2408.10700
Copy Paste: [[2408.10700]] AnyGraph: Graph Foundation Model in the Wild(https://arxiv.org/abs/2408.10700)
Keywords: robust
Abstract: The growing ubiquity of relational data structured as graphs has underscored the need for graph learning models with exceptional generalization capabilities. However, current approaches often struggle to effectively extract generalizable insights, frequently requiring extensive fine-tuning and limiting their versatility. Graph foundation models offer a transformative solution, with the potential to learn robust, generalizable representations from graph data. This enables more effective and adaptable applications across a wide spectrum of tasks and domains. In this work, we investigate a unified graph model, AnyGraph, designed to handle key challenges: i) Structure Heterogeneity. Addressing distribution shift in graph structural information; ii) Feature Heterogeneity. Handling diverse feature representation spaces across graph datasets; iii) Fast Adaptation. Efficiently adapting the model to new graph domains; iv) Scaling Law Emergence. Enabling the model to exhibit scaling law behavior, where its performance scales favorably with the amount of data and parameter sizes. To tackle these critical challenges, we build AnyGraph upon a Graph Mixture-of-Experts (MoE) architecture. This approach empowers the model to effectively manage both the in-domain and cross-domain distribution shift concerning structure-level and feature-level heterogeneity. Furthermore, a lightweight graph expert routing mechanism is proposed to facilitate AnyGraph's fast adaptability to new data and domains. Our extensive experiments on 38 diverse graph datasets have demonstrated the strong zero-shot learning performance of AnyGraph across diverse graph domains with significant distribution shift. Furthermore, we have validated the model's fast adaptation ability and scaling law emergence, showcasing its versatility.

Title: Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

Authors: Tej Deep Pala, Vernon Y.H. Toh, Rishabh Bhardwaj, Soujanya Poria
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10701
Pdf URL: https://arxiv.org/pdf/2408.10701
Copy Paste: [[2408.10701]] Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique(https://arxiv.org/abs/2408.10701)
Keywords: attack, robust, large language model
Abstract: In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations based on their potential harm to improve the efficiency of the search for harmful mutations. Our results demonstrate that Ferret, utilizing a reward model as a scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline and generates adversarial prompts that are transferable, i.e., effective on other LLMs of larger size. Our code is available at this https URL.
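
Illustrative note (not from the paper): the abstract above describes generating several mutations per iteration and using a scoring function to rank them and keep the best one. The sketch below shows only that generic generate-rank-select step. The names `select_best_mutation`, `toy_mutate`, and `toy_score` are hypothetical placeholders, and the toy functions are benign stand-ins for a mutator model and a reward model.

```python
def select_best_mutation(prompt, mutate, score, n_mutations=4):
    """Generate n_mutations candidates from prompt and return the top-scoring one.

    mutate(prompt, i) and score(candidate) are caller-supplied callables;
    both are assumptions made for illustration only.
    """
    candidates = [mutate(prompt, i) for i in range(n_mutations)]
    # Rank candidates from highest to lowest score and keep the best.
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0]


def toy_mutate(prompt, i):
    """Benign placeholder mutator: append i exclamation marks."""
    return prompt + "!" * i


def toy_score(candidate):
    """Benign placeholder scorer: longer candidates score higher."""
    return len(candidate)


if __name__ == "__main__":
    print(select_best_mutation("hello", toy_mutate, toy_score))
```

In a real pipeline the selected candidate would feed the next search iteration; that outer loop is not shown.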

Title: Large Language Models for Multimodal Deformable Image Registration

Authors: Mingrui Ma, Weijie Wang, Jie Ning, Jianfeng He, Nicu Sebe, Bruno Lepri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10703
Pdf URL: https://arxiv.org/pdf/2408.10703
Copy Paste: [[2408.10703]] Large Language Models for Multimodal Deformable Image Registration(https://arxiv.org/abs/2408.10703)
Keywords: generative, large language model
Abstract: The challenge of Multimodal Deformable Image Registration (MDIR) lies in the conversion and alignment of features between images of different modalities. Generative models (GMs) cannot retain enough of the necessary information from the source modality to the target one, while non-GMs struggle to align features across these two modalities. In this paper, we propose a novel coarse-to-fine MDIR framework, LLM-Morph, which is applicable to various pre-trained Large Language Models (LLMs), to solve these concerns by aligning the deep features from different modal medical images. Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs. Second, we use the first adapter to adjust these tokens and use LoRA in pre-trained LLMs to fine-tune their weights, both aimed at eliminating the domain gap between the pre-trained LLMs and the MDIR task. Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task. Extensive experiments on the MR-CT Abdomen and SR-Reg Brain datasets demonstrate the effectiveness of our framework and the potential of pre-trained LLMs for the MDIR task. Our code is available at: this https URL.

Title: Variable Assignment Invariant Neural Networks for Learning Logic Programs

Authors: Yin Jun Phua, Katsumi Inoue
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10709
Pdf URL: https://arxiv.org/pdf/2408.10709
Copy Paste: [[2408.10709]] Variable Assignment Invariant Neural Networks for Learning Logic Programs(https://arxiv.org/abs/2408.10709)
Keywords: extraction
Abstract: Learning from interpretation transition (LFIT) is a framework for learning rules from observed state transitions. LFIT has been implemented in purely symbolic algorithms, but they are unable to deal with noise or generalize to unobserved transitions. Rule-extraction-based neural network methods suffer from overfitting, while more general implementations that categorize rules suffer from combinatorial explosion. In this paper, we introduce a technique to leverage the variable permutation invariance inherent in symbolic domains. Our technique ensures that the permutation and the naming of the variables do not affect the results. We demonstrate the effectiveness and the scalability of this method with various experiments. Our code is publicly available at this https URL.

Title: Coarse-to-Fine Detection of Multiple Seams for Robotic Welding

Authors: Pengkun Wei, Shuo Cheng, Dayou Li, Ran Song, Yipeng Zhang, Wei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10710
Pdf URL: https://arxiv.org/pdf/2408.10710
Copy Paste: [[2408.10710]] Coarse-to-Fine Detection of Multiple Seams for Robotic Welding(https://arxiv.org/abs/2408.10710)
Keywords: extraction
Abstract: Efficiently detecting target weld seams while ensuring sub-millimeter accuracy has always been an important challenge in autonomous welding, which has significant applications in industrial practice. Previous works mostly focused on recognizing and localizing welding seams one by one, leading to inferior efficiency in modeling the workpiece. This paper proposes a novel framework capable of extracting multiple weld seams using both RGB images and 3D point clouds. The RGB image is used to obtain the region of interest by approximately localizing the weld seams, and the point cloud is used to achieve the fine-edge extraction of the weld seams within the region of interest using region growth. Our method is further accelerated by using a pre-trained deep learning model to ensure both efficiency and generalization ability. The performance of the proposed method has been comprehensively tested on various workpieces featuring both linear and curved weld seams and in physical experiment systems. The results showcase considerable potential for real-world industrial applications, emphasizing the method's efficiency and effectiveness. Videos of the real-world experiments can be found at this https URL.

Title: MEGen: Generative Backdoor in Large Language Models via Model Editing

Authors: Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10722
Pdf URL: https://arxiv.org/pdf/2408.10722
Copy Paste: [[2408.10722]] MEGen: Generative Backdoor in Large Language Models via Model Editing(https://arxiv.org/abs/2408.10722)
Keywords: attack, robust, generative, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. Emerging as widely adopted generalists for diverse tasks, LLMs are still vulnerable to backdoors. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects. In our approach, we first leverage a language model to insert a trigger selected on fixed metrics into the input, then design a pipeline of model editing to directly embed a backdoor into an LLM. By adjusting a small set of local parameters with a mini-batch of samples, MEGen significantly enhances time efficiency and achieves high robustness. Experimental results indicate that our backdoor attack strategy achieves a high attack success rate on poison data while maintaining the model's performance on clean data. Notably, the backdoored model, when triggered, can freely output pre-set dangerous information while successfully completing downstream tasks. This suggests that future LLM applications could be guided to deliver certain dangerous information, thus altering the LLM's generative style. We believe this approach provides insights for future LLM applications and the execution of backdoor attacks on conversational AI systems.

Title: Crafting Tomorrow's Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian

Authors: Cem Üyük, Danica Rovó, Shaghayegh Kolli, Rabia Varol, Georg Groh, Daryna Dementieva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10724
Pdf URL: https://arxiv.org/pdf/2408.10724
Copy Paste: [[2408.10724]] Crafting Tomorrow's Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian(https://arxiv.org/abs/2408.10724)
Keywords: robust, transformer, large language model
Abstract: In the era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLM prompting. We present the detection results, aiming to delve into the interpretability and robustness of machine-generated text detectors across all target languages.

Title: Towards Efficient Large Language Models for Scientific Text: A Review

Authors: Huy Quoc To, Ming Liu, Guangyan Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10729
Pdf URL: https://arxiv.org/pdf/2408.10729
Copy Paste: [[2408.10729]] Towards Efficient Large Language Models for Scientific Text: A Review(https://arxiv.org/abs/2408.10729)
Keywords: large language model
Abstract: Large language models (LLMs) have ushered in a new era for processing complex information in various fields, including science. The increasing amount of scientific literature allows these models to acquire and understand scientific knowledge effectively, thus improving their performance in a wide range of tasks. Owing to their power, LLMs require extremely expensive computational resources, large amounts of data, and long training time. Therefore, in recent years, researchers have proposed various methodologies to make scientific LLMs more affordable. The best-known approaches fall into two directions: focusing on the size of the models or enhancing the quality of the data. To date, a comprehensive review of these two families of methods has not been undertaken. In this paper, we (I) summarize the current advances in turning the emerging abilities of LLMs into more accessible AI solutions for science, and (II) investigate the challenges and opportunities of developing affordable solutions for scientific domains using LLMs.

Title: PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection

Authors: Tri Cao, Chengyu Huang, Yuexin Li, Huilin Wang, Amy He, Nay Oo, Bryan Hooi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2408.10738
Pdf URL: https://arxiv.org/pdf/2408.10738
Copy Paste: [[2408.10738]] PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection(https://arxiv.org/abs/2408.10738)
Keywords: security, attack, robust, steal, large language model
Abstract: Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also encounter notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the top k relevant items from offline knowledge bases, utilizing all available information from a webpage, including logos, HTML, and URLs. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.

Title: Security Assessment of Hierarchical Federated Deep Learning

Authors: D Alqattan, R Sun, H Liang, G Nicosia, V Snasel, R Ranjan, V Ojha
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2408.10752
Pdf URL: https://arxiv.org/pdf/2408.10752
Copy Paste: [[2408.10752]] Security Assessment of Hierarchical Federated Deep Learning(https://arxiv.org/abs/2408.10752)
Keywords: security, attack, robust, federate
Abstract: Hierarchical federated learning (HFL) is a promising distributed deep learning model training paradigm, but it has crucial security concerns arising from adversarial attacks. This research investigates and assesses the security of HFL using a novel methodology by focusing on its resilience against inference-time and training-time adversarial attacks. Through a series of extensive experiments across diverse datasets and attack scenarios, we uncover that HFL demonstrates robustness against untargeted training-time attacks due to its hierarchical structure. However, targeted attacks, particularly backdoor attacks, exploit this architecture, especially when malicious clients are positioned in the overlapping coverage areas of edge servers. Consequently, HFL shows a dual nature in its resilience, showcasing its capability to recover from attacks thanks to its hierarchical aggregation that strengthens its suitability for adversarial training, thereby reinforcing its resistance against inference-time attacks. These insights underscore the necessity for balanced security strategies in HFL systems, leveraging their inherent strengths while effectively mitigating vulnerabilities.

Title: Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation

Authors: Md Fahim Sikder, Resmi Ramachandranpillai, Daniel de Leng, Fredrik Heintz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10755
Pdf URL: https://arxiv.org/pdf/2408.10755
Copy Paste: [[2408.10755]] Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation(https://arxiv.org/abs/2408.10755)
Keywords: fair, diffusion, generative
Abstract: Data fairness is a crucial topic due to the recent wide usage of AI-powered applications. Most real-world data is filled with human or machine biases, and when such data is used to train AI models, there is a chance that the model will reflect the bias in the training data. Existing bias-mitigating generative methods based on GANs and diffusion models need in-processing fairness objectives and fail to consider computational overhead while choosing computationally heavy architectures, which may lead to high computational demands, instability, and poor optimization performance. To mitigate this issue, in this work, we present a fair data generation technique based on knowledge distillation, where we use a small architecture to distill the fair representation in the latent space. The idea of fair latent space distillation enables more flexible and stable training of Fair Generative Models (FGMs). We first learn a syntax-agnostic (for any data type) fair representation of the data, followed by distillation in the latent space into a smaller model. After distillation, we use the distilled fair latent space to generate high-fidelity fair synthetic data. While distilling, we employ quality loss (for fair distillation) and utility loss (for data utility) to ensure that the fairness and data utility characteristics remain in the distilled latent space. Our approaches show a 5%, 5% and 10% rise in performance in fairness, synthetic sample quality and data utility, respectively, compared to the state-of-the-art fair generative model.

Title: Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model

Authors: Chenhan Yuan, Fei Huang, Ru Peng, Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10764
Pdf URL: https://arxiv.org/pdf/2408.10764
Copy Paste: [[2408.10764]] Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model(https://arxiv.org/abs/2408.10764)
Keywords: transformer, large language model
Abstract: Transformer-based large language models (LLMs) exhibit limitations such as generating unsafe responses, unreliable reasoning, etc. Existing inference intervention approaches attempt to mitigate these issues by finetuning additional models to produce calibration signals (such as rewards) that guide the LLM's decoding process. However, this solution introduces substantial time and space overhead due to the separate models required. This work proposes Non-disruptive parameters insertion (Otter), inserting extra parameters into the transformer architecture to predict calibration signals along with the original LLM output. Otter offers state-of-the-art performance on multiple demanding tasks while saving up to 86.5% extra space and 98.5% extra time. Furthermore, Otter seamlessly integrates with existing inference engines, requiring only a one-line code change, and the original model response remains accessible after the parameter insertion. Our code is publicly available at this https URL.

Title: An Open Source Python Library for Anonymizing Sensitive Data

Authors: Judith Sáinz-Pardo Díaz, Álvaro López García
Subjects: cs.CR, cs.DB, cs.SE
Abstract URL: https://arxiv.org/abs/2408.10766
Pdf URL: https://arxiv.org/pdf/2408.10766
Copy Paste: [[2408.10766]] An Open Source Python Library for Anonymizing Sensitive Data(https://arxiv.org/abs/2408.10766)
Keywords: protect
Abstract: Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied on the given dataset, including the set of identifiers, quasi-identifiers, generalization hierarchies and allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for integration and continuous development, as well as the use of workflows to test code coverage based on unit and functional tests.

Title: Detection of Intracranial Hemorrhage for Trauma Patients

Authors: Antoine P. Sanner, Nils F. Grauhan, Marc A. Brockmann, Ahmed E. Othman, Anirban Mukhopadhyay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10768
Pdf URL: https://arxiv.org/pdf/2408.10768
Copy Paste: [[2408.10768]] Detection of Intracranial Hemorrhage for Trauma Patients(https://arxiv.org/abs/2408.10768)
Keywords: segmentation
Abstract: Whole-body CT is used for multi-trauma patients to search for any and all injuries. Since an initial assessment needs to be rapid and the search for lesions is done for the whole body, very little time can be allocated for the inspection of a specific anatomy. In particular, intracranial hemorrhages are still missed, especially by clinical students. In this work, we present a Deep Learning approach for highlighting such lesions to improve the diagnostic accuracy. While most works on intracranial hemorrhages perform segmentation, detection only requires bounding boxes for the localization of the bleeding. In this paper, we propose a novel Voxel-Complete IoU (VC-IoU) loss that encourages the network to learn the 3D aspect ratios of bounding boxes and leads to more precise detections. We extensively experiment on brain bleeding detection using a publicly available dataset, and validate it on a private cohort, where we achieve 0.877 AR30 and 0.728 AP30, and 0.653 AR30 and 0.514 AP30, respectively. These results constitute a relative +5% improvement in Average Recall for both datasets compared to other loss functions. Finally, as there is little data currently publicly available for 3D object detection and as annotation resources are limited in the clinical setting, we evaluate the cost of different annotation methods, as well as the impact of imprecise bounding boxes in the training data on the detection performance.
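
Editor's sketch: the abstract above builds on intersection-over-union for 3D bounding boxes. The snippet below shows only the plain 3D IoU for axis-aligned boxes as background; it is not the VC-IoU loss itself, whose aspect-ratio terms are not specified in the abstract. The function name `iou_3d` and the box layout `(x_min, y_min, z_min, x_max, y_max, z_max)` are assumptions made for illustration.

```python
from typing import Sequence


def box_volume(box: Sequence[float]) -> float:
    """Volume of an axis-aligned box (x_min, y_min, z_min, x_max, y_max, z_max)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Plain intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        lower = max(box_a[axis], box_b[axis])
        upper = min(box_a[axis + 3], box_b[axis + 3])
        if upper <= lower:
            # No overlap along this axis means no overlap at all.
            return 0.0
        intersection *= upper - lower
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union


if __name__ == "__main__":
    # Two 2x2x2 cubes offset by 1 along every axis: overlap 1, union 15.
    print(iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)))
```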

Title: Generative AI in Industrial Machine Vision -- A Review

Authors: Hans Aoyang Zhou, Dominik Wolfschläger, Constantinos Florides, Jonas Werheid, Hannes Behnen, Jan-Henrick Woltersmann, Tiago C. Pinto, Marco Kemmerling, Anas Abdelrazeq, Robert H. Schmitt
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2408.10775
Pdf URL: https://arxiv.org/pdf/2408.10775
Copy Paste: [[2408.10775]] Generative AI in Industrial Machine Vision -- A Review(https://arxiv.org/abs/2408.10775)
Keywords: robust, generative
Abstract: Machine vision enhances automation, quality control, and operational efficiency in industrial applications by enabling machines to interpret and act on visual data. While traditional computer vision algorithms and approaches remain widely utilized, machine learning has become pivotal in current research activities. In particular, generative AI demonstrates promising potential by improving pattern recognition capabilities, through data augmentation, increasing image resolution, and identifying anomalies for quality control. However, the application of generative AI in machine vision is still in its early stages due to challenges in data diversity, computational requirements, and the necessity for robust validation methods. A comprehensive literature review is essential to understand the current state of generative AI in industrial machine vision, focusing on recent advancements, applications, and research trends. Thus, a literature review based on the PRISMA guidelines was conducted, analyzing over 1,200 papers on generative AI in industrial machine vision. Our findings reveal various patterns in current research, with the primary use of generative AI being data augmentation, for machine vision tasks such as classification and object detection. Furthermore, we gather a collection of application challenges together with data requirements to enable a successful application of generative AI in industrial machine vision. This overview aims to provide researchers with insights into the different areas and applications within current research, highlighting significant advancements and identifying opportunities for future work.

Title: LightMDETR: A Lightweight Approach for Low-Cost Open-Vocabulary Object Detection Training

Authors: Binta Sow, Bilal Faye, Hanane Azzag, Mustapha Lebbah
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10787
Pdf URL: https://arxiv.org/pdf/2408.10787
Copy Paste: [[2408.10787]] LightMDETR: A Lightweight Approach for Low-Cost Open-Vocabulary Object Detection Training(https://arxiv.org/abs/2408.10787)
Keywords: robust
Abstract: Object detection in computer vision traditionally involves identifying objects in images. By integrating textual descriptions, we enhance this process, providing better context and accuracy. The MDETR model significantly advances this by combining image and text data for more versatile object detection and classification. However, MDETR's complexity and high computational demands hinder its practical use. In this paper, we introduce Lightweight MDETR (LightMDETR), an optimized MDETR variant designed for improved computational efficiency while maintaining robust multimodal capabilities. Our approach involves freezing the MDETR backbone and training a sole component, the Deep Fusion Encoder (DFE), to represent image and text modalities. A learnable context vector enables the DFE to switch between these modalities. Evaluation on datasets like RefCOCO, RefCOCO+, and RefCOCOg demonstrates that LightMDETR achieves superior precision and accuracy.

Title: Tapping in a Remote Vehicle's onboard LLM to Complement the Ego Vehicle's Field-of-View

Authors: Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10794
Pdf URL: https://arxiv.org/pdf/2408.10794
Copy Paste: [[2408.10794]] Tapping in a Remote Vehicle's onboard LLM to Complement the Ego Vehicle's Field-of-View(https://arxiv.org/abs/2408.10794)
Keywords: large language model
Abstract: Today's advanced automotive systems are turning into intelligent Cyber-Physical Systems (CPS), bringing computational intelligence to their cyber-physical context. Such systems power advanced driver assistance systems (ADAS) that observe a vehicle's surroundings for their functionality. However, such ADAS have clear limitations in scenarios where the direct line-of-sight to surrounding objects is occluded, as in urban areas. Imagine now automated driving (AD) systems that ideally could benefit from other vehicles' field-of-view in such occluded situations to increase traffic safety if, for example, the locations of pedestrians can be shared across vehicles. Current literature suggests vehicle-to-infrastructure (V2I) via roadside units (RSUs) or vehicle-to-vehicle (V2V) communication to address such issues that stream sensor or object data between vehicles. When considering the ongoing revolution in vehicle system architectures towards powerful, centralized processing units with hardware accelerators, foreseeing the onboard presence of large language models (LLMs) to improve the passengers' comfort when using voice assistants becomes a reality. We suggest and evaluate a concept to complement the ego vehicle's field-of-view (FOV) with another vehicle's FOV by tapping into their onboard LLM to let the machines have a dialogue about what the other vehicle "sees". Our results show that very recent versions of LLMs, such as GPT-4V and GPT-4o, understand a traffic situation to an impressive level of detail, and hence, they can be used even to spot traffic participants. However, better prompts are needed to improve the detection quality, and future work is needed towards a standardised message interchange format between vehicles.

Title: Adversarial Attack for Explanation Robustness of Rationalization Models

Authors: Yuankai Zhang, Lingxiao Kong, Haozhao Wang, Ruixuan Li, Jun Wang, Yuhua Li, Wei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10795
Pdf URL: https://arxiv.org/pdf/2408.10795
Copy Paste: [[2408.10795]] Adversarial Attack for Explanation Robustness of Rationalization Models(https://arxiv.org/abs/2408.10795)
Keywords: attack, robust, explainability
Abstract: Rationalization models, which select a subset of input text as a rationale that is crucial for humans to understand and trust predictions, have recently emerged as a prominent research area in eXplainable Artificial Intelligence. However, most previous studies focus on improving the quality of the rationale, ignoring its robustness to malicious attacks. Specifically, whether rationalization models can still generate high-quality rationales under adversarial attack remains unknown. To explore this, this paper proposes UAT2E, which aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. UAT2E employs a gradient-based search on triggers and then inserts them into the original input to conduct both non-target and target attacks. Experimental results on five datasets reveal the vulnerability of rationalization models in terms of explanation, where they tend to select more meaningless tokens under attacks. Based on this, we make a series of recommendations for improving rationalization models in terms of explanation.

Title: Honeyquest: Rapidly Measuring the Enticingness of Cyber Deception Techniques with Code-based Questionnaires

Authors: Mario Kahlhofer, Stefan Achleitner, Stefan Rass, René Mayrhofer
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2408.10796
Pdf URL: https://arxiv.org/pdf/2408.10796
Copy Paste: [[2408.10796]] Honeyquest: Rapidly Measuring the Enticingness of Cyber Deception Techniques with Code-based Questionnaires(https://arxiv.org/abs/2408.10796)
Keywords: security, attack
Abstract: Fooling adversaries with traps such as honeytokens can slow down cyber attacks and create strong indicators of compromise. Unfortunately, cyber deception techniques are often poorly specified. Also, realistically measuring their effectiveness requires a well-exposed software system together with a production-ready implementation of these techniques. This makes rapid prototyping challenging. Our work translates 13 previously researched and 12 self-defined techniques into a high-level, machine-readable specification. Our open-source tool, Honeyquest, allows researchers to quickly evaluate the enticingness of deception techniques without implementing them. We test the enticingness of 25 cyber deception techniques and 19 true security risks in an experiment with 47 humans. We successfully replicate the goals of previous work with many consistent findings, but without a time-consuming implementation of these techniques on real computer systems. We provide valuable insights for the design of enticing deception and also show that the presence of cyber deception can significantly reduce the risk that adversaries will find a true security risk by about 22% on average.

Title: MPL: Lifting 3D Human Pose from Multi-view 2D Poses

Authors: Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10805
Pdf URL: https://arxiv.org/pdf/2408.10805
Copy Paste: [[2408.10805]] MPL: Lifting 3D Human Pose from Multi-view 2D Poses(https://arxiv.org/abs/2408.10805)
Keywords: transformer
Abstract: Estimating 3D human poses from 2D images is challenging due to occlusions and projective acquisition. Learning-based approaches have been widely studied to address this challenge, both in single and multi-view setups. These solutions, however, fail to generalize to real-world cases due to the lack of (multi-view) 'in-the-wild' images paired with 3D poses for training. For this reason, we propose combining 2D pose estimation, for which large and rich training datasets exist, and 2D-to-3D pose lifting, using a transformer-based network that can be trained from synthetic 2D-3D pose pairs. Our experiments demonstrate decreases of up to 45% in MPJPE errors compared to the 3D pose obtained by triangulating the 2D poses. The framework's source code is available at this https URL.
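
Editor's sketch: the MPJPE figure quoted above is the mean per-joint position error, i.e. the average Euclidean distance between predicted and ground-truth joint positions. The snippet below is a minimal illustration of that metric for a single pose; the function name `mpjpe` and the list-of-tuples pose layout are assumptions, and the paper's evaluation protocol (alignment, units, averaging over frames) is not described in the abstract.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(predicted: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding 3D joints of one pose."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("poses must be non-empty and have the same joint count")
    # math.dist computes the Euclidean distance between two points.
    total = sum(math.dist(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
    print(mpjpe(pred, gt))  # per-joint errors are 0 and 5, so the mean is 2.5
```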

Title: ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering

Authors: Alex Gichamba, Tewodros Kederalah Idris, Brian Ebiyau, Eric Nyberg, Teruko Mitamura
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2408.10808
Pdf URL: https://arxiv.org/pdf/2408.10808
Copy Paste: [[2408.10808]] ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering(https://arxiv.org/abs/2408.10808)
Keywords: large language model
Abstract: Domain-specific question answering remains challenging for language models, given the deep technical knowledge required to answer questions correctly. This difficulty is amplified for smaller language models that cannot encode as much information in their parameters as larger models. The "Specializing Large Language Models for Telecom Networks" challenge aimed to enhance the performance of two small language models, Phi-2 and Falcon-7B in telecommunication question answering. In this paper, we present our question answering systems for this challenge. Our solutions achieved leading marks of 81.9% accuracy for Phi-2 and 57.3% for Falcon-7B. We have publicly released our code and fine-tuned models.

Title: Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?

Authors: Chengzhi Zhong, Fei Cheng, Qianying Liu, Junfeng Jiang, Zhen Wan, Chenhui Chu, Yugo Murawaki, Sadao Kurohashi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10811
Pdf URL: https://arxiv.org/pdf/2408.10811
Copy Paste: [[2408.10811]] Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?(https://arxiv.org/abs/2408.10811)
Keywords: large language model
Abstract: In this study, we investigate whether non-English-centric LLMs, despite their strong performance, "think" in their respective dominant language: more precisely, "think" refers to how the representations of intermediate layers, when un-embedded into the vocabulary space, exhibit higher probabilities for certain dominant languages during generation. We term such languages internal latent languages. We examine the latent language of three typical categories of models for Japanese processing: Llama2, an English-centric model; Swallow, an English-centric model with continued pre-training in Japanese; and LLM-jp, a model pre-trained on balanced English and Japanese corpora. Our empirical findings reveal that, unlike Llama2 which relies exclusively on English as the internal latent language, Japanese-specific Swallow and LLM-jp employ both Japanese and English, exhibiting dual internal latent languages. For any given target language, the model preferentially activates the latent language most closely related to it. In addition, we explore how intermediate layers respond to questions involving cultural conflicts between latent internal and target output languages. We further explore how the language identity shifts across layers while keeping consistent semantic meaning reflected in the intermediate layer representations. This study deepens the understanding of non-English-centric large language models, highlighting the intricate dynamics of language representation within their intermediate layers.

Title: Learning Randomized Algorithms with Transformers

Authors: Johannes von Oswald, Seijin Kobayashi, Yassir Akram, Angelika Steger
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10818
Pdf URL: https://arxiv.org/pdf/2408.10818
Copy Paste: [[2408.10818]] Learning Randomized Algorithms with Transformers(https://arxiv.org/abs/2408.10818)
Keywords: robust, transformer
Abstract: Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks' computation and predictions.
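
Note: the abstract above mentions that success probability can be amplified by repetition and majority voting. A minimal sketch of that arithmetic follows, assuming a yes/no problem and independent runs that are each correct with the same probability p. The function name `majority_success` is hypothetical and does not come from the paper.

```python
from math import comb

def majority_success(p: float, n: int) -> float:
    """Probability that a strict majority of n independent runs is correct.

    Assumes each run is correct with probability p, independently of the others.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of runs to avoid ties")
    # Sum the binomial probabilities of k correct runs for every k above n/2.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

single_run = majority_success(0.6, 1)
eleven_runs = majority_success(0.6, 11)
many_runs = majority_success(0.6, 101)
```

When p is above 1/2, the majority vote becomes more reliable as the number of runs grows; when p is below 1/2, repetition makes it worse.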

Title: Exploiting Large Language Models Capabilities for Question Answer-Driven Knowledge Graph Completion Across Static and Temporal Domains

Authors: Rui Yang, Jiahao Zhu, Jianping Man, Li Fang, Yi Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10819
Pdf URL: https://arxiv.org/pdf/2408.10819
Copy Paste: [[2408.10819]] Exploiting Large Language Models Capabilities for Question Answer-Driven Knowledge Graph Completion Across Static and Temporal Domains(https://arxiv.org/abs/2408.10819)
Keywords: generative, large language model
Abstract: Knowledge graph completion (KGC) aims to identify missing triples in a knowledge graph (KG). This is typically achieved through tasks such as link prediction and instance completion. However, these methods often focus on either static knowledge graphs (SKGs) or temporal knowledge graphs (TKGs), addressing only within-scope triples. This paper introduces a new generative completion framework called Generative Subgraph-based KGC (GS-KGC). GS-KGC employs a question-answering format to directly generate target entities, addressing the challenge of questions having multiple possible answers. We propose a strategy that extracts subgraphs centered on entities and relationships within the KG, from which negative samples and neighborhood information are separately obtained to address the one-to-many problem. Our method generates negative samples using known facts to facilitate the discovery of new information. Furthermore, we collect and refine neighborhood path data of known entities, providing contextual information to enhance reasoning in large language models (LLMs). Our experiments evaluated the proposed method on four SKGs and two TKGs, achieving state-of-the-art Hits@1 metrics on five datasets. Analysis of the results shows that GS-KGC can discover new triples within existing KGs and generate new facts beyond the closed KG, effectively bridging the gap between closed-world and open-world KGC.
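
Note: the abstract above reports Hits@1. A minimal sketch of the generic Hits@k metric follows, assuming one gold entity per query and a ranked list of candidates per query. The function name and the example entities are illustrative and are not taken from the paper.

```python
def hits_at_k(ranked_candidates, gold_entities, k=1):
    """Fraction of queries whose gold entity appears among the top-k candidates."""
    if not gold_entities:
        raise ValueError("need at least one query")
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_entities):
        # A query counts as a hit if the gold entity is in the first k predictions.
        if gold in candidates[:k]:
            hits += 1
    return hits / len(gold_entities)

# Three toy queries, each with two ranked predictions.
toy_ranked = [["Paris", "Lyon"], ["Berlin", "Munich"], ["Rome", "Milan"]]
toy_gold = ["Paris", "Munich", "Milan"]
toy_hits1 = hits_at_k(toy_ranked, toy_gold, k=1)
toy_hits2 = hits_at_k(toy_ranked, toy_gold, k=2)
```

Here only the first query has its gold entity ranked first, so Hits@1 is one third, while Hits@2 covers all three queries.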

Title: Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

Authors: Jianxiang Zhou, Erdong Liu, Wei Chen, Siru Zhong, Yuxuan Liang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2408.10822
Pdf URL: https://arxiv.org/pdf/2408.10822
Copy Paste: [[2408.10822]] Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting(https://arxiv.org/abs/2408.10822)
Keywords: transformer
Abstract: Traffic forecasting has emerged as a crucial research area in the development of smart cities. Although various neural networks with intricate architectures have been developed to address this problem, they still face two key challenges: i) Recent advancements in network designs for modeling spatio-temporal correlations are starting to see diminishing returns in performance enhancements. ii) Additionally, most models do not account for the spatio-temporal heterogeneity inherent in traffic data, i.e., traffic distribution varies significantly across different regions and traffic flow patterns fluctuate across various time slots. To tackle these challenges, we introduce the Spatio-Temporal Graph Transformer (STGormer), which effectively integrates attribute and structure information inherent in traffic data for learning spatio-temporal correlations, and a mixture-of-experts module for capturing heterogeneity along spatial and temporal axes. Specifically, we design two straightforward yet effective spatial encoding methods based on the graph structure and integrate time position encoding into the vanilla transformer to capture spatio-temporal traffic patterns. Additionally, a mixture-of-experts enhanced feedforward neural network (FNN) module adaptively assigns suitable expert layers to distinct patterns via a spatio-temporal gating network, further improving overall prediction accuracy. Experiments on five real-world datasets demonstrate that STGormer achieves state-of-the-art performance.

Title: Trustworthy Compression? Impact of AI-based Codecs on Biometrics for Law Enforcement

Authors: Sandra Bergmann, Denise Moussa, Christian Riess
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2408.10823
Pdf URL: https://arxiv.org/pdf/2408.10823
Copy Paste: [[2408.10823]] Trustworthy Compression? Impact of AI-based Codecs on Biometrics for Law Enforcement(https://arxiv.org/abs/2408.10823)
Keywords: robust, biometric
Abstract: Image-based biometrics can aid law enforcement in various aspects, for example in iris, fingerprint and soft-biometric recognition. A critical precondition for recognition is the availability of sufficient biometric information in images. It is visually apparent that strong JPEG compression removes such details. However, latest AI-based image compression seemingly preserves many image details even for very strong compression factors. Yet, these perceived details are not necessarily grounded in measurements, which raises the question whether these images can still be used for biometric recognition. In this work, we investigate how AI compression impacts iris, fingerprint and soft-biometric (fabrics and tattoo) images. We also investigate the recognition performance for iris and fingerprint images after AI compression. It turns out that iris recognition can be strongly affected, while fingerprint recognition is quite robust. The loss of detail is qualitatively best seen in fabrics and tattoos images. Overall, our results show that AI-compression still permits many biometric tasks, but attention to strong compression factors in sensitive tasks is advisable.

Title: Benchmarking Large Language Models for Math Reasoning Tasks

Authors: Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10839
Pdf URL: https://arxiv.org/pdf/2408.10839
Copy Paste: [[2408.10839]] Benchmarking Large Language Models for Math Reasoning Tasks(https://arxiv.org/abs/2408.10839)
Keywords: fair, large language model
Abstract: The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning independently from the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences the performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.

Title: Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models

Authors: Julius Pesonen, Teemu Hakala, Väinö Karjalainen, Niko Koivumäki, Lauri Markelin, Anna-Maria Raita-Hakola, Juha Suomalainen, Ilkka Pölönen, Eija Honkavaara
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10843
Pdf URL: https://arxiv.org/pdf/2408.10843
Copy Paste: [[2408.10843]] Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models(https://arxiv.org/abs/2408.10843)
Keywords: fair, segmentation
Abstract: Early detection of wildfires is essential to prevent large-scale fires resulting in extensive environmental, structural, and societal damage. Uncrewed aerial vehicles (UAVs) can cover large remote areas effectively with quick deployment requiring minimal infrastructure and equipping them with small cameras and computers enables autonomous real-time detection. In remote areas, however, the UAVs are limited to on-board computing for detection due to the lack of high-bandwidth mobile networks. This limits the detection to methods which are light enough for the on-board computer alone. For accurate camera-based localisation, segmentation of the detected smoke is essential but training data for deep learning-based wildfire smoke segmentation is limited. This study shows how small specialised segmentation models can be trained using only bounding box labels, leveraging zero-shot foundation model supervision. The method offers the advantages of needing only fairly easily obtainable bounding box labels and requiring training solely for the smaller student network. The proposed method achieved 63.3% mIoU on a manually annotated and diverse wildfire dataset. The used model can perform in real-time at ~11 fps with a UAV-carried NVIDIA Jetson Orin NX computer while reliably recognising smoke, demonstrated at real-world forest burning events. Code is available at this https URL
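
Note: the abstract above reports segmentation quality as mIoU. A minimal sketch of mean intersection-over-union on flattened label maps follows. It assumes integer class labels and skips classes that appear in neither prediction nor ground truth; the paper's exact evaluation protocol may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two equal-length label lists."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c, and pixels where either says class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either map")
    return sum(ious) / len(ious)

# Toy 4-pixel example: 0 = background, 1 = smoke.
toy_pred = [0, 0, 1, 1]
toy_target = [0, 1, 1, 1]
toy_miou = mean_iou(toy_pred, toy_target, num_classes=2)
```

In the toy example the background IoU is 1/2 and the smoke IoU is 2/3, so the mean is 7/12.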

Title: CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Authors: Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10845
Pdf URL: https://arxiv.org/pdf/2408.10845
Copy Paste: [[2408.10845]] CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving(https://arxiv.org/abs/2408.10845)
Keywords: robust, large language model
Abstract: Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purpose.

Title: Harmonizing Attention: Training-free Texture-aware Geometry Transfer

Authors: Eito Ikuta, Yohan Lee, Akihiro Iohara, Yu Saito, Toshiyuki Tanaka
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2408.10846
Pdf URL: https://arxiv.org/pdf/2408.10846
Copy Paste: [[2408.10846]] Harmonizing Attention: Training-free Texture-aware Geometry Transfer(https://arxiv.org/abs/2408.10846)
Keywords: diffusion
Abstract: Extracting geometry features from photographic images independently of surface texture and transferring them onto different materials remains a complex challenge. In this study, we introduce Harmonizing Attention, a novel training-free approach that leverages diffusion models for texture-aware geometry transfer. Our method employs a simple yet effective modification of self-attention layers, allowing the model to query information from multiple reference images within these layers. This mechanism is seamlessly integrated into the inversion process as Texture-aligning Attention and into the generation process as Geometry-aligning Attention. This dual-attention approach ensures the effective capture and transfer of material-independent geometry features while maintaining material-specific textural continuity, all without the need for model fine-tuning.

Title: Perception-guided Jailbreak against Text-to-Image Models

Authors: Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10848
Pdf URL: https://arxiv.org/pdf/2408.10848
Copy Paste: [[2408.10848]] Perception-guided Jailbreak against Text-to-Image Models(https://arxiv.org/abs/2408.10848)
Keywords: security, attack
Abstract: In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

Title: Knowledge Sharing and Transfer via Centralized Reward Agent for Multi-Task Reinforcement Learning

Authors: Haozhe Ma, Zhengding Luo, Thanh Vinh Vo, Kuankuan Sima, Tze-Yun Leong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10858
Pdf URL: https://arxiv.org/pdf/2408.10858
Copy Paste: [[2408.10858]] Knowledge Sharing and Transfer via Centralized Reward Agent for Multi-Task Reinforcement Learning(https://arxiv.org/abs/2408.10858)
Keywords: robust
Abstract: Reward shaping is effective in addressing the sparse-reward challenge in reinforcement learning by providing immediate feedback through auxiliary informative rewards. Based on the reward shaping strategy, we propose a novel multi-task reinforcement learning framework, that integrates a centralized reward agent (CRA) and multiple distributed policy agents. The CRA functions as a knowledge pool, which aims to distill knowledge from various tasks and distribute it to individual policy agents to improve learning efficiency. Specifically, the shaped rewards serve as a straightforward metric to encode knowledge. This framework not only enhances knowledge sharing across established tasks but also adapts to new tasks by transferring valuable reward signals. We validate the proposed method on both discrete and continuous domains, demonstrating its robustness in multi-task sparse-reward settings and its effective transferability to unseen tasks.

Title: Feature Selection from Differentially Private Correlations

Authors: Ryan Swope, Amol Khanna, Philip Doldo, Saptarshi Roy, Edward Raff
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2408.10862
Pdf URL: https://arxiv.org/pdf/2408.10862
Copy Paste: [[2408.10862]] Feature Selection from Differentially Private Correlations(https://arxiv.org/abs/2408.10862)
Keywords: privacy
Abstract: Data scientists often seek to identify the most important features in high-dimensional datasets. This can be done through $L_1$-regularized regression, but this can become inefficient for very high-dimensional datasets. Additionally, high-dimensional regression can leak information about individual datapoints in a dataset. In this paper, we empirically evaluate the established baseline method for feature selection with differential privacy, the two-stage selection technique, and show that it is not stable under sparsity. This makes it perform poorly on real-world datasets, so we consider a different approach to private feature selection. We employ a correlations-based order statistic to choose important features from a dataset and privatize them to ensure that the results do not leak information about individual datapoints. We find that our method significantly outperforms the established baseline for private feature selection on many datasets.

Title: Open 3D World in Autonomous Driving

Authors: Xinlong Cheng, Lei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10880
Pdf URL: https://arxiv.org/pdf/2408.10880
Copy Paste: [[2408.10880]] Open 3D World in Autonomous Driving(https://arxiv.org/abs/2408.10880)
Keywords: robust
Abstract: The capability for open vocabulary perception represents a significant advancement in autonomous driving systems, facilitating the comprehension and interpretation of a wide array of textual inputs in real-time. Despite extensive research in open vocabulary tasks within 2D computer vision, the application of such methodologies to 3D environments, particularly within large-scale outdoor contexts, remains relatively underdeveloped. This paper presents a novel approach that integrates 3D point cloud data, acquired from LIDAR sensors, with textual information. The primary focus is on the utilization of textual data to directly localize and identify objects within the autonomous driving context. We introduce an efficient framework for the fusion of bird's-eye view (BEV) region features with textual features, thereby enabling the system to seamlessly adapt to novel textual inputs and enhancing the robustness of open vocabulary detection tasks. The effectiveness of the proposed methodology is rigorously evaluated through extensive experimentation on the newly introduced NuScenes-T dataset, with additional validation of its zero-shot performance on the Lyft Level 5 dataset. This research makes a substantive contribution to the advancement of autonomous driving technologies by leveraging multimodal data to enhance open vocabulary perception in 3D environments, thereby pushing the boundaries of what is achievable in autonomous navigation and perception.

Title: Low-Quality Image Detection by Hierarchical VAE

Authors: Tomoyasu Nanaumi, Kazuhiko Kawamoto, Hiroshi Kera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10885
Pdf URL: https://arxiv.org/pdf/2408.10885
Copy Paste: [[2408.10885]] Low-Quality Image Detection by Hierarchical VAE(https://arxiv.org/abs/2408.10885)
Keywords: generative
Abstract: To make an employee roster, photo album, or training dataset of generative models, one needs to collect high-quality images while dismissing low-quality ones. This study addresses a new task of unsupervised detection of low-quality images. We propose a method that not only detects low-quality images with various types of degradation but also provides visual clues of them based on an observation that partial reconstruction by hierarchical variational autoencoders fails for low-quality images. The experiments show that our method outperforms several unsupervised out-of-distribution detection methods and also gives visual clues for low-quality images that help humans recognize them even in thumbnail view.

Title: ViLReF: A Chinese Vision-Language Retinal Foundation Model

Authors: Shengzhu Yang, Jiawei Du, Jia Guo, Weihang Zhang, Hanruo Liu, Huiqi Li, Ningli Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10894
Pdf URL: https://arxiv.org/pdf/2408.10894
Copy Paste: [[2408.10894]] ViLReF: A Chinese Vision-Language Retinal Foundation Model(https://arxiv.org/abs/2408.10894)
Keywords: extraction, segmentation
Abstract: Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model's learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: this https URL.

Title: A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

Authors: Zhongliang Guo, Lei Fang, Jingyu Lin, Yifei Qian, Shuai Zhao, Zeyu Wang, Junhao Dong, Cunjian Chen, Ognjen Arandjelović, Chun Pong Lau
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2408.10901
Pdf URL: https://arxiv.org/pdf/2408.10901
Copy Paste: [[2408.10901]] A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse(https://arxiv.org/abs/2408.10901)
Keywords: attack, robust, diffusion, generative
Abstract: Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raise concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign metric to prevent the underlying misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA) based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on the white-box information of target models to get rid of the implicit reliance on model-specific knowledge. By accessing merely a small amount of LDM parameters, specifically the VAE encoder of LDMs, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that is helpful in alleviating the socio-technical challenges posed by the rapidly evolving landscape of generative AI.

Title: Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs

Authors: John Mendonça, Isabel Trancoso, Alon Lavie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10902
Pdf URL: https://arxiv.org/pdf/2408.10902
Copy Paste: [[2408.10902]] Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs(https://arxiv.org/abs/2408.10902)
Keywords: large language model
Abstract: Although human evaluation remains the gold standard for open-domain dialogue evaluation, the growing popularity of automated evaluation using Large Language Models (LLMs) has also extended to dialogue. However, most frameworks leverage benchmarks that assess older chatbots on aspects such as fluency and relevance, which are not reflective of the challenges associated with contemporary models. In fact, a qualitative analysis on Soda, a GPT-3.5 generated dialogue dataset, suggests that current chatbots may exhibit several recurring issues related to coherence and commonsense knowledge, but generally produce highly fluent and relevant responses. Noting the aforementioned limitations, this paper introduces Soda-Eval, an annotated dataset based on Soda that covers over 120K turn-level assessments across 10K dialogues, where the annotations were generated by GPT-4. Using Soda-Eval as a benchmark, we then study the performance of several open-access instruction-tuned LLMs, finding that dialogue evaluation remains challenging. Fine-tuning these models improves performance over few-shot inferences, both in terms of correlation and explanation.

Title: BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model

Authors: Yeyong Yu, Rusheng Yu, Haojie Wei, Zhanqiu Zhang, Quan Qian
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2408.10903
Pdf URL: https://arxiv.org/pdf/2408.10903
Copy Paste: [[2408.10903]] BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model(https://arxiv.org/abs/2408.10903)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) has revolutionized role-playing, enabling the development of general role-playing models. However, current role-playing training has two significant issues: (I) Using a predefined role profile to prompt dialogue training for specific scenarios usually leads to inconsistencies and even conflicts between the dialogue and the profile, resulting in training biases. (II) The model learns to imitate the role based solely on the profile, neglecting profile-dialogue alignment at the sentence level. In this work, we propose a simple yet effective framework called BEYOND DIALOGUE, designed to overcome these hurdles. This framework innovatively introduces "beyond dialogue" tasks to align dialogue with profile traits based on each specific scenario, thereby eliminating biases during training. Furthermore, by adopting an innovative prompting mechanism that generates reasoning outcomes for training, the framework allows the model to achieve fine-grained alignment between profile and dialogue at the sentence level. The aforementioned methods are fully automated and low-cost. Additionally, the integration of automated dialogue and objective evaluation methods forms a comprehensive framework, paving the way for general role-playing. Experimental results demonstrate that our model excels in adhering to and reflecting various dimensions of role profiles, outperforming most proprietary general and specialized role-playing baselines. All code and datasets are available at this https URL.

Title: ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

Authors: Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10906
Pdf URL: https://arxiv.org/pdf/2408.10906
Copy Paste: [[2408.10906]] ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining(https://arxiv.org/abs/2408.10906)
Keywords: segmentation
Abstract: 3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for the 3D understanding directly in this representation space. To facilitate the research in this direction, we first build a large-scale dataset of 3DGS using the commonly used ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories, whose labels are in accordance with the respective datasets. The creation of this dataset utilized the compute equivalent of 2 GPU years on a TITAN XP GPU. We utilize our dataset for unsupervised pretraining and supervised finetuning for classification and segmentation tasks. To this end, we introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters. Through exhaustive experiments, we provide several valuable insights. In particular, we show that (1) the distribution of the optimized GS centroids significantly differs from the uniformly sampled point cloud (used for initialization) counterpart; (2) this change in distribution results in degradation in classification but improvement in segmentation tasks when using only the centroids; (3) to leverage additional Gaussian parameters, we propose Gaussian feature grouping in a normalized feature space, along with splats pooling layer, offering a tailored solution to effectively group and embed similar Gaussians, which leads to notable improvement in finetuning tasks.

Title: To Code, or Not To Code? Exploring Impact of Code in Pre-training

Authors: Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10914
Pdf URL: https://arxiv.org/pdf/2408.10914
Copy Paste: [[2408.10914]] To Code, or Not To Code? Exploring Impact of Code in Pre-training(https://arxiv.org/abs/2408.10914)
Keywords: generative
Abstract: Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in general LLMs' performance, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation". We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for generalization far beyond coding tasks and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in a relative increase of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, 6.6% improvement in generative win-rates, and a 12x boost in code performance respectively. Our work suggests investments in code quality and preserving code during pre-training have positive impacts.

Title: CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

Authors: Zijian Zhao, Tingwei Chen, Zhijie Cai, Hang Li, Xiaoyang Li, Qimei Chen, Guangxu Zhu
Subjects: cs.CV, cs.AI, cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2408.10919
Pdf URL: https://arxiv.org/pdf/2408.10919
Copy Paste: [[2408.10919]] CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network(https://arxiv.org/abs/2408.10919)
Keywords: privacy, protect
Abstract: In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain and cross-domain scenarios, including few-shot and zero-shot scenarios, and even works in a few-shot new-class scenario where the testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In the gesture recognition task, our CrossFi achieves an accuracy of 98.17% in the in-domain scenario, 91.72% in the one-shot cross-domain scenario, 64.81% in the zero-shot cross-domain scenario, and 84.75% in the one-shot new-class scenario. To facilitate future research, we will release the code for our model upon publication.

Title: Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

Authors: Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2408.10920
Pdf URL: https://arxiv.org/pdf/2408.10920
Copy Paste: [[2408.10920]] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations(https://arxiv.org/abs/2408.10920)
Keywords: interpretability
Abstract: The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
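
To make the idea of a magnitude-based (rather than direction-based) encoding concrete, here is a hand-built toy, not the learned RNN solution from the paper. It assumes one-hot token embeddings and a hypothetical per-position scale of 0.25; the function names `encode` and `decode` are illustrative only. Every position writes into the same vocabulary-sized vector, and positions are distinguished only by how large their contribution is.

```python
SCALE = 0.25  # assumed per-position shrink factor (hypothetical choice)


def encode(tokens, vocab_size):
    """Superimpose all tokens in one vector; position p contributes SCALE**p."""
    state = [0.0] * vocab_size
    for position, token in enumerate(tokens):
        state[token] += SCALE ** position
    return state


def decode(state, length):
    """Peel off tokens from the largest magnitude to the smallest."""
    state = list(state)  # work on a copy
    tokens = []
    for position in range(length):
        # The token stored at this position owns the largest remaining entry,
        # because all later positions together sum to less than SCALE**position.
        token = max(range(len(state)), key=state.__getitem__)
        tokens.append(token)
        state[token] -= SCALE ** position
    return tokens


sequence = [2, 0, 2, 1]
print(decode(encode(sequence, vocab_size=3), len(sequence)))
```

Note that token 2 appears at two positions and both share the same coordinate of the vector, so no fixed linear subspace isolates "the token at position 0" from "the token at position 2"; only the magnitudes separate them.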

Title: LBC: Language-Based-Classifier for Out-Of-Variable Generalization

Authors: Kangjun Noh, Baekryun Seong, Hoyoon Byun, Sungjin Song, Kyungwoo Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10923
Pdf URL: https://arxiv.org/pdf/2408.10923
Copy Paste: [[2408.10923]] LBC: Language-Based-Classifier for Out-Of-Variable Generalization(https://arxiv.org/abs/2408.10923)
Keywords: large language model
Abstract: Large Language Models (LLMs) have great success in natural language processing tasks such as response generation. However, their use in tabular data has been limited due to their inferior performance compared to traditional machine learning models (TMLs) such as XGBoost. We find that the pre-trained knowledge of LLMs enables them to interpret new variables that appear in a test without additional training, a capability central to the concept of Out-of-Variable (OOV). From the findings, we propose a Language-Based-Classifier (LBC), a classifier that maximizes the benefits of LLMs to outperform TMLs on OOV tasks. LBC employs three key methodological strategies: 1) Categorical changes to adjust data to better fit the model's understanding, 2) Advanced order and indicator to enhance data representation to the model, and 3) Using verbalizer to map logit scores to classes during inference to generate model predictions. These strategies, combined with the pre-trained knowledge of LBC, emphasize the model's ability to effectively handle OOV tasks. We empirically and theoretically validate the superiority of LBC. LBC is the first study to apply an LLM-based model to OOV tasks. The source code is at this https URL.
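
The third strategy, a verbalizer that maps logit scores to classes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the LBC implementation: the token logits are made-up numbers, and the names `classify`, `logsumexp`, and the label-to-token mapping are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def classify(token_logits, verbalizer):
    """Pool the logits of each class's label tokens, then normalize over classes.

    token_logits: dict mapping a vocabulary token to its logit (assumed input).
    verbalizer:   dict mapping a class label to the tokens that express it.
    """
    class_scores = {
        label: logsumexp([token_logits[t] for t in tokens])
        for label, tokens in verbalizer.items()
    }
    norm = logsumexp(list(class_scores.values()))
    probs = {label: math.exp(score - norm) for label, score in class_scores.items()}
    return max(probs, key=probs.get), probs


# Made-up logits: tokens outside the verbalizer (here "the") are simply ignored.
logits = {"yes": 2.0, "Yes": 2.0, "no": 1.0, "the": 5.0}
mapping = {"positive": ["yes", "Yes"], "negative": ["no"]}
label, probs = classify(logits, mapping)
print(label, probs)
```

Restricting the normalization to the verbalizer's tokens is what turns a free-form language model into a classifier over a fixed label set.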

Title: Large Point-to-Gaussian Model for Image-to-3D Generation

Authors: Longfei Lu, Huachen Gao, Tao Dai, Yaohua Zha, Zhi Hou, Junta Wu, Shu-Tao Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10935
Pdf URL: https://arxiv.org/pdf/2408.10935
Copy Paste: [[2408.10935]] Large Point-to-Gaussian Model for Image-to-3D Generation(https://arxiv.org/abs/2408.10935)
Keywords: diffusion
Abstract: Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map a 2D image to 3D Gaussian parameters, while regressing a 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model that takes as input the initial point cloud produced by a large 3D diffusion model conditioned on the 2D image and generates the Gaussian parameters for image-to-3D generation. The point cloud provides an initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D generation. Moreover, we present the Attention mechanism, Projection mechanism, and Point feature extractor, dubbed the APP block, for fusing the image features with point cloud features. The qualitative and quantitative experiments extensively demonstrate the effectiveness of the proposed approach on the GSO and Objaverse datasets, and show the proposed method achieves state-of-the-art performance.

Title: Robust Regression with Ensembles Communicating over Noisy Channels

Authors: Yuval Ben-Hur, Yuval Cassuto
Subjects: cs.LG, cs.DC, cs.IT
Abstract URL: https://arxiv.org/abs/2408.10942
Pdf URL: https://arxiv.org/pdf/2408.10942
Copy Paste: [[2408.10942]] Robust Regression with Ensembles Communicating over Noisy Channels(https://arxiv.org/abs/2408.10942)
Keywords: robust
Abstract: As machine-learning models grow in size, their implementation requirements cannot be met by a single computer system. This observation motivates distributed settings, in which intermediate computations are performed across a network of processing units, while the central node only aggregates their outputs. However, distributing inference tasks across low-precision or faulty edge devices, operating over a network of noisy communication channels, gives rise to serious reliability challenges. We study the problem of an ensemble of devices, implementing regression algorithms, that communicate through additive noisy channels in order to collaboratively perform a joint regression task. We define the problem formally, and develop methods for optimizing the aggregation coefficients for the parameters of the noise in the channels, which can potentially be correlated. Our results apply to the leading state-of-the-art ensemble regression methods: bagging and gradient boosting. We demonstrate the effectiveness of our algorithms on both synthetic and real-world datasets.
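
As a rough illustration of why aggregation coefficients should depend on channel noise, the sketch below uses plain inverse-variance weighting. This is a textbook simplification swapped in for illustration, not the paper's method: it assumes every device sends an unbiased prediction and that channel noises are uncorrelated, whereas the abstract explicitly covers correlated noise. All names are hypothetical.

```python
def inverse_variance_weights(noise_variances):
    """Weights proportional to 1/variance, normalized to sum to one."""
    inverse = [1.0 / v for v in noise_variances]
    total = sum(inverse)
    return [x / total for x in inverse]


def aggregate(received_predictions, weights):
    """Weighted sum computed at the central node."""
    return sum(w * r for w, r in zip(weights, received_predictions))


# Two devices: the second channel is three times noisier, so it gets less weight.
weights = inverse_variance_weights([1.0, 3.0])
print(weights)
print(aggregate([10.2, 9.4], weights))
```

With equal noise on all channels this reduces to the ordinary ensemble average; the noisier a channel, the less its device contributes.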

Title: SysBench: Can Large Language Models Follow System Messages?

Authors: Yanzhao Qin, Tao Zhang, Tao Zhang, Yanjun Shen, Wenjing Luo, Haoze Sun, Yan Zhang, Yujing Qiao, Weipeng Chen, Zenan Zhou, Wentao Zhang, Bin Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10943
Pdf URL: https://arxiv.org/pdf/2408.10943
Copy Paste: [[2408.10943]] SysBench: Can Large Language Models Follow System Messages?(https://arxiv.org/abs/2408.10943)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become instrumental across various applications, with the customization of these models to specific scenarios becoming increasingly critical. The system message, a fundamental component of LLMs, consists of carefully crafted instructions that guide the behavior of the model to meet intended goals. Despite the recognized potential of system messages to optimize AI-driven solutions, there is a notable absence of a comprehensive benchmark for evaluating how well different LLMs follow these system messages. To fill this gap, we introduce SysBench, a benchmark that systematically analyzes system message following ability in terms of three challenging aspects: constraint complexity, instruction misalignment and multi-turn stability. In order to enable effective evaluation, SysBench constructs multi-turn user conversations covering various interaction relationships, based on six common types of constraints from system messages in real-world scenarios. Our dataset contains 500 system messages from various domains, each paired with 5 turns of user conversations, which have been manually formulated and checked to guarantee high quality. SysBench provides extensive evaluation across various LLMs, measuring their ability to follow specified constraints given in system messages. The results highlight both the strengths and weaknesses of existing models, offering key insights and directions for future research. The open source library SysBench is available at this https URL.

Title: HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

Authors: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10945
Pdf URL: https://arxiv.org/pdf/2408.10945
Copy Paste: [[2408.10945]] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments(https://arxiv.org/abs/2408.10945)
Keywords: large language model
Abstract: High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on NVIDIA TESLA P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
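
The two-stage scheme described above (split a fixed token budget across image partitions, then keep the highest-scoring tokens in each) can be sketched as follows. This is an illustrative approximation, not the HiRED code: the scores stand in for attention values from the vision encoder, proportional allocation with largest-remainder rounding is an assumption, and the function names are hypothetical.

```python
def allocate_budget(partition_scores, total_budget):
    """Split total_budget across partitions in proportion to their scores."""
    total_score = sum(partition_scores)
    raw = [total_budget * s / total_score for s in partition_scores]
    budgets = [int(r) for r in raw]  # round down first
    # Hand the leftover tokens to the partitions with the largest remainders.
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(token_scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    top = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)[:k]
    return sorted(top)


# Two partitions; the first looks three times more informative (made-up scores).
budgets = allocate_budget([3.0, 1.0], total_budget=4)
kept = [
    select_tokens([0.1, 0.9, 0.3, 0.7, 0.5], budgets[0]),
    select_tokens([0.2, 0.8, 0.4], budgets[1]),
]
print(budgets, kept)
```

Because the selection happens before the language model stage, the cost of every later layer scales with the budget rather than with the original number of visual tokens.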

Title: GAIM: Attacking Graph Neural Networks via Adversarial Influence Maximization

Authors: Xiaodong Yang, Xiaoting Li, Huiyuan Chen, Yiwei Cai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2408.10948
Pdf URL: https://arxiv.org/pdf/2408.10948
Copy Paste: [[2408.10948]] GAIM: Attacking Graph Neural Networks via Adversarial Influence Maximization(https://arxiv.org/abs/2408.10948)
Keywords: attack
Abstract: Recent studies show that well-devised perturbations on graph structures or node features can mislead trained Graph Neural Network (GNN) models. However, these methods often overlook practical assumptions, over-rely on heuristics, or separate vital attack components. In response, we present GAIM, an integrated adversarial attack method conducted on a node feature basis while considering the strict black-box setting. Specifically, we define an adversarial influence function to theoretically assess the adversarial impact of node perturbations, thereby reframing the GNN attack problem into the adversarial influence maximization problem. In our approach, we unify the selection of the target node and the construction of feature perturbations into a single optimization problem, ensuring a unique and consistent feature perturbation for each target node. We leverage a surrogate model to transform this problem into a solvable linear programming task, streamlining the optimization process. Moreover, we extend our method to accommodate label-oriented attacks, broadening its applicability. Thorough evaluations on five benchmark datasets across three popular models underscore the effectiveness of our method in both untargeted and label-oriented targeted attacks. Through comprehensive analysis and ablation studies, we demonstrate the practical value and efficacy inherent to our design choices.

Title: KeySpace: Public Key Infrastructure Considerations in Interplanetary Networks

Authors: Joshua Smailes, Sebastian Köhler, Simon Birnbach, Martin Strohmeier, Ivan Martinovic
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2408.10963
Pdf URL: https://arxiv.org/pdf/2408.10963
Copy Paste: [[2408.10963]] KeySpace: Public Key Infrastructure Considerations in Interplanetary Networks(https://arxiv.org/abs/2408.10963)
Keywords: attack
Abstract: As satellite networks grow larger and begin to incorporate interplanetary communication, there is an increasing interest in the unsolved problem of how to approach PKI in these conditions. In this paper we explore the goals and requirements for implementing key management systems in satellite networks, focusing on megaconstellations and interplanetary networks. We design a set of standardized experiments which can be used to compare systems against one another for particular network topologies. Using these, we demonstrate that terrestrial PKI techniques are feasible in highly distributed interplanetary networks, showing that it is possible to configure PKI systems to achieve efficient low-latency connection establishment, and minimize the impact of attacks through effective revocations. We evaluate this by building the Deep Space Network Simulator (DSNS), a novel network simulator aimed at efficient simulation of large space networks. We run simulations evaluating connection establishment and key revocation under a wide range of PKI configurations. Finally, we propose and evaluate two additional configuration options: OCSP Hybrid, and the use of relay nodes as a firewall. Together these minimize the extent of the network an attacker can reach with a compromised key, and reduce the attacker's load on interplanetary relay links.

Title: Facial Demorphing via Identity Preserving Image Decomposition

Authors: Nitish Shukla, Arun Ross
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.10993
Pdf URL: https://arxiv.org/pdf/2408.10993
Copy Paste: [[2408.10993]] Facial Demorphing via Identity Preserving Image Decomposition(https://arxiv.org/abs/2408.10993)
Keywords: security, attack
Abstract: A face morph is created by combining face images that usually pertain to two distinct identities. The goal is to generate an image that can be matched with two identities, thereby undermining the security of a face recognition system. To deal with this problem, several morph attack detection techniques have been developed. But these methods do not extract any information about the underlying bonafides used to create them. Demorphing addresses this limitation. However, current demorphing techniques are mostly reference-based, i.e., they need an image of one of the identities to recover the other. In this work, we treat demorphing as an ill-posed decomposition problem. We propose a novel method that is reference-free and recovers the bonafides with high accuracy. Our method decomposes the morph into several identity-preserving feature components. A merger network then weighs and combines these components to recover the bonafides. Our method is observed to reconstruct high-quality bonafides in terms of definition and fidelity. Experiments on the CASIA-WebFace, SMDD and AMSL datasets demonstrate the effectiveness of our method.

Title: CTP-LLM: Clinical Trial Phase Transition Prediction Using Large Language Models

Authors: Michael Reinisch, Jianfeng He, Chenxi Liao, Sauleh Ahmad Siddiqui, Bei Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.10995
Pdf URL: https://arxiv.org/pdf/2408.10995
Copy Paste: [[2408.10995]] CTP-LLM: Clinical Trial Phase Transition Prediction Using Large Language Models(https://arxiv.org/abs/2408.10995)
Keywords: large language model
Abstract: New medical treatment development requires multiple phases of clinical trials. Despite the significant human and financial costs of bringing a drug to market, less than 20% of drugs in testing will make it from the first phase to final approval. Recent literature indicates that the design of the trial protocols significantly contributes to trial performance. We investigated Clinical Trial Outcome Prediction (CTOP) using trial design documents to predict phase transitions automatically. We propose CTP-LLM, the first Large Language Model (LLM) based model for CTOP. We also introduce the PhaseTransition (PT) Dataset, which labels trials based on their progression through the regulatory process and serves as a benchmark for CTOP evaluation. Our fine-tuned GPT-3.5-based model (CTP-LLM) predicts clinical trial phase transition by analyzing the trial's original protocol texts without requiring human-selected features. CTP-LLM achieves a 67% accuracy rate in predicting trial phase transitions across all phases and a 75% accuracy rate specifically in predicting the transition from Phase III to final approval. Our experimental performance highlights the potential of LLM-powered applications in forecasting clinical trial outcomes and assessing trial design.

Title: SenPa-MAE: Sensor Parameter Aware Masked Autoencoder for Multi-Satellite Self-Supervised Pretraining

Authors: Jonathan Prexl, Michael Schmitt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11000
Pdf URL: https://arxiv.org/pdf/2408.11000
Copy Paste: [[2408.11000]] SenPa-MAE: Sensor Parameter Aware Masked Autoencoder for Multi-Satellite Self-Supervised Pretraining(https://arxiv.org/abs/2408.11000)
Keywords: transformer
Abstract: This paper introduces SenPa-MAE, a transformer architecture that encodes the sensor parameters of an observed multispectral signal into the image embeddings. SenPa-MAE can be pre-trained on imagery of different satellites with non-matching spectral or geometrical sensor characteristics. To incorporate sensor parameters, we propose a versatile sensor parameter encoding module as well as a data augmentation strategy for the diversification of the pre-training dataset. This enables the model to effectively differentiate between various sensors and gain an understanding of sensor parameters and the correlation to the observed signal. Given the rising number of Earth observation satellite missions and the diversity in their sensor specifications, our approach paves the way towards a sensor-independent Earth observation foundation model. This opens up possibilities such as cross-sensor training and sensor-independent inference.

Title: MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Authors: Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2408.11001
Pdf URL: https://arxiv.org/pdf/2408.11001
Copy Paste: [[2408.11001]] MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning(https://arxiv.org/abs/2408.11001)
Keywords: diffusion
Abstract: Diffusion models have emerged as frontrunners in text-to-image generation for their impressive capabilities. Nonetheless, their fixed image resolution during training often leads to challenges in high-resolution image generation, such as semantic inaccuracies and object replication. This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image generation models towards efficient higher-resolution generation without additional fine-tuning or extra adaptation. Specifically, we employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions, allowing for high-resolution image generation in a coarse-to-fine manner. Moreover, by integrating dilated convolutions and noise re-scheduling, we further adapt the model's priors for higher resolution. The versatility and efficacy of MegaFusion make it universally applicable to both latent-space and pixel-space diffusion models, along with other derivative models. Extensive experiments confirm that MegaFusion significantly boosts the capability of existing models to produce images of megapixels and various aspect ratios, while only requiring about 40% of the original computational cost.

Title: While GitHub Copilot Excels at Coding, Does It Ensure Responsible Output?

Authors: Wen Cheng, Ke Sun, Xinyu Zhang, Wei Wang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2408.11006
Pdf URL: https://arxiv.org/pdf/2408.11006
Copy Paste: [[2408.11006]] While GitHub Copilot Excels at Coding, Does It Ensure Responsible Output?(https://arxiv.org/abs/2408.11006)
Keywords: security, attack, extraction, large language model
Abstract: The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, we successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, such as the GPT series, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks. The example code and attack samples from our research are provided at this https URL.

Title: Athena: Safe Autonomous Agents with Verbal Contrastive Learning

Authors: Tanmana Sadhu, Ali Pesaranghader, Yanan Chen, Dong Hoon Yi
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2408.11021
Pdf URL: https://arxiv.org/pdf/2408.11021
Copy Paste: [[2408.11021]] Athena: Safe Autonomous Agents with Verbal Contrastive Learning(https://arxiv.org/abs/2408.11021)
Keywords: large language model
Abstract: Due to emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their environments, and execute complex tasks using a selection of tools available to them. As the capabilities of the agents expand, ensuring their safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework which leverages the concept of verbal contrastive learning where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety while fulfilling a given task. The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step. Furthermore, due to the lack of existing benchmarks on the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across 8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source LLMs, indicates verbal contrastive learning and interaction-level critiquing improve the safety rate significantly.

Title: Scaling Law with Learning Rate Annealing

Authors: Howe Tissue, Venus Wang, Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11029
Pdf URL: https://arxiv.org/pdf/2408.11029
Copy Paste: [[2408.11029]] Scaling Law with Learning Rate Annealing(https://arxiv.org/abs/2408.11029)
Keywords: large language model
Abstract: We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps ($s$): $$L(s) = L_0 + A\cdot S_1^{-\alpha} - C\cdot S_2$$ where $S_1$ is the forward area and $S_2$ is the learning rate annealing area. This formulation takes into account two factors: (1) the forward scaling defined as in a typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and across any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides a theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedule and LR annealing. The resulting insights also serve as a guide for researchers to select critical LRS in advance by prediction using our equation. Most significantly, since all the points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step across any learning rate scheduler, while expending less than 1% of the computational cost required by the Chinchilla scaling law to fit language modeling loss. This approach greatly democratizes scaling law fitting and predicting in developing large language models.
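
The formula can be evaluated step by step once $S_1$ and $S_2$ are computed from a learning-rate schedule. The sketch below is a worked illustration under assumptions that the abstract does not spell out: $S_1$ is taken as the running sum of learning rates, and $S_2$ as the running sum of a momentum-smoothed LR decrease with a hypothetical `decay` parameter. The constants passed to `predicted_loss` are made-up values, not fitted ones.

```python
def areas(learning_rates, decay=0.999):
    """Return [(S1, S2)] after each step of the schedule.

    Assumed definitions (not given in the abstract):
      S1 = running sum of learning rates (forward area)
      S2 = running sum of m, where m accumulates LR decreases with momentum
    """
    s1 = s2 = momentum = 0.0
    previous = learning_rates[0]  # treat the first LR as the starting value
    history = []
    for lr in learning_rates:
        s1 += lr
        momentum = decay * momentum + (previous - lr)
        s2 += momentum
        previous = lr
        history.append((s1, s2))
    return history


def predicted_loss(s1, s2, L0, A, alpha, C):
    """L(s) = L0 + A * S1**(-alpha) - C * S2."""
    return L0 + A * s1 ** (-alpha) - C * s2


# Constant LR for two steps, then one annealing step (decay=1.0 keeps the math simple).
history = areas([1.0, 1.0, 0.5], decay=1.0)
s1, s2 = history[-1]
print(history)
print(predicted_loss(s1, s2, L0=2.0, A=1.0, alpha=1.0, C=0.2))
```

With a constant schedule $S_2$ stays at zero and the expression reduces to the forward power-law term alone; the annealing term only starts to lower the predicted loss once the learning rate drops.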

Title: Atmospheric Transport Modeling of CO$_2$ with Neural Networks

Authors: Vitus Benson, Ana Bastos, Christian Reimers, Alexander J. Winkler, Fanny Yang, Markus Reichstein
Subjects: cs.LG, cs.CV, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2408.11032
Pdf URL: https://arxiv.org/pdf/2408.11032
Copy Paste: [[2408.11032]] Atmospheric Transport Modeling of CO$_2$ with Neural Networks(https://arxiv.org/abs/2408.11032)
Keywords: transformer
Abstract: Accurately describing the distribution of CO$_2$ in the atmosphere with atmospheric tracer transport models is essential for greenhouse gas monitoring and verification support systems to aid implementation of international climate agreements. Large deep neural networks are poised to revolutionize weather prediction, which requires 3D modeling of the atmosphere. While similar in this regard, atmospheric transport modeling is subject to new challenges. Both stable predictions for longer time horizons and mass conservation throughout need to be achieved, while IO plays a larger role compared to computational costs. In this study we explore four different deep neural networks (UNet, GraphCast, Spherical Fourier Neural Operator and SwinTransformer) which have proven state-of-the-art in weather prediction to assess their usefulness for atmospheric tracer transport modeling. For this, we assemble the CarbonBench dataset, a systematic benchmark tailored for machine learning emulators of Eulerian atmospheric transport. Through architectural adjustments, we decouple the performance of our emulators from the distribution shift caused by a steady rise in atmospheric CO$_2$. More specifically, we center CO$_2$ input fields to zero mean and then use an explicit flux scheme and a mass fixer to assure mass balance. This design enables stable and mass conserving transport for over 6 months with all four neural network architectures. In our study, the SwinTransformer displays particularly strong emulation skill (90-day $R^2 > 0.99$), with physically plausible emulation even for forward runs of multiple years. This work paves the way towards high resolution forward and inverse modeling of inert trace gases with neural networks.
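
The two adjustments named above, centering the input field and applying a mass fixer, can be illustrated with a deliberately simple sketch. It assumes a flattened 1D field, per-cell air-mass weights, and a multiplicative fixer; the abstract does not specify the fixer's form or the flux scheme, so these choices and all names are hypothetical.

```python
def center(field):
    """Subtract the mean so the network sees a zero-mean input; return both parts."""
    mean = sum(field) / len(field)
    return [value - mean for value in field], mean


def fix_mass(predicted, cell_weights, target_total):
    """Rescale a predicted field so its weighted total matches target_total."""
    total = sum(p * w for p, w in zip(predicted, cell_weights))
    scale = target_total / total
    return [p * scale for p in predicted]


centered, mean = center([410.0, 412.0, 414.0])
print(centered, mean)

# The raw prediction drifted to half the expected mass; the fixer restores it.
fixed = fix_mass([1.0, 2.0, 3.0], cell_weights=[1.0, 1.0, 2.0], target_total=18.0)
print(fixed)
```

Centering removes the slowly rising background level from the inputs, while the fixer guarantees that small per-step errors cannot accumulate into a net gain or loss of tracer mass over long rollouts.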

Title: Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders

Authors: Yuan Xin, Zheng Li, Ning Yu, Dingfan Chen, Mario Fritz, Michael Backes, Yang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11046
Pdf URL: https://arxiv.org/pdf/2408.11046
Copy Paste: [[2408.11046]] Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders(https://arxiv.org/abs/2408.11046)
Keywords: privacy
Abstract: Despite being prevalent in the general field of Natural Language Processing (NLP), pre-trained language models inherently carry privacy and copyright concerns due to their nature of training on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks associated with pre-trained language encoders, specifically focusing on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders, an aspect largely overlooked in existing literature. Our study encompasses comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed, highlighting a privacy risk far greater than previously assumed. Alongside, we present in-depth analysis and insights toward guiding future researchers and practitioners in addressing the privacy considerations in developing pre-trained language models.

Title: MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Authors: Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Beidi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2408.11049
Pdf URL: https://arxiv.org/pdf/2408.11049
Copy Paste: [[2408.11049]] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding(https://arxiv.org/abs/2408.11049)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency without sacrificing performance, but conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that, surprisingly, SD can achieve speedup even in a high-throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high-throughput inference. Then, it leverages draft models with sparse KV cache to address the KV bottleneck that scales with both sequence length and batch size.
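
Editor's sketch: for readers unfamiliar with speculative decoding, the snippet below illustrates the standard accept/reject rule used to verify a single drafted token, which keeps the output distributed according to the target model. It is a generic illustration only; it does not implement MagicDec's drafting strategy or sparse KV cache, and the function name and toy distributions are assumptions.

```python
import numpy as np


def verify_draft_token(token, draft_probs, target_probs, rng):
    """Verify one token proposed by the draft model.

    Accept with probability min(1, p/q), where q and p are the draft and
    target probabilities of `token`. On rejection, resample from the
    normalized residual max(0, target - draft).
    Returns (accepted, output_token).
    """
    q = draft_probs[token]   # > 0 because the token was sampled from the draft
    p = target_probs[token]
    if rng.random() < min(1.0, p / q):
        return True, int(token)
    residual = np.maximum(target_probs - draft_probs, 0.0)
    residual = residual / residual.sum()
    return False, int(rng.choice(len(residual), p=residual))


# Toy vocabulary of three tokens (distributions are made up).
rng = np.random.default_rng(0)
draft = np.array([0.7, 0.2, 0.1])
target = np.array([0.5, 0.3, 0.2])

proposed = int(rng.choice(3, p=draft))
accepted, out_token = verify_draft_token(proposed, draft, target, rng)
```

When the draft and target distributions agree closely, most proposals are accepted and several tokens can be committed per target-model pass; that acceptance rate is what determines the speedup.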

Title: FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

Authors: Yunzhe Xu, Yiyuan Pan, Zhe Liu, Hesheng Wang
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2408.11051
Pdf URL: https://arxiv.org/pdf/2408.11051
Copy Paste: [[2408.11051]] FLAME: Learning to Navigate with Multimodal LLM in Urban Environments(https://arxiv.org/abs/2408.11051)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion rate on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards practical applications of MLLMs in embodied AI.

Title: NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency

Authors: Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, Yuki M. Asano
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2408.11054
Pdf URL: https://arxiv.org/pdf/2408.11054
Copy Paste: [[2408.11054]] NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency(https://arxiv.org/abs/2408.11054)
Keywords: segmentation
Abstract: We propose sorting patch representations across views as a novel self-supervised learning signal to improve pretrained representations. To this end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that enforces patch-level nearest neighbor consistency across a student and teacher model, relative to reference batches. Our method leverages a differentiable sorting method applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. We demonstrate that this method generates high-quality dense feature encoders and establish several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff.