Copy Paste: [[2408.16779]] Inductive Learning of Logical Theories with LLMs: A Complexity-graded Analysis(https://arxiv.org/abs/2408.16779)
Keywords: explainability, large language model
Abstract: This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs) with feedback from a formal inference engine, on logic theory induction. The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance. Integrating LLMs with formal methods is a promising frontier in the Natural Language Processing field, as an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules, poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While they can be complemented by formal systems, the properties delivered by LLMs regarding inductive learning, are not well understood and quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a SOTA Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle than theory complexity for the LLMs.
Title: Generative AI in Ship Design
Authors: Sahil Thakur, Navneet V Saxena, Prof Sitikantha Roy
Copy Paste: [[2408.16798]] Generative AI in Ship Design(https://arxiv.org/abs/2408.16798)
Keywords: generative
Abstract: The process of ship design is intricate, heavily influenced by the hull form which accounts for approximately 70% of the total cost. Traditional methods rely on human-driven iterative processes based on naval architecture principles and engineering analysis. In contrast, generative AI presents a novel approach, utilizing computational algorithms rooted in machine learning and artificial intelligence to optimize ship hull design. This report outlines the systematic creation of a generative AI for this purpose, involving steps such as dataset collection, model architecture selection, training, and validation. Utilizing the "SHIP-D" dataset, consisting of 30,000 hull forms, the report adopts the Gaussian Mixture Model (GMM) as the generative model architecture. GMMs offer a statistical framework to analyze data distribution, crucial for generating innovative ship designs efficiently. Overall, this approach holds promise in revolutionizing ship design by exploring a broader design space and integrating multidisciplinary optimization objectives effectively.
Title: HLogformer: A Hierarchical Transformer for Representing Log Data
Authors: Zhichao Hou, Mina Ghashami, Mikhail Kuznetsov, MohamadAli Torkamani
Copy Paste: [[2408.16803]] HLogformer: A Hierarchical Transformer for Representing Log Data(https://arxiv.org/abs/2408.16803)
Keywords: transformer
Abstract: Transformers have gained widespread acclaim for their versatility in handling diverse data structures, yet their application to log data remains underexplored. Log data, characterized by its hierarchical, dictionary-like structure, poses unique challenges when processed using conventional transformer models. Traditional methods often rely on manually crafted templates for parsing logs, a process that is labor-intensive and lacks generalizability. Additionally, the linear treatment of log sequences by standard transformers neglects the rich, nested relationships within log entries, leading to suboptimal representations and excessive memory usage. To address these issues, we introduce HLogformer, a novel hierarchical transformer framework specifically designed for log data. HLogformer leverages the hierarchical structure of log entries to significantly reduce memory costs and enhance representation learning. Unlike traditional models that treat log data as flat sequences, our framework processes log entries in a manner that respects their inherent hierarchical organization. This approach ensures comprehensive encoding of both fine-grained details and broader contextual relationships. Our contributions are threefold: First, HLogformer is the first framework to design a dynamic hierarchical transformer tailored for dictionary-like log data. Second, it dramatically reduces memory costs associated with processing extensive log sequences. Third, comprehensive experiments demonstrate that HLogformer more effectively encodes hierarchical contextual information, proving to be highly effective for downstream tasks such as synthetic anomaly detection and product recommendation.
Title: STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models
Copy Paste: [[2408.16807]] STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models(https://arxiv.org/abs/2408.16807)
Keywords: security, attack, robust
Abstract: The rapid proliferation of large-scale text-to-image generation (T2IG) models has led to concerns about their potential misuse in generating harmful content. Though many methods have been proposed for erasing undesired concepts from T2IG models, they only provide a false sense of security, as recent works demonstrate that concept-erased models (CEMs) can be easily deceived to generate the erased concept through adversarial attacks. The problem of adversarially robust concept erasing without significant degradation to model utility (ability to generate benign concepts) remains an unresolved challenge, especially in the white-box setting where the adversary has access to the CEM. To address this gap, we propose an approach called STEREO that involves two distinct stages. The first stage searches thoroughly enough for strong and diverse adversarial prompts that can regenerate an erased concept from a CEM, by leveraging robust optimization principles from adversarial training. In the second robustly erase once stage, we introduce an anchor-concept-based compositional objective to robustly erase the target concept at one go, while attempting to minimize the degradation on model utility. By benchmarking the proposed STEREO approach against four state-of-the-art concept erasure methods under three adversarial attacks, we demonstrate its ability to achieve a better robustness vs. utility trade-off. Our code and models are available at this https URL.
Title: See or Guess: Counterfactually Regularized Image Captioning
Copy Paste: [[2408.16809]] See or Guess: Counterfactually Regularized Image Captioning(https://arxiv.org/abs/2408.16809)
Keywords: interpretability, generative
Abstract: Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at this https URL.
Title: Secure Integration of 5G in Industrial Networks: State of the Art, Challenges and Opportunities
Authors: Sotiris Michaelides, Thomas Vogt, Martin Henze
Copy Paste: [[2408.16833]] Secure Integration of 5G in Industrial Networks: State of the Art, Challenges and Opportunities(https://arxiv.org/abs/2408.16833)
Keywords: secure, security
Abstract: The industrial landscape is undergoing a significant transformation, moving away from traditional wired fieldbus networks to cutting-edge 5G mobile networks. This transition, extending from local applications to company-wide use and spanning multiple factories, is driven by the promise of low-latency communication and seamless connectivity for various devices in industrial settings. However, besides these tremendous benefits, the integration of 5G as the communication infrastructure in industrial networks introduces a new set of risks and threats to the security of industrial systems. The inherent complexity of 5G systems poses unique challenges for ensuring a secure integration, surpassing those encountered with any technology previously utilized in industrial networks. Most importantly, the distinct characteristics of industrial networks, such as real-time operation, required safety guarantees, and high availability requirements, further complicate this task. As the industrial transition from wired to wireless networks is a relatively new concept, a lack of guidance and recommendations on securely integrating 5G renders many industrial systems vulnerable and exposed to threats associated with 5G. To address this situation, in this paper, we summarize the state-of-the-art and derive a set of recommendations for the secure integration of 5G into industrial networks based on a thorough analysis of the research landscape. Furthermore, we identify opportunities to utilize 5G to further enhance security and indicate remaining challenges, potentially identifying future academic potential
Title: Cyber Risk Assessment for Cyber-Physical Systems: A Review of Methodologies and Recommendations for Improved Assessment Effectiveness
Copy Paste: [[2408.16841]] Cyber Risk Assessment for Cyber-Physical Systems: A Review of Methodologies and Recommendations for Improved Assessment Effectiveness(https://arxiv.org/abs/2408.16841)
Keywords: security
Abstract: Cyber-Physical Systems (CPS) integrate physical and embedded systems with information and communication technology systems, monitoring and controlling physical processes with minimal human intervention. The connection to information and communication technology exposes CPS to cyber risks. It is crucial to assess these risks to manage them effectively. This paper reviews scholarly contributions to cyber risk assessment for CPS, analyzing how the assessment approaches were evaluated and investigating to what extent they meet the requirements of effective risk assessment. We identify gaps limiting the effectiveness of the assessment and recommend real-time learning from cybersecurity incidents. Our review covers twenty-eight papers published between 2014 and 2023, selected based on a three-step search. Our findings show that the reviewed cyber risk assessment methodologies revealed limited effectiveness due to multiple factors. These findings provide a foundation for further research to explore and address other factors impacting the quality of cyber risk assessment in CPS.
Title: Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis
Authors: Theodoros Kouzelis, Manos Plitsis, Mihalis A. Nikolaou, Yannis Panagakis
Copy Paste: [[2408.16845]] Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis(https://arxiv.org/abs/2408.16845)
Keywords: diffusion, generative
Abstract: Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper we address, the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.
Title: FineFACE: Fair Facial Attribute Classification Leveraging Fine-grained Features
Abstract: Published research highlights the presence of demographic bias in automated facial attribute classification algorithms, particularly impacting women and individuals with darker skin tones. Existing bias mitigation techniques typically require demographic annotations and often obtain a trade-off between fairness and accuracy, i.e., Pareto inefficiency. Facial attributes, whether common ones like gender or others such as "chubby" or "high cheekbones", exhibit high interclass similarity and intraclass variation across demographics leading to unequal accuracy. This requires the use of local and subtle cues using fine-grained analysis for differentiation. This paper proposes a novel approach to fair facial attribute classification by framing it as a fine-grained classification problem. Our approach effectively integrates both low-level local features (like edges and color) and high-level semantic features (like shapes and structures) through cross-layer mutual attention learning. Here, shallow to deep CNN layers function as experts, offering category predictions and attention regions. An exhaustive evaluation on facial attribute annotated datasets demonstrates that our FineFACE model improves accuracy by 1.32% to 1.74% and fairness by 67% to 83.6%, over the SOTA bias mitigation techniques. Importantly, our approach obtains a Pareto-efficient balance between accuracy and fairness between demographic groups. In addition, our approach does not require demographic annotations and is applicable to diverse downstream classification tasks. To facilitate reproducibility, the code and dataset information is available at this https URL.
Title: Revising Multimodal VAEs with Diffusion Decoders
Copy Paste: [[2408.16883]] Revising Multimodal VAEs with Diffusion Decoders(https://arxiv.org/abs/2408.16883)
Keywords: diffusion
Abstract: Multimodal VAEs often struggle with generating high-quality outputs, a challenge that extends beyond the inherent limitations of the VAE framework. The core issue lies in the restricted joint representation of the latent space, particularly when complex modalities like images are involved. Feedforward decoders, commonly used for these intricate modalities, inadvertently constrain the joint latent space, leading to a degradation in the quality of the other modalities as well. Although recent studies have shown improvement by introducing modality-specific representations, the issue remains significant. In this work, we demonstrate that incorporating a flexible diffusion decoder specifically for the image modality not only enhances the generation quality of the images but also positively impacts the performance of the other modalities that rely on feedforward decoders. This approach addresses the limitations imposed by conventional joint representations and opens up new possibilities for improving multimodal generation tasks using the multimodal VAE framework. Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities
Title: A Prototype Model of Zero-Trust Architecture Blockchain with EigenTrust-Based Practical Byzantine Fault Tolerance Protocol to Manage Decentralized Clinical Trials
Authors: Ashok Kumar Peepliwall, Hari Mohan Pandey, Surya Prakash, Anand A Mahajan, Sudhinder Singh Chowhan, Vinesh Kumar, Rahul Sharma
Copy Paste: [[2408.16885]] A Prototype Model of Zero-Trust Architecture Blockchain with EigenTrust-Based Practical Byzantine Fault Tolerance Protocol to Manage Decentralized Clinical Trials(https://arxiv.org/abs/2408.16885)
Keywords: secure, security
Abstract: The COVID-19 pandemic necessitated the emergence of decentralized Clinical Trials (DCTs) due to patient retention, accelerate trials, improve data accessibility, enable virtual care, and facilitate seamless communication through integrated systems. However, integrating systems in DCTs exposes clinical data to potential security threats, making them susceptible to theft at any stage, a high risk of protocol deviations, and monitoring issues. To mitigate these challenges, blockchain technology serves as a secure framework, acting as a decentralized ledger, creating an immutable environment by establishing a zero-trust architecture, where data are deemed untrusted until verified. In combination with Internet of Things (IoT)-enabled wearable devices, blockchain secures the transfer of clinical trial data on private blockchains during DCT automation and operations. This paper proposes a prototype model of the Zero-Trust Architecture Blockchain (z-TAB) to integrate patient-generated clinical trial data during DCT operation management. The EigenTrust-based Practical Byzantine Fault Tolerance (T-PBFT) algorithm has been incorporated as a consensus protocol, leveraging Hyperledger Fabric. Furthermore, the Internet of Things (IoT) has been integrated to streamline data processing among stakeholders within the blockchain platforms. Rigorous evaluation has been done to evaluate the quality of the system.
Title: LLaVA-Chef: A Multi-modal Generative Model for Food Recipes
Copy Paste: [[2408.16889]] LLaVA-Chef: A Multi-modal Generative Model for Food Recipes(https://arxiv.org/abs/2408.16889)
Keywords: generative, large language model
Abstract: In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model's recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretrained LLMs and prior works. A detailed qualitative analysis reveals that LLaVA-Chef generates more detailed recipes with precise ingredient mentions, compared to existing approaches.
Title: Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Abstract: Deepfakes, which employ GAN to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNN have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.
Title: Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD
Copy Paste: [[2408.16893]] Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD(https://arxiv.org/abs/2408.16893)
Keywords: robust
Abstract: Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-documents modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long document prediction significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on CorefUD 1.1 test set and surpass the best model from CRAC 2023 shared task of a comparable size by a large margin. Our nodel is available on GitHub: \url{this https URL}
Title: DLFormer: Enhancing Explainability in Multivariate Time Series Forecasting using Distributed Lag Embedding
Copy Paste: [[2408.16896]] DLFormer: Enhancing Explainability in Multivariate Time Series Forecasting using Distributed Lag Embedding(https://arxiv.org/abs/2408.16896)
Keywords: explainability
Abstract: . Most real-world variables are multivariate time series influenced by past values and explanatory factors. Consequently, predicting these time series data using artificial intelligence is ongoing. In particular, in fields such as healthcare and finance, where reliability is crucial, having understandable explanations for predictions is essential. However, achieving a balance between high prediction accuracy and intuitive explainability has proven challenging. Although attention-based models have limitations in representing the individual influences of each variable, these models can influence the temporal dependencies in time series prediction and the magnitude of the influence of individual variables. To address this issue, this study introduced DLFormer, an attention-based architecture integrated with distributed lag embedding, to temporally embed individual variables and capture their temporal influence. Through validation against various real-world datasets, DLFormer showcased superior performance improvements compared to existing attention-based high-performance models. Furthermore, comparing the relationships between variables enhanced the reliability of explainability.
Title: Ig3D: Integrating 3D Face Representations in Facial Expression Inference
Copy Paste: [[2408.16907]] Ig3D: Integrating 3D Face Representations in Facial Expression Inference(https://arxiv.org/abs/2408.16907)
Keywords: generative
Abstract: Reconstructing 3D faces with facial geometry from single images has allowed for major advances in animation, generative models, and virtual reality. However, this ability to represent faces with their 3D features is not as fully explored by the facial expression inference (FEI) community. This study therefore aims to investigate the impacts of integrating such 3D representations into the FEI task, specifically for facial expression classification and face-based valence-arousal (VA) estimation. To accomplish this, we first assess the performance of two 3D face representations (both based on the 3D morphable model, FLAME) for the FEI tasks. We further explore two fusion architectures, intermediate fusion and late fusion, for integrating the 3D face representations with existing 2D inference frameworks. To evaluate our proposed architecture, we extract the corresponding 3D representations and perform extensive tests on the AffectNet and RAF-DB datasets. Our experimental results demonstrate that our proposed method outperforms the state-of-the-art AffectNet VA estimation and RAF-DB classification tasks. Moreover, our method can act as a complement to other existing methods to boost performance in many emotion inference tasks.
Title: Analyzing Inference Privacy Risks Through Gradients in Machine Learning
Authors: Zhuohang Li, Andrew Lowy, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Bradley Malin, Ye Wang
Copy Paste: [[2408.16913]] Analyzing Inference Privacy Risks Through Gradients in Machine Learning(https://arxiv.org/abs/2408.16913)
Keywords: privacy, defense, attack
Abstract: In distributed learning settings, models are iteratively updated with shared gradients computed from potentially sensitive user data. While previous work has studied various privacy risks of sharing gradients, our paper aims to provide a systematic approach to analyze private information leakage from gradients. We present a unified game-based framework that encompasses a broad range of attacks including attribute, property, distributional, and user disclosures. We investigate how different uncertainties of the adversary affect their inferential power via extensive experiments on five datasets across various data modalities. Our results demonstrate the inefficacy of solely relying on data aggregation to achieve privacy against inference attacks in distributed learning. We further evaluate five types of defenses, namely, gradient pruning, signed gradient descent, adversarial perturbations, variational information bottleneck, and differential privacy, under both static and adaptive adversary settings. We provide an information-theoretic view for analyzing the effectiveness of these defenses against inference from gradients. Finally, we introduce a method for auditing attribute inference privacy, improving the empirical estimation of worst-case privacy through crafting adversarial canary records.
Title: ACE-2005-PT: Corpus for Event Extraction in Portuguese
Authors: Luís Filipe Cunha, Purificação Silvano, Ricardo Campos, Alípio Jorge
Copy Paste: [[2408.16928]] ACE-2005-PT: Corpus for Event Extraction in Portuguese(https://arxiv.org/abs/2408.16928)
Keywords: extraction
Abstract: Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To achieve this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by a linguist expert. This subset was then compared against our pipeline results which achieved exact and relaxed match scores of 70.55\% and 87.55\% respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
Title: Event Extraction for Portuguese: A QA-driven Approach using ACE-2005
Authors: Luís Filipe Cunha, Ricardo Campos, Alípio Jorge
Copy Paste: [[2408.16932]] Event Extraction for Portuguese: A QA-driven Approach using ACE-2005(https://arxiv.org/abs/2408.16932)
Keywords: extraction
Abstract: Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event's arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separated BERT-based models were fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. Firstly, we use a token classification model to detect event triggers. To extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese, producing a new corpus for Portuguese event extraction. To accomplish this, we developed an automatic translation pipeline. Our framework obtains F1 marks of 64.4 for trigger classification and 46.7 for argument classification setting, thus a new state-of-the-art reference for these tasks in Portuguese.
Title: Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge
Copy Paste: [[2408.16937]] Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge(https://arxiv.org/abs/2408.16937)
Keywords: large language model
Abstract: In this work, we investigate the effectiveness of injecting external knowledge to a large language model (LLM) to identify semantic plausibility of simple events. Specifically, we enhance the LLM with fine-grained entity types, event types and their definitions extracted from an external knowledge base. These knowledge are injected into our system via designed templates. We also augment the data to balance the label distribution and adapt the task setting to real world scenarios in which event mentions are expressed as natural language sentences. The experimental results show the effectiveness of the injected knowledge on modeling semantic plausibility of events. An error analysis further emphasizes the importance of identifying non-trivial entity and event types.
Title: Manipulating OpenFlow Link Discovery Packet Forwarding for Topology Poisoning
Authors: Mingming Chen, Thomas La Porta, Teryl Taylor, Frederico Araujo, Trent Jaeger
Copy Paste: [[2408.16940]] Manipulating OpenFlow Link Discovery Packet Forwarding for Topology Poisoning(https://arxiv.org/abs/2408.16940)
Keywords: security, defense, attack, steal
Abstract: Software-defined networking (SDN) is a centralized, dynamic, and programmable network management technology that enables flexible traffic control and scalability. SDN facilitates network administration through a centralized view of the underlying physical topology; tampering with this topology view can result in catastrophic damage to network management and security. To underscore this issue, we introduce Marionette, a new topology poisoning technique that manipulates OpenFlow link discovery packet forwarding to alter topology information. Our approach exposes an overlooked yet widespread attack vector, distinguishing itself from traditional link fabrication attacks that tamper, spoof, or relay discovery packets at the data plane. Unlike localized attacks observed in existing methods, our technique introduces a globalized topology poisoning attack that leverages control privileges. Marionette implements a reinforcement learning algorithm to compute a poisoned topology target, and injects flow entries to achieve a long-lived stealthy attack. Our evaluation shows that Marionette successfully attacks five open-source controllers and nine OpenFlow-based discovery protocols. Marionette overcomes the state-of-the-art topology poisoning defenses, showcasing a new class of topology poisoning that initiates on the control plane. This security vulnerability was ethically disclosed to OpenDaylight, and CVE-2024-37018 has been assigned.
Title: A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models
Copy Paste: [[2408.16942]] A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models(https://arxiv.org/abs/2408.16942)
Keywords: large language model
Abstract: The COVID-19 pandemic has exacerbated xenophobia, particularly Sinophobia, leading to widespread discrimination against individuals of Chinese descent. Large language models (LLMs) are pre-trained deep learning models used for natural language processing (NLP) tasks. The ability of LLMs to understand and generate human-like text makes them particularly useful for analysing social media data to detect and evaluate sentiments. We present a sentiment analysis framework utilising LLMs for longitudinal sentiment analysis of the Sinophobic sentiments expressed in X (Twitter) during the COVID-19 pandemic. The results show a significant correlation between the spikes in Sinophobic tweets, Sinophobic sentiments and surges in COVID-19 cases, revealing that the evolution of the pandemic influenced public sentiment and the prevalence of Sinophobic discourse. Furthermore, the sentiment analysis revealed a predominant presence of negative sentiments, such as annoyance and denial, which underscores the impact of political narratives and misinformation shaping public opinion. The lack of empathetic sentiment which was present in previous studies related to COVID-19 highlights the way the political narratives in media viewed the pandemic and how it blamed the Chinese community. Our study highlights the importance of transparent communication in mitigating xenophobic sentiments during global crises.
Title: Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection
Copy Paste: [[2408.16945]] Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection(https://arxiv.org/abs/2408.16945)
Keywords: protect, defense, attack
Abstract: In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on contextual information and keywords are bypassed, an occurrence our observations show happens frequently.
Title: An Empirical Study of Scaling Laws for Transfer
Copy Paste: [[2408.16947]] An Empirical Study of Scaling Laws for Transfer(https://arxiv.org/abs/2408.16947)
Keywords: transformer
Abstract: We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and understand how data availability affects capabilities.
Title: A Persistent Hierarchical Bloom Filter-based Framework for Authentication and Tracking of ICs
Copy Paste: [[2408.16950]] A Persistent Hierarchical Bloom Filter-based Framework for Authentication and Tracking of ICs(https://arxiv.org/abs/2408.16950)
Keywords: robust
Abstract: Detecting counterfeit integrated circuits (ICs) in unreliable supply chains demands robust tracking and authentication. Physical Unclonable Functions (PUFs) offer unique IC identifiers, but noise undermines their utility. This study introduces the Persistent Hierarchical Bloom Filter (PHBF) framework, ensuring swift and accurate IC authentication with an accuracy rate of 100% across the supply chain even with noisy PUF-generated signatures.
Title: Transient Fault Tolerant Semantic Segmentation for Autonomous Driving
Authors: Leonardo Iurada, Niccolò Cavagnero, Fernando Fernandes Dos Santos, Giuseppe Averta, Paolo Rech, Tatiana Tommasi
Abstract: Deep learning models are crucial for autonomous vehicle perception, but their reliability is challenged by algorithmic limitations and hardware faults. We address the latter by examining fault-tolerance in semantic segmentation models. Using established hardware fault models, we evaluate existing hardening techniques both in terms of accuracy and uncertainty and introduce ReLUMax, a novel simple activation function designed to enhance resilience against transient faults. ReLUMax integrates seamlessly into existing architectures without time overhead. Our experiments demonstrate that ReLUMax effectively improves robustness, preserving performance and boosting prediction confidence, thus contributing to the development of reliable autonomous driving systems.
Title: Discovery of False Data Injection Schemes on Frequency Controllers with Reinforcement Learning
Authors: Romesh Prasad, Malik Hassanaly, Xiangyu Zhang, Abhijeet Sahu
Copy Paste: [[2408.16958]] Discovery of False Data Injection Schemes on Frequency Controllers with Reinforcement Learning(https://arxiv.org/abs/2408.16958)
Keywords: attack
Abstract: While inverter-based distributed energy resources (DERs) play a crucial role in integrating renewable energy into the power system, they concurrently diminish the grid's system inertia, elevating the risk of frequency instabilities. Furthermore, smart inverters, interfaced via communication networks, pose a potential vulnerability to cyber threats if not diligently managed. To proactively fortify the power grid against sophisticated cyber attacks, we propose to employ reinforcement learning (RL) to identify potential threats and system vulnerabilities. This study concentrates on analyzing adversarial strategies for false data injection, specifically targeting smart inverters involved in primary frequency control. Our findings demonstrate that an RL agent can adeptly discern optimal false data injection methods to manipulate inverter settings, potentially causing catastrophic consequences.
Title: HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution
Authors: Masoomeh Aslahishahri, Jordan Ubbens, Ian Stavness
Copy Paste: [[2408.16959]] HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution(https://arxiv.org/abs/2408.16959)
Keywords: transformer
Abstract: In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from GAN literature. Processing two visual streams independently, we fuse self-attention and cross-attention blocks through a gating attention strategy. The model integrates a squeeze-and-excitation module to capture global context from the input images, facilitating long-range spatial interactions within window-based attention blocks. Long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109. Specifically, on the SUN80 dataset, our model achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution. The transformer-based model attains state-of-the-art results without the need for purpose-built subnetworks, knowledge distillation, or multi-stage training, emphasizing the potency of attention in meeting reference-based image super-resolution requirements.
Title: Contrastive Learning with Synthetic Positives
Copy Paste: [[2408.16965]] Contrastive Learning with Synthetic Positives(https://arxiv.org/abs/2408.16965)
Keywords: diffusion
Abstract: Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within the same class. However, its efficacy is constrained as the nearest neighbor algorithm primarily identifies ``easy'' positive pairs, where the representations are already closely located in the embedding space. In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (CLSP) that utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives. Through feature interpolation in the diffusion model sampling process, we generate images with distinct backgrounds yet similar semantic content to the anchor image. These images are considered ``hard'' positives for the anchor image, and when included as supplementary positives in the contrastive loss, they contribute to a performance improvement of over 2\% and 1\% in linear evaluation compared to the previous NNCLR and All4One methods across multiple benchmark datasets such as CIFAR10, achieving state-of-the-art methods. On transfer learning benchmarks, CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We believe CLSP establishes a valuable baseline for future SSL studies incorporating synthetic data in the training process.
Title: UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
Authors: Chao Wang, Neo Wu, Lin Ning, Luyang Liu, Jun Xie, Shawn O'Banion, Bradley Green
Copy Paste: [[2408.16966]] UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches(https://arxiv.org/abs/2408.16966)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data. These summaries capture essential user information such as preferences and interests, and therefore are invaluable for LLM-based personalization applications, such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and human evaluation which is often costly and time-consuming. To address these challenges, we introduce \UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp and Amazon Review). (2) A novel robust summarization method that leverages time-hierarchical summarizer and self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.
Title: MemLong: Memory-Augmented Retrieval for Long Text Modeling
Copy Paste: [[2408.16967]] MemLong: Memory-Augmented Retrieval for Long Text Modeling(https://arxiv.org/abs/2408.16967)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable ``ret-mem'' module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at this https URL
Title: Point Neuron Learning: A New Physics-Informed Neural Network Architecture
Copy Paste: [[2408.16969]] Point Neuron Learning: A New Physics-Informed Neural Network Architecture(https://arxiv.org/abs/2408.16969)
Keywords: interpretability
Abstract: Machine learning and neural networks have advanced numerous research domains, but challenges such as large training data requirements and inconsistent model performance hinder their application in certain scientific problems. To overcome these challenges, researchers have investigated integrating physics principles into machine learning models, mainly through: (i) physics-guided loss functions, generally termed as physics-informed neural networks, and (ii) physics-guided architectural design. While both approaches have demonstrated success across multiple scientific disciplines, they have limitations including being trapped to a local minimum, poor interpretability, and restricted generalizability. This paper proposes a new physics-informed neural network (PINN) architecture that combines the strengths of both approaches by embedding the fundamental solution of the wave equation into the network architecture, enabling the learned model to strictly satisfy the wave equation. The proposed point neuron learning method can model an arbitrary sound field based on microphone observations without any dataset. Compared to other PINN methods, our approach directly processes complex numbers and offers better interpretability and generalizability. We evaluate the versatility of the proposed architecture by a sound field reconstruction problem in a reverberant environment. Results indicate that the point neuron method outperforms two competing methods and can efficiently handle noisy environments with sparse microphone observations.
Title: Cross Fusion RGB-T Tracking with Bi-directional Adapter
Copy Paste: [[2408.16979]] Cross Fusion RGB-T Tracking with Bi-directional Adapter(https://arxiv.org/abs/2408.16979)
Keywords: transformer
Abstract: Many state-of-the-art RGB-T trackers have achieved remarkable results through modality fusion. However, these trackers often either overlook temporal information or fail to fully utilize it, resulting in an ineffective balance between multi-modal and temporal information. To address this issue, we propose a novel Cross Fusion RGB-T Tracking architecture (CFBT) that ensures the full participation of multiple modalities in tracking while dynamically fusing temporal information. The effectiveness of CFBT relies on three newly designed cross spatio-temporal information fusion modules: Cross Spatio-Temporal Augmentation Fusion (CSTAF), Cross Spatio-Temporal Complementarity Fusion (CSTCF), and Dual-Stream Spatio-Temporal Adapter (DSTA). CSTAF employs a cross-attention mechanism to enhance the feature representation of the template comprehensively. CSTCF utilizes complementary information between different branches to enhance target features and suppress background features. DSTA adopts the adapter concept to adaptively fuse complementary information from multiple branches within the transformer layer, using the RGB modality as a medium. These ingenious fusions of multiple perspectives introduce only less than 0.3\% of the total modal parameters, but they indeed enable an efficient balance between multi-modal and temporal information. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
Title: The Sample-Communication Complexity Trade-off in Federated Q-Learning
Copy Paste: [[2408.16981]] The Sample-Communication Complexity Trade-off in Federated Q-Learning(https://arxiv.org/abs/2408.16981)
Keywords: federate
Abstract: We consider the problem of federated Q-learning, where $M$ agents aim to collaboratively learn the optimal Q-function of an unknown infinite-horizon Markov decision process with finite state and action spaces. We investigate the trade-off between sample and communication complexities for the widely used class of intermittent communication algorithms. We first establish the converse result, where it is shown that a federated Q-learning algorithm that offers any speedup with respect to the number of agents in the per-agent sample complexity needs to incur a communication cost of at least an order of $\frac{1}{1-\gamma}$ up to logarithmic factors, where $\gamma$ is the discount factor. We also propose a new algorithm, called Fed-DVR-Q, which is the first federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Thus, together these results provide a complete characterization of the sample-communication complexity trade-off in federated Q-learning.
Title: AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Authors: Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li
Copy Paste: [[2408.16986]] AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding(https://arxiv.org/abs/2408.16986)
Keywords: large language model
Abstract: Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. This method mitigates distortion effects that arise from resizing images to a uniform resolution and dynamically optimizing the visual tokens input to the LLMs. Our model is capable of processing images with resolutions up to $1008\times 1008$. Extensive experiments across various datasets demonstrate that our method achieves impressive performance in handling vision-language tasks in both natural and text-related scenes. The source code and dataset are now publicly available at \url{this https URL}.
Title: Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios
Authors: Zhongyuan Wang, Richong Zhang, Zhijie Nie, Jaein Kim
Copy Paste: [[2408.16991]] Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios(https://arxiv.org/abs/2408.16991)
Keywords: large language model
Abstract: Recent Text-to-SQL methods leverage large language models (LLMs) by incorporating feedback from the database management system. While these methods effectively address execution errors in SQL queries, they struggle with database mismatches -- errors that do not trigger execution exceptions. Database mismatches include issues such as condition mismatches and stricter constraint mismatches, both of which are more prevalent in real-world scenarios. To address these challenges, we propose a tool-assisted agent framework for SQL inspection and refinement, equipping the LLM-based agent with two specialized tools: a retriever and a detector, designed to diagnose and correct SQL queries with database mismatches. These tools enhance the capability of LLMs to handle real-world queries more effectively. We also introduce Spider-Mismatch, a new dataset specifically constructed to reflect the condition mismatch problems encountered in real-world scenarios. Experimental results demonstrate that our method achieves the highest performance on the averaged results of the Spider and Spider-Realistic datasets in few-shot settings, and it significantly outperforms baseline methods on the more realistic dataset, Spider-Mismatch.
Title: Safety Layers of Aligned Large Language Models: The Key to LLM Security
Authors: Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li
Copy Paste: [[2408.17003]] Safety Layers of Aligned Large Language Models: The Key to LLM Security(https://arxiv.org/abs/2408.17003)
Keywords: secure, security, large language model
Abstract: Aligned LLMs are highly secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining this security is not well understood, further these models are vulnerable to security degradation when fine-tuned with non-malicious backdoor data or normal data. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers." We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameters scaling analysis to precisely locate the safety layers. Building on this understanding, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation. Our experiments demonstrate that this approach significantly preserves model security while maintaining performance and reducing computational resources compared to full fine-tuning.
Title: Error-controlled non-additive interaction discovery in machine learning models
Authors: Winston Chen, Yifan Jiang, William Stafford Noble, Yang Young Lu
Abstract: Machine learning (ML) models are powerful tools for detecting complex patterns within data, yet their "black box" nature limits their interpretability, hindering their use in critical domains like healthcare and finance. To address this challenge, interpretable ML methods have been developed to explain how features influence model predictions. However, these methods often focus on univariate feature importance, overlooking the complex interactions between features that ML models are capable of capturing. Recognizing this limitation, recent efforts have aimed to extend these methods to discover feature interactions, but existing approaches struggle with robustness and error control, especially under data perturbations. In this study, we introduce Diamond, a novel method for trustworthy feature interaction discovery. Diamond uniquely integrates the model-X knockoffs framework to control the false discovery rate (FDR), ensuring that the proportion of falsely discovered interactions remains low. We further address the challenges of using off-the-shelf interaction importance measures by proposing a calibration procedure that refines these measures to maintain the desired FDR. Diamond's applicability spans a wide range of ML models, including deep neural networks, tree-based models, and factorization-based models. Our empirical evaluations on both simulated and real datasets across various biomedical studies demonstrate Diamond's utility in enabling more reliable data-driven scientific discoveries. This method represents a significant step forward in the deployment of ML models for scientific innovation and hypothesis generation.
Title: Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling
Abstract: Self-Consistency (SC) is a widely used method to mitigate hallucinations in Large Language Models (LLMs) by sampling the LLM multiple times and outputting the most frequent solution. Despite its benefits, SC results in significant computational costs proportional to the number of samples generated. Previous early-stopping approaches, such as Early Stopping Self Consistency and Adaptive Consistency, have aimed to reduce these costs by considering output consistency, but they do not analyze the quality of the reasoning paths (RPs) themselves. To address this issue, we propose Reasoning-Aware Self-Consistency (RASC), an innovative early-stopping framework that dynamically adjusts the number of sample generations by considering both the output answer and the RPs from Chain of Thought (CoT) prompting. RASC assigns confidence scores sequentially to the generated samples, stops when certain criteria are met, and then employs weighted majority voting to optimize sample usage and enhance answer reliability. We comprehensively test RASC with multiple LLMs across varied QA datasets. RASC outperformed existing methods and significantly reduces sample usage by an average of 80% while maintaining or improving accuracy up to 5% compared to the original SC
Title: From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
Authors: Minxue Niu (1), Mimansa Jaiswal (2), Emily Mower Provost (1) ((1) University of Michigan, (2) Independent Researcher)
Copy Paste: [[2408.17026]] From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs(https://arxiv.org/abs/2408.17026)
Keywords: large language model
Abstract: Training emotion recognition models has relied heavily on human annotated data, which present diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT4, in automating or assisting emotion annotation. We compare GPT4 with supervised models and or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate the performance, of GPT-4 and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over humans across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filtering process to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
Title: ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images
Authors: Xiaoshuai Zhang, Zhicheng Wang, Howard Zhou, Soham Ghosh, Danushen Gnanapragasam, Varun Jampani, Hao Su, Leonidas Guibas
Copy Paste: [[2408.17027]] ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images(https://arxiv.org/abs/2408.17027)
Keywords: segmentation
Abstract: To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a volume rendering NeRF-like ray marching process. Using dense per pixel features we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with 2D-3D consistency -- condensing 3D NeRF representations into compact sets of decorated key points. We demonstrate that our pre-trained model provides good initialization for various 3D tasks including 3D classification and segmentation, outperforming other 3D pre-training methods by a significant margin. It also enables, by exploiting our sparse features, additional useful downstream tasks, such as matching 2D images to 3D scenes, detecting duplicate 3D scenes, and querying a repository of 3D scenes through natural language -- all quite efficiently and without any per-scene fine-tuning.
Title: Meta-UAD: A Meta-Learning Scheme for User-level Network Traffic Anomaly Detection
Authors: Tongtong Feng, Qi Qi, Lingqi Guo, Jingyu Wang
Copy Paste: [[2408.17031]] Meta-UAD: A Meta-Learning Scheme for User-level Network Traffic Anomaly Detection(https://arxiv.org/abs/2408.17031)
Keywords: security
Abstract: Accuracy anomaly detection in user-level network traffic is crucial for network security. Compared with existing models that passively detect specific anomaly classes with large labeled training samples, user-level network traffic contains sizeable new anomaly classes with few labeled samples and has an imbalance, self-similar, and data-hungry nature. Motivation on those limitations, in this paper, we propose \textit{Meta-UAD}, a Meta-learning scheme for User-level network traffic Anomaly Detection. Meta-UAD uses the CICFlowMeter to extract 81 flow-level statistical features and remove some invalid ones using cumulative importance ranking. Meta-UAD adopts a meta-learning training structure and learns from the collection of K-way-M-shot classification tasks, which can use a pre-trained model to adapt any new class with few samples by few iteration steps. We evaluate our scheme on two public datasets. Compared with existing models, the results further demonstrate the superiority of Meta-UAD with 15{\%} - 43{\%} gains in F1-score.
Title: Colaboot: A Cloud-based Diskless PC Booting Mechanism
Authors: Aditya Mitra, Anisha Ghosh, Sibi Chakkaravarthy Sethuraman, Devi Priya V S
Copy Paste: [[2408.17045]] Colaboot: A Cloud-based Diskless PC Booting Mechanism(https://arxiv.org/abs/2408.17045)
Keywords: security, attack
Abstract: Recent increases in endpoint-based security events and threats compelled enterprise operations to switch to virtual desktop infrastructure and web-based applications. In addition to reducing potential hazards, this has guaranteed a consistent desktop environment for every user. On the other hand, the attack surface is greatly increased because all endpoints are connected to the company network, which could harbor malware and other advanced persistent threats. This results in a considerable loss of system resources on each individual endpoint. Hence our work proposes a standard called Colaboot that enables machines throughout a company to boot from a single operating system in order to address these problems and guarantee a consistent operating system environment that could be easily updated to the most recent security patches across all work stations.
Title: Text-to-Image Generation Via Energy-Based CLIP
Copy Paste: [[2408.17046]] Text-to-Image Generation Via Energy-Based CLIP(https://arxiv.org/abs/2408.17046)
Keywords: robust, diffusion, generative
Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on Cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
Title: SPOQchain: Platform for Secure, Scalable, and Privacy-Preserving Supply Chain Tracing and Counterfeit Protection
Authors: Moritz Finke, Alexandra Dmitrienko, Jasper Stang
Copy Paste: [[2408.17049]] SPOQchain: Platform for Secure, Scalable, and Privacy-Preserving Supply Chain Tracing and Counterfeit Protection(https://arxiv.org/abs/2408.17049)
Keywords: secure, security, privacy, protect
Abstract: Product lifecycle tracing is increasingly in the focus of regulators and producers, as shown with the initiative of the Digital Product Pass. Likewise, new methods of counterfeit detection are developed that are, e.g., based on Physical Unclonable Functions (PUFs). In order to ensure trust and integrity of product lifecycle data, multiple existing supply chain tracing systems are built on blockchain technology. However, only few solutions employ secure identifiers such as PUFs. Furthermore, existing systems that publish the data of individual products, in part fully transparently, have a detrimental impact on scalability and the privacy of users. This work proposes SPOQchain, a novel blockchain-based platform that provides comprehensive lifecycle traceability and originality verification while ensuring high efficiency and user privacy. The improved efficiency is achieved by a sophisticated batching mechanism that removes lifecycle redundancies. In addition to the successful evaluation of SPOQchain's scalability, this work provides a comprehensive analysis of privacy and security aspects, demonstrating the need and qualification of SPOQchain for the future of supply chain tracing.
Title: Can We Leave Deepfake Data Behind in Training Deepfake Detector?
Copy Paste: [[2408.17052]] Can We Leave Deepfake Data Behind in Training Deepfake Detector?(https://arxiv.org/abs/2408.17052)
Keywords: generative
Abstract: The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we termed "blendfake", encouraging models to learn generic forgery artifacts like blending boundary. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called "1+1<2"). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? Intuitively, as deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data in training deepfake detectors seems counter-intuitive. In this paper, we rethink the role of blendfake in detecting deepfakes and formulate the process from "real to blendfake to deepfake" to be a progressive transition. Specifically, blendfake and deepfake can be explicitly delineated as the oriented pivot anchors between "real-to-fake" transitions. The accumulation of forgery information should be oriented and progressively increasing during this transition process. To this end, we propose an Oriented Progressive Regularizor (OPR) to establish the constraints that compel the distribution of anchors to be discretely arranged. Furthermore, we introduce feature bridging to facilitate the smooth transition between adjacent anchors. Extensive experiments confirm that our design allows leveraging forgery information from both blendfake and deepfake effectively and comprehensively.
Title: BTMuda: A Bi-level Multi-source unsupervised domain adaptation framework for breast cancer diagnosis
Authors: Yuxiang Yang, Xinyi Zeng, Pinxian Zeng, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang
Copy Paste: [[2408.17054]] BTMuda: A Bi-level Multi-source unsupervised domain adaptation framework for breast cancer diagnosis(https://arxiv.org/abs/2408.17054)
Keywords: robust, transformer
Abstract: Deep learning has revolutionized the early detection of breast cancer, resulting in a significant decrease in mortality rates. However, difficulties in obtaining annotations and huge variations in distribution between training sets and real scenes have limited their clinical applications. To address these limitations, unsupervised domain adaptation (UDA) methods have been used to transfer knowledge from one labeled source domain to the unlabeled target domain, yet these approaches suffer from severe domain shift issues and often ignore the potential benefits of leveraging multiple relevant sources in practical applications. To address these limitations, in this work, we construct a Three-Branch Mixed extractor and propose a Bi-level Multi-source unsupervised domain adaptation method called BTMuda for breast cancer diagnosis. Our method addresses the problems of domain shift by dividing domain shift issues into two levels: intra-domain and inter-domain. To reduce the intra-domain shift, we jointly train a CNN and a Transformer as two paths of a domain mixed feature extractor to obtain robust representations rich in both low-level local and high-level global information. As for the inter-domain shift, we redesign the Transformer delicately to a three-branch architecture with cross-attention and distillation, which learns domain-invariant representations from multiple domains. Besides, we introduce two alignment modules - one for feature alignment and one for classifier alignment - to improve the alignment process. Extensive experiments conducted on three public mammographic datasets demonstrate that our BTMuda outperforms state-of-the-art methods.
Title: LAR-IQA: A Lightweight, Accurate, and Robust No-Reference Image Quality Assessment Model
Authors: Nasim Jamshidi Avanaki, Abhijay Ghildiyal, Nabajeet Barman, Saman Zadtootaghaj
Copy Paste: [[2408.17057]] LAR-IQA: A Lightweight, Accurate, and Robust No-Reference Image Quality Assessment Model(https://arxiv.org/abs/2408.17057)
Keywords: robust
Abstract: Recent advancements in the field of No-Reference Image Quality Assessment (NR-IQA) using deep learning techniques demonstrate high performance across multiple open-source datasets. However, such models are typically very large and complex making them not so suitable for real-world deployment, especially on resource- and battery-constrained mobile devices. To address this limitation, we propose a compact, lightweight NR-IQA model that achieves state-of-the-art (SOTA) performance on ECCV AIM UHD-IQA challenge validation and test datasets while being also nearly 5.7 times faster than the fastest SOTA model. Our model features a dual-branch architecture, with each branch separately trained on synthetically and authentically distorted images which enhances the model's generalizability across different distortion types. To improve robustness under diverse real-world visual conditions, we additionally incorporate multiple color spaces during the training process. We also demonstrate the higher accuracy of recently proposed Kolmogorov-Arnold Networks (KANs) for final quality regression as compared to the conventional Multi-Layer Perceptrons (MLPs). Our evaluation considering various open-source datasets highlights the practical, high-accuracy, and robust performance of our proposed lightweight model. Code: this https URL.
Title: A Survey of the Self Supervised Learning Mechanisms for Vision Transformers
Copy Paste: [[2408.17059]] A Survey of the Self Supervised Learning Mechanisms for Vision Transformers(https://arxiv.org/abs/2408.17059)
Keywords: transformer
Abstract: Deep supervised learning models require high volume of labeled data to attain sufficiently good results. Although, the practice of gathering and annotating such big data is costly and laborious. Recently, the application of self supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most of the data is unlabeled, and the success of SSL thus relies in finding ways to improve this vast amount of unlabeled data available. Thus its better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models specifically in scenarios where there is less label data available. In this survey we thus develop a comprehensive taxonomy of systematically classifying the SSL techniques based upon their representations and pre-training tasks being applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.
Title: Efficient Image Restoration through Low-Rank Adaptation and Stable Diffusion XL
Copy Paste: [[2408.17060]] Efficient Image Restoration through Low-Rank Adaptation and Stable Diffusion XL(https://arxiv.org/abs/2408.17060)
Keywords: diffusion
Abstract: In this study, we propose an enhanced image restoration model, SUPIR, based on the integration of two low-rank adaptive (LoRA) modules with the Stable Diffusion XL (SDXL) framework. Our method leverages the advantages of LoRA to fine-tune SDXL models, thereby significantly improving image restoration quality and efficiency. We collect 2600 high-quality real-world images, each with detailed descriptive text, for training the model. The proposed method is evaluated on standard benchmarks and achieves excellent performance, demonstrated by higher peak signal-to-noise ratio (PSNR), lower learned perceptual image patch similarity (LPIPS), and higher structural similarity index measurement (SSIM) scores. These results underscore the effectiveness of combining LoRA with SDXL for advanced image restoration tasks, highlighting the potential of our approach in generating high-fidelity restored images.
Title: Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
Abstract: Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote\&Mix (\textbf{VoMix}), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models \textit{without any training}. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2$\times$ increase in throughput of existing ViT-H on ImageNet-1K and a 2.4$\times$ increase in throughput of existing ViT-L on Kinetics-400 video dataset, with a mere 0.3\% drop in top-1 accuracy.
Title: Instant Adversarial Purification with Adversarial Consistency Distillation
Authors: Chun Tong Lei, Hon Ming Yam, Zhongliang Guo, Chun Pong Lau
Copy Paste: [[2408.17064]] Instant Adversarial Purification with Adversarial Consistency Distillation(https://arxiv.org/abs/2408.17064)
Keywords: defense, diffusion
Abstract: Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve defense success rate of 74.19\% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that the GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA is sufficient.
Title: Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning
Copy Paste: [[2408.17070]] Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning(https://arxiv.org/abs/2408.17070)
Keywords: large language model
Abstract: Teaching new information to pre-trained large language models (PLM) is a crucial but challenging task. Model adaptation techniques, such as fine-tuning and parameter-efficient training have been shown to store new facts at a slow rate; continual learning is an option but is costly and prone to catastrophic forgetting. This work studies and quantifies how PLM may learn and remember new world knowledge facts that do not occur in their pre-training corpus, which only contains world knowledge up to a certain date. To that purpose, we first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates, along with two evaluation tasks in the form of causal language modeling and multiple choice questions (MCQ). We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information. We also explore the use of prefix-tuning for novel information learning, and analyze how much information can be stored within a given prefix. We show that a single fact can reliably be encoded within a single prefix, and that the prefix capacity increases with its length and with the base model size.
Title: MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
Copy Paste: [[2408.17072]] MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2408.17072)
Keywords: large language model
Abstract: In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe user's information needs. However, traditional context-based rewriting has minimal enhancement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that user's needs are also reflected in the gold document, retrieved documents and ground truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
Title: Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training
Authors: Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang
Copy Paste: [[2408.17081]] Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training(https://arxiv.org/abs/2408.17081)
Keywords: transformer, segmentation
Abstract: Recent Vision Mamba models not only have much lower complexity for processing higher resolution images and longer videos but also the competitive performance with Vision Transformers (ViTs). However, they are stuck into overfitting and thus only present up to base size (about 80M). It is still unclear how vanilla Vision Mamba (Vim) can be efficiently scaled up to larger sizes, which is essentially for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization, which empowers successfully scaling non-hierarchical Vision Mamba to a large size (about 300M) in a supervised setting. Specifically, our base and large-scale ShuffleMamba models can outperform the supervised ViTs of similar size by 0.8\% and 1.0\% classification accuracy on ImageNet1k, respectively, without auxiliary data. When evaluated on the ADE20K semantic segmentation and COCO detection tasks, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) \textit{Plug and play:} it does not change model architectures and will be omitted in inference. (2) \textit{Simple but effective:} it can improve the overfitting in Vim training and only introduce random token permutation operations. (3) \textit{Intuitive:} the token sequences in deeper layers are more likely to be shuffled as they are expected to be more semantic and less sensitive to patch positions. Code and models will be available at this https URL.
Title: FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition
Authors: Chen Hu, Jingjing Deng, Xianghua Xie, Xiaoke Ma
Copy Paste: [[2408.17090]] FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition(https://arxiv.org/abs/2408.17090)
Keywords: federate, generative
Abstract: Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (independently and identically distributed) data environments featuring multiple groups of images of different types. Specifically, heterogeneous data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We introduce a novel approach, FissionVAE, which decomposes the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we investigate the incorporation of hierarchical VAE architectures and demonstrate the use of heterogeneous decoder architectures within our model. We also explore strategies for setting the latent prior distributions to enhance the decomposition process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images of Earth. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.
Title: RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Authors: Avideep Mukherjee, Soumya Banerjee, Vinay P. Namboodiri, Piyush Rai
Copy Paste: [[2408.17095]] RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance(https://arxiv.org/abs/2408.17095)
Keywords: diffusion, generative
Abstract: Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constraint devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach's effectiveness in compact model size and excellent generation quality.
Title: UTrack: Multi-Object Tracking with Uncertain Detections
Authors: Edgardo Solano-Carrillo, Felix Sattler, Antje Alex, Alexander Klein, Bruno Pereira Costa, Angel Bueno Rodriguez, Jannis Stoppe
Copy Paste: [[2408.17098]] UTrack: Multi-Object Tracking with Uncertain Detections(https://arxiv.org/abs/2408.17098)
Keywords: security
Abstract: The tracking-by-detection paradigm is the mainstream in multi-object tracking, associating tracks to the predictions of an object detector. Although exhibiting uncertainty through a confidence score, these predictions do not capture the entire variability of the inference process. For safety and security critical applications like autonomous driving, surveillance, etc., knowing this predictive uncertainty is essential though. Therefore, we introduce, for the first time, a fast way to obtain the empirical predictive distribution during object detection and incorporate that knowledge in multi-object tracking. Our mechanism can easily be integrated into state-of-the-art trackers, enabling them to fully exploit the uncertainty in the detections. Additionally, novel association methods are introduced that leverage the proposed mechanism. We demonstrate the effectiveness of our contribution on a variety of benchmarks, such as MOT17, MOT20, DanceTrack, and KITTI.
Title: Sparse Uncertainty-Informed Sampling from Federated Streaming Data
Copy Paste: [[2408.17108]] Sparse Uncertainty-Informed Sampling from Federated Streaming Data(https://arxiv.org/abs/2408.17108)
Keywords: robust, federate
Abstract: We present a numerically robust, computationally efficient approach for non-I.I.D. data stream sampling in federated client systems, where resources are limited and labeled data for local model adaptation is sparse and expensive. The proposed method identifies relevant stream observations to optimize the underlying client model, given a local labeling budget, and performs instantaneous labeling decisions without relying on any memory buffering strategies. Our experiments show enhanced training batch diversity and an improved numerical robustness of the proposal compared to existing strategies over large-scale data streams, making our approach an effective and convenient solution in FL environments.
Title: Wireless Integrated Authenticated Communication System (WIA-Comm)
Authors: Amith N Bharadwaj, G Adarsh, Gurusatwik Bhatta N, Karan K, Vijay B T
Copy Paste: [[2408.17112]] Wireless Integrated Authenticated Communication System (WIA-Comm)(https://arxiv.org/abs/2408.17112)
Keywords: security
Abstract: The exponential increase in the number of devices connected to the internet globally has led to the requirement for the introduction of better and improved security measures for maintaining data integrity. The development of a wireless and authenticated communication system is required to overcome the safety threats and illegal access to the application system/data. The WIA-Comm System is the one that provides a bridge to control the devices at the application side. It has been designed to provide security by giving control rights only to the device whose MAC (physical) address has already been registered, so only authorized users can control the system. LoRa WAN technology has been used for wireless communication and Arduino IDE to develop the code for the required functionality.
Title: Multi-centric AI Model for Unruptured Intracranial Aneurysm Detection and Volumetric Segmentation in 3D TOF-MRI
Authors: Ashraya K. Indrakanti, Jakob Wasserthal, Martin Segeroth, Shan Yang, Victor Schulze-Zachau, Joshy Cyriac, Michael Bach, Marios Psychogios, Matthias A. Mutke
Copy Paste: [[2408.17115]] Multi-centric AI Model for Unruptured Intracranial Aneurysm Detection and Volumetric Segmentation in 3D TOF-MRI(https://arxiv.org/abs/2408.17115)
Keywords: segmentation
Abstract: Purpose: To develop an open-source nnU-Net-based AI model for combined detection and segmentation of unruptured intracranial aneurysms (UICA) in 3D TOF-MRI, and compare models trained on datasets with aneurysm-like differential diagnoses. Methods: This retrospective study (2020-2023) included 385 anonymized 3D TOF-MRI images from 364 patients (mean age 59 years, 60% female) at multiple centers plus 113 subjects from the ADAM challenge. Images featured untreated or possible UICAs and differential diagnoses. Four distinct training datasets were created, and the nnU-Net framework was used for model development. Performance was assessed on a separate test set using sensitivity and False Positive (FP)/case rate for detection, and DICE score and NSD (Normalized Surface Distance) with a 0.5mm threshold for segmentation. Statistical analysis included chi-square, Mann-Whitney-U, and Kruskal-Wallis tests, with significance set at p < 0.05. Results: Models achieved overall sensitivity between 82% and 85% and a FP/case rate of 0.20 to 0.31, with no significant differences (p = 0.90 and p = 0.16). The primary model showed 85% sensitivity and 0.23 FP/case rate, outperforming the ADAM-challenge winner (61%) and a nnU-Net trained on ADAM data (51%) in sensitivity (p < 0.05). It achieved a mean DICE score of 0.73 and an NSD of 0.84 for correctly detected UICA. Conclusions: Our open-source, nnU-Net-based AI model (available at https://doi.org/10.5281/zenodo.13386859) demonstrates high sensitivity, low false positive rates, and consistent segmentation accuracy for UICA detection and segmentation in 3D TOF-MRI, suggesting its potential to improve clinical diagnosis and for monitoring of UICA.
Title: Efficient Estimation of Unique Components in Independent Component Analysis by Matrix Representation
Copy Paste: [[2408.17118]] Efficient Estimation of Unique Components in Independent Component Analysis by Matrix Representation(https://arxiv.org/abs/2408.17118)
Keywords: extraction
Abstract: Independent component analysis (ICA) is a widely used method in various applications of signal processing and feature extraction. It extends principal component analysis (PCA) and can extract important and complicated components with small variances. One of the major problems of ICA is that the uniqueness of the solution is not guaranteed, unlike PCA. That is because there are many local optima in optimizing the objective function of ICA. It has been shown previously that the unique global optimum of ICA can be estimated from many random initializations by handcrafted thread computation. In this paper, the unique estimation of ICA is highly accelerated by reformulating the algorithm in matrix representation and reducing redundant calculations. Experimental results on artificial datasets and EEG data verified the efficiency of the proposed method.
Title: Traceable AI-driven Avatars Using Multi-factors of Physical World and Metaverse
Copy Paste: [[2408.17121]] Traceable AI-driven Avatars Using Multi-factors of Physical World and Metaverse(https://arxiv.org/abs/2408.17121)
Keywords: security, attack
Abstract: Metaverse allows users to delegate their AI models to an AI engine, which builds corresponding AI-driven avatars to provide immersive experience for other users. Since current authentication methods mainly focus on human-driven avatars and ignore the traceability of AI-driven avatars, attackers may delegate the AI models of a target user to an AI proxy program to perform impersonation attacks without worrying about being detected. In this paper, we propose an authentication method using multi-factors to guarantee the traceability of AI-driven avatars. Firstly, we construct a user's identity model combining the manipulator's iris feature and the AI proxy's public key to ensure that an AI-driven avatar is associated with its original manipulator. Secondly, we propose a chameleon proxy signature scheme that supports the original manipulator to delegate his/her signing ability to an AI proxy. Finally, we design three authentication protocols for avatars based on the identity model and the chameleon proxy signature to guarantee the virtual-to-physical traceability including both the human-driven and AI-driven avatars. Security analysis shows that the proposed signature scheme is unforgeability and the authentication method is able to defend against false accusation. Extensive evaluations show that the designed authentication protocols complete user login, avatar delegation, mutual authentication, and avatar tracing in about 1s, meeting the actual application needs and helping to mitigate impersonation attacks by AI-driven avatars.
Title: Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction
Copy Paste: [[2408.17129]] Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction(https://arxiv.org/abs/2408.17129)
Keywords: robust, interpretability
Abstract: Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.
Title: VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
Abstract: The Diffusion Transformers Models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weight into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. This leads to weight sub-vectors being incorrectly assigned to the same assignment, providing inconsistent gradients to the codebook and resulting in a suboptimal result. To address this challenge, VQ4DiT calculates the candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector based on the weighted average. Then, using the zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while calibrating the codebook. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
Title: Temporal and Interactive Modeling for Efficient Human-Human Motion Generation
Authors: Yabiao Wang, Shuo Wang, Jiangning Zhang, Ke Fan, Jiafu Wu, Zhengkai Jiang, Yong Liu
Copy Paste: [[2408.17135]] Temporal and Interactive Modeling for Efficient Human-Human Motion Generation(https://arxiv.org/abs/2408.17135)
Keywords: transformer
Abstract: Human-human motion generation is essential for understanding humans as social beings. Although several transformer-based methods have been proposed, they typically model each individual separately and overlook the causal relationships in temporal motion sequences. Furthermore, the attention mechanism in transformers exhibits quadratic computational complexity, significantly reducing their efficiency when processing long sequences. In this paper, we introduce TIM (Temporal and Interactive Modeling), an efficient and effective approach that presents the pioneering human-human motion generation model utilizing RWKV. Specifically, we first propose Causal Interactive Injection to leverage the temporal properties of motion sequences and avoid non-causal and cumbersome modeling. Then we present Role-Evolving Mixing to adjust to the ever-evolving roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman demonstrate that our method achieves superior performance. Notably, TIM has achieved state-of-the-art results using only 32% of InterGen's trainable parameters. Code will be available soon. Homepage: this https URL
Title: Leveraging Digital Twin Technologies for Public Space Protection and Vulnerability Assessment
Copy Paste: [[2408.17136]] Leveraging Digital Twin Technologies for Public Space Protection and Vulnerability Assessment(https://arxiv.org/abs/2408.17136)
Keywords: security, protect, attack, robust
Abstract: Over the recent years, the protection of the so-called `soft-targets', i.e. locations easily accessible by the general public with relatively low, though, security measures, has emerged as a rather challenging and increasingly important issue. The complexity and seriousness of this security threat growths nowadays exponentially, due to the emergence of new advanced technologies (e.g. Artificial Intelligence (AI), Autonomous Vehicles (AVs), 3D printing, etc.); especially when it comes to large-scale, popular and diverse public spaces. In this paper, a novel Digital Twin-as-a-Security-Service (DTaaSS) architecture is introduced for holistically and significantly enhancing the protection of public spaces (e.g. metro stations, leisure sites, urban squares, etc.). The proposed framework combines a Digital Twin (DT) conceptualization with additional cutting-edge technologies, including Internet of Things (IoT), cloud computing, Big Data analytics and AI. In particular, DTaaSS comprises a holistic, real-time, large-scale, comprehensive and data-driven security solution for the efficient/robust protection of public spaces, supporting: a) data collection and analytics, b) area monitoring/control and proactive threat detection, c) incident/attack prediction, and d) quantitative and data-driven vulnerability assessment. Overall, the designed architecture exhibits increased potential in handling complex, hybrid and combined threats over large, critical and popular soft-targets. The applicability and robustness of DTaaSS is discussed in detail against representative and diverse real-world application scenarios, including complex attacks to: a) a metro station, b) a leisure site, and c) a cathedral square.
Title: Flow Matching for Optimal Reaction Coordinates of Biomolecular System
Copy Paste: [[2408.17139]] Flow Matching for Optimal Reaction Coordinates of Biomolecular System(https://arxiv.org/abs/2408.17139)
Keywords: generative
Abstract: We present Flow Matching for Reaction Coordinates (FMRC), a novel deep learning algorithm designed to identify optimal reaction coordinates (RC) in biomolecular reversible dynamics. FMRC is based on the mathematical principles of lumpability and decomposability, which we reformulate into a conditional probability framework for efficient data-driven optimization using deep generative models. While FMRC does not explicitly learn the well-established transfer operator or its eigenfunctions, it can effectively encode the dynamics of leading eigenfunctions of the system transfer operator into its low-dimensional RC space. We further quantitatively compare its performance with several state-of-the-art algorithms by evaluating the quality of Markov State Models (MSM) constructed in their respective RC spaces, demonstrating the superiority of FMRC in three increasingly complex biomolecular systems. Finally, we discuss its potential applications in downstream applications such as enhanced sampling methods and MSM construction.
Title: Towards Hyper-parameter-free Federated Learning
Copy Paste: [[2408.17145]] Towards Hyper-parameter-free Federated Learning(https://arxiv.org/abs/2408.17145)
Keywords: federate
Abstract: The adaptive synchronization techniques in federated learning (FL) for scaled global model updates show superior performance over the vanilla federated averaging (FedAvg) scheme. However, existing methods employ additional tunable hyperparameters on the server to determine the scaling factor. A contrasting approach is automated scaling analogous to tuning-free step-size schemes in stochastic gradient descent (SGD) methods, which offer competitive convergence rates and exhibit good empirical performance. In this work, we introduce two algorithms for automated scaling of global model updates. In our first algorithm, we establish that a descent-ensuring step-size regime at the clients ensures descent for the server objective. We show that such a scheme enables linear convergence for strongly convex federated objectives. Our second algorithm shows that the average of objective values of sampled clients is a practical and effective substitute for the objective function value at the server required for computing the scaling factor, whose computation is otherwise not permitted. Our extensive empirical results show that the proposed methods perform at par or better than the popular federated learning algorithms for both convex and non-convex problems. Our work takes a step towards designing hyper-parameter-free federated learning.
Title: GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring
Authors: Emanuele Santellani, Martin Zach, Christian Sormann, Mattia Rossi, Andreas Kuhn, Friedrich Fraundorfer
Copy Paste: [[2408.17149]] GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring(https://arxiv.org/abs/2408.17149)
Keywords: robust, extraction
Abstract: The extraction of keypoints in images is at the basis of many computer vision applications, from localization to 3D reconstruction. Keypoints come with a score permitting to rank them according to their quality. While learned keypoints often exhibit better properties than handcrafted ones, their scores are not easily interpretable, making it virtually impossible to compare the quality of individual keypoints across methods. We propose a framework that can refine, and at the same time characterize with an interpretable score, the keypoints extracted by any method. Our approach leverages a modified robust Gaussian Mixture Model fit designed to both reject non-robust keypoints and refine the remaining ones. Our score comprises two components: one relates to the probability of extracting the same keypoint in an image captured from another viewpoint, the other relates to the localization accuracy of the keypoint. These two interpretable components permit a comparison of individual keypoints extracted across different methods. Through extensive experiments we demonstrate that, when applied to popular keypoint detectors, our framework consistently improves the repeatability of keypoints as well as their performance in homography and two/multiple-view pose recovery tasks.
Title: Investigating Privacy Leakage in Dimensionality Reduction Methods via Reconstruction Attack
Copy Paste: [[2408.17151]] Investigating Privacy Leakage in Dimensionality Reduction Methods via Reconstruction Attack(https://arxiv.org/abs/2408.17151)
Keywords: privacy, attack
Abstract: This study investigates privacy leakage in dimensionality reduction methods through a novel machine learning-based reconstruction attack. Employing an \emph{informed adversary} threat model, we develop a neural network capable of reconstructing high-dimensional data from low-dimensional embeddings. We evaluate six popular dimensionality reduction techniques: PCA, sparse random projection (SRP), multidimensional scaling (MDS), Isomap, $t$-SNE, and UMAP. Using both MNIST and NIH Chest X-ray datasets, we perform a qualitative analysis to identify key factors affecting reconstruction quality. Furthermore, we assess the effectiveness of an additive noise mechanism in mitigating these reconstruction attacks.
Title: The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information
Copy Paste: [[2408.17163]] The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information(https://arxiv.org/abs/2408.17163)
Keywords: transformer
Abstract: The rising footprint of machine learning has led to a focus on imposing \emph{model sparsity} as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework~\citep{lecun90brain, hassibi1992second, hassibi1993optimal}, which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that comes with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in OBS-like fashion upon the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads both to improved convergence bounds under standard assumptions. Furthermore, we present extensions of this approach to the practical task of obtaining accurate sparse DNNs, and validate it experimentally at scale for Transformer-based models on vision and language tasks.
Title: Efficient Testable Learning of General Halfspaces with Adversarial Label Noise
Authors: Ilias Diakonikolas, Daniel M. Kane, Sihan Liu, Nikos Zarifis
Copy Paste: [[2408.17165]] Efficient Testable Learning of General Halfspaces with Adversarial Label Noise(https://arxiv.org/abs/2408.17165)
Keywords: robust
Abstract: We study the task of testable learning of general -- not necessarily homogeneous -- halfspaces with adversarial label noise with respect to the Gaussian distribution. In the testable learning framework, the goal is to develop a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data.Our main result is the first polynomial time tester-learner for general halfspaces that achieves dimension-independent misclassification error. At the heart of our approach is a new methodology to reduce testable learning of general halfspaces to testable learning of nearly homogeneous halfspaces that may be of broader interest.
Title: Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study
Authors: Shubham Agarwal, Thomas Searle, Mart Ratas, Anthony Shek, James Teo, Richard Dobson
Copy Paste: [[2408.17181]] Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study(https://arxiv.org/abs/2408.17181)
Keywords: extraction, transformer
Abstract: Electronic Health Records are large repositories of valuable clinical data, with a significant portion stored in unstructured text format. This textual data includes clinical events (e.g., disorders, symptoms, findings, medications and procedures) in context that if extracted accurately at scale can unlock valuable downstream applications such as disease prediction. Using an existing Named Entity Recognition and Linking methodology, MedCAT, these identified concepts need to be further classified (contextualised) for their relevance to the patient, and their temporal and negated status for example, to be useful downstream. This study performs a comparative analysis of various natural language models for medical text classification. Extensive experimentation reveals the effectiveness of transformer-based language models, particularly BERT. When combined with class imbalance mitigation techniques, BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes. The method has been implemented as part of CogStack/MedCAT framework and made available to the community for further research.
Title: Secure Ownership Management and Transfer of Consumer Internet of Things Devices with Self-sovereign Identity
Copy Paste: [[2408.17184]] Secure Ownership Management and Transfer of Consumer Internet of Things Devices with Self-sovereign Identity(https://arxiv.org/abs/2408.17184)
Keywords: secure, security
Abstract: The popularity of the Internet of Things (IoT) has driven its usage in our homes and industries over the past 10-12 years. However, there have been some major issues related to identity management and ownership transfer involving IoT devices, particularly for consumer IoT devices, e. g. smart appliances such as smart TVs, smart refrigerators, and so on. There have been a few attempts to address this issue; however, user-centric and effective ownership and identity management of IoT devices have not been very successful so far. Recently, blockchain technology has been used to address these issues with limited success. This article presents a Self-sovereign Identity (SSI) based system that facilitates a secure and user-centric ownership management and transfer of consumer IoT devices. The system leverages a number of emerging technologies, such as blockchain and decentralized identifiers (DID), verifiable credentials (VC), under the umbrella of SSI. We present the architecture of the system based on a threat model and requirement analysis, discuss the implementation of a Proof-of-Concept based on the proposed system and illustrate a number of use-cases with their detailed protocol flows. Furthermore, we analyse its security using ProVerif, a state-of-the art protocol verification tool and examine its performance.
Title: Democratizing AI in Africa: FL for Low-Resource Edge Devices
Authors: Jorge Fabila, Víctor M. Campello, Carlos Martín-Isla, Johnes Obungoloch, Kinyera Leo, Amodoi Ronald, Karim Lekadir
Copy Paste: [[2408.17216]] Democratizing AI in Africa: FL for Low-Resource Edge Devices(https://arxiv.org/abs/2408.17216)
Keywords: federate
Abstract: Africa faces significant challenges in healthcare delivery due to limited infrastructure and access to advanced medical technologies. This study explores the use of federated learning to overcome these barriers, focusing on perinatal health. We trained a fetal plane classifier using perinatal data from five African countries: Algeria, Ghana, Egypt, Malawi, and Uganda, along with data from Spanish hospitals. To incorporate the lack of computational resources in the analysis, we considered a heterogeneous set of devices, including a Raspberry Pi and several laptops, for model training. We demonstrate comparative performance between a centralized and a federated model, despite the compute limitations, and a significant improvement in model generalizability when compared to models trained only locally. These results show the potential for a future implementation at a large scale of a federated learning platform to bridge the accessibility gap and improve model generalizability with very little requirements.
Title: How Could Generative AI Support Compliance with the EU AI Act? A Review for Safe Automated Driving Perception
Copy Paste: [[2408.17222]] How Could Generative AI Support Compliance with the EU AI Act? A Review for Safe Automated Driving Perception(https://arxiv.org/abs/2408.17222)
Keywords: robust, generative
Abstract: Deep Neural Networks (DNNs) have become central for the perception functions of autonomous vehicles, substantially enhancing their ability to understand and interpret the environment. However, these systems exhibit inherent limitations such as brittleness, opacity, and unpredictable behavior in out-of-distribution scenarios. The European Union (EU) Artificial Intelligence (AI) Act, as a pioneering legislative framework, aims to address these challenges by establishing stringent norms and standards for AI systems, including those used in autonomous driving (AD), which are categorized as high-risk AI. In this work, we explore how the newly available generative AI models can potentially support addressing upcoming regulatory requirements in AD perception, particularly with respect to safety. This short review paper summarizes the requirements arising from the EU AI Act regarding DNN-based perception systems and systematically categorizes existing generative AI applications in AD. While generative AI models show promise in addressing some of the EU AI Acts requirements, such as transparency and robustness, this review examines their potential benefits and discusses how developers could leverage these methods to enhance compliance with the Act. The paper also highlights areas where further research is needed to ensure reliable and safe integration of these technologies.
Title: OG-Mapping: Octree-based Structured 3D Gaussians for Online Dense Mapping
Copy Paste: [[2408.17223]] OG-Mapping: Octree-based Structured 3D Gaussians for Online Dense Mapping(https://arxiv.org/abs/2408.17223)
Keywords: robust
Abstract: 3D Gaussian splatting (3DGS) has recently demonstrated promising advancements in RGB-D online dense mapping. Nevertheless, existing methods excessively rely on per-pixel depth cues to perform map densification, which leads to significant redundancy and increased sensitivity to depth noise. Additionally, explicitly storing 3D Gaussian parameters of room-scale scene poses a significant storage challenge. In this paper, we introduce OG-Mapping, which leverages the robust scene structural representation capability of sparse octrees, combined with structured 3D Gaussian representations, to achieve efficient and robust online dense mapping. Moreover, OG-Mapping employs an anchor-based progressive map refinement strategy to recover the scene structures at multiple levels of detail. Instead of maintaining a small number of active keyframes with a fixed keyframe window as previous approaches do, a dynamic keyframe window is employed to allow OG-Mapping to better tackle false local minima and forgetting issues. Experimental results demonstrate that OG-Mapping delivers more robust and superior realism mapping results than existing Gaussian-based RGB-D online mapping methods with a compact model, and no additional post-processing is required.
Title: CondSeg: Ellipse Estimation of Pupil and Iris via Conditioned Segmentation
Authors: Zhuang Jia, Jiangfan Deng, Liying Chi, Xiang Long, Daniel K. Du
Copy Paste: [[2408.17231]] CondSeg: Ellipse Estimation of Pupil and Iris via Conditioned Segmentation(https://arxiv.org/abs/2408.17231)
Keywords: segmentation
Abstract: Parsing of eye components (i.e. pupil, iris and sclera) is fundamental for eye tracking and gaze estimation for AR/VR products. Mainstream approaches tackle this problem as a multi-class segmentation task, providing only visible part of pupil/iris, other methods regress elliptical parameters using human-annotated full pupil/iris parameters. In this paper, we consider two priors: projected full pupil/iris circle can be modelled with ellipses (ellipse prior), and the visibility of pupil/iris is controlled by openness of eye-region (condition prior), and design a novel method CondSeg to estimate elliptical parameters of pupil/iris directly from segmentation labels, without explicitly annotating full ellipses, and use eye-region mask to control the visibility of estimated pupil/iris ellipses. Conditioned segmentation loss is used to optimize the parameters by transforming parameterized ellipses into pixel-wise soft masks in a differentiable way. Our method is tested on public datasets (OpenEDS-2019/-2020) and shows competitive results on segmentation metrics, and provides accurate elliptical parameters for further applications of eye tracking simultaneously.
Title: AI-Driven Intrusion Detection Systems (IDS) on the ROAD dataset: A Comparative Analysis for automotive Controller Area Network (CAN)
Authors: Lorenzo Guerra, Linhan Xu, Pavlo Mozharovskyi, Paolo Bellavista, Thomas Chapuis, Guillaume Duc, Van-Tam Nguyen
Copy Paste: [[2408.17235]] AI-Driven Intrusion Detection Systems (IDS) on the ROAD dataset: A Comparative Analysis for automotive Controller Area Network (CAN)(https://arxiv.org/abs/2408.17235)
Keywords: security, attack, robust, steal
Abstract: The integration of digital devices in modern vehicles has revolutionized automotive technology, enhancing safety and the overall driving experience. The Controller Area Network (CAN) bus is a central system for managing in-vehicle communication between the electronic control units (ECUs). However, the CAN protocol poses security challenges due to inherent vulnerabilities, lacking encryption and authentication, which, combined with an expanding attack surface, necessitates robust security measures. In response to this challenge, numerous Intrusion Detection Systems (IDS) have been developed and deployed. Nonetheless, an open, comprehensive, and realistic dataset to test the effectiveness of such IDSs remains absent in the existing literature. This paper addresses this gap by considering the latest ROAD dataset, containing stealthy and sophisticated injections. The methodology involves dataset labelling and the implementation of both state-of-the-art deep learning models and traditional machine learning models to show the discrepancy in performance between the datasets most commonly used in the literature and the ROAD dataset, a more realistic alternative.
Title: DeTRAP: RISC-V Return Address Protection With Debug Triggers
Authors: Isaac Richter (University of Rochester), Jie Zhou (George Washington University), John Criswell (University of Rochester)
Abstract: Modern microcontroller software is often written in C/C++ and suffers from control-flow hijacking vulnerabilities. Previous mitigations suffer from high performance and memory overheads and require either the presence of memory protection hardware or sophisticated program analysis in the compiler. This paper presents DeTRAP (Debug Trigger Return Address Protection). DeTRAP utilizes a full implementation of the RISC-V debug hardware specification to provide a write-protected shadow stack for return addresses. Unlike previous work, DeTRAP requires no memory protection hardware and only minor changes to the compiler toolchain. We tested DeTRAP on an FPGA running a 32-bit RISC-V microcontroller core and found average execution time overheads to be between 0.5% and 1.9% on evaluated benchmark suites with code size overheads averaging 7.9% or less.
Title: Abstracted Gaussian Prototypes for One-Shot Concept Learning
Copy Paste: [[2408.17251]] Abstracted Gaussian Prototypes for One-Shot Concept Learning(https://arxiv.org/abs/2408.17251)
Keywords: robust, generative, segmentation
Abstract: We introduce a cluster-based generative image segmentation framework to encode higher-level representations of visual concepts based on one-shot learning inspired by the Omniglot Challenge. The inferred parameters of each component of a Gaussian Mixture Model (GMM) represent a distinct topological subpart of a visual concept. Sampling new data from these parameters generates augmented subparts to build a more robust prototype for each concept, i.e., the Abstracted Gaussian Prototype (AGP). This framework addresses one-shot classification tasks using a cognitively-inspired similarity metric and addresses one-shot generative tasks through a novel AGP-VAE pipeline employing variational autoencoders (VAEs) to generate new class variants. Results from human judges reveal that the generative pipeline produces novel examples and classes of visual concepts that are broadly indistinguishable from those made by humans. The proposed framework leads to impressive but not state-of-the-art classification accuracy; thus, the contribution is two-fold: 1) the system is uniquely low in theoretical and computational complexity and operates in a completely standalone manner compared while existing approaches draw heavily on pre-training or knowledge engineering; and 2) in contrast with competing neural network models, the AGP approach addresses the importance of breadth of task capability emphasized in the Omniglot challenge (i.e., successful performance on generative tasks). These two points are critical as we advance toward an understanding of how learning/reasoning systems can produce viable, robust, and flexible concepts based on literally nothing more than a single example.
Title: VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Authors: Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, Chenghao Liu
Copy Paste: [[2408.17253]] VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters(https://arxiv.org/abs/2408.17253)
Keywords: large language model
Abstract: Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at this https URL.
Title: Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach
Authors: Tong Nie, Junlin He, Yuewen Mei, Guoyang Qin, Guilong Li, Jian Sun, Wei Ma
Copy Paste: [[2408.17258]] Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach(https://arxiv.org/abs/2408.17258)
Keywords: large language model
Abstract: The proliferation of e-commerce and urbanization has significantly intensified delivery operations in urban areas, boosting the volume and complexity of delivery demand. Data-driven predictive methods, especially those utilizing machine learning techniques, have emerged to handle these complexities in urban delivery demand management problems. One particularly pressing problem that has not yet been sufficiently studied is the joint estimation and prediction of city-wide delivery demand. To this end, we formulate this problem as a graph-based spatiotemporal learning task. First, a message-passing neural network model is formalized to capture the interaction between demand patterns of associated regions. Second, by exploiting recent advances in large language models, we extract general geospatial knowledge encodings from the unstructured locational data and integrate them into the demand predictor. Last, to encourage the cross-city transferability of the model, an inductive training scheme is developed in an end-to-end routine. Extensive empirical results on two real-world delivery datasets, including eight cities in China and the US, demonstrate that our model significantly outperforms state-of-the-art baselines in these challenging tasks.
Title: Privacy-Preserving Set-Based Estimation Using Differential Privacy and Zonotopes
Authors: Mohammed M. Dawoud, Changxin Liu, Karl H. Johansson, Amr Alanwar
Copy Paste: [[2408.17263]] Privacy-Preserving Set-Based Estimation Using Differential Privacy and Zonotopes(https://arxiv.org/abs/2408.17263)
Keywords: privacy
Abstract: For large-scale cyber-physical systems, the collaboration of spatially distributed sensors is often needed to perform the state estimation process. Privacy concerns arise from disclosing sensitive measurements to a cloud estimator. To solve this issue, we propose a differentially private set-based estimation protocol that guarantees true state containment in the estimated set and differential privacy for the sensitive measurements throughout the set-based state estimation process within the central and local differential privacy models. Zonotopes are employed in the proposed differentially private set-based estimator, offering computational advantages in set operations. We consider a plant of a non-linear discrete-time dynamical system with bounded modeling uncertainties, sensors that provide sensitive measurements with bounded measurement uncertainties, and a cloud estimator that predicts the system's state. The privacy-preserving noise perturbs the centers of measurement zonotopes, thereby concealing the precise position of these zonotopes, i.e., ensuring privacy preservation for the sets containing sensitive measurements. Compared to existing research, our approach achieves less privacy loss and utility loss through the central and local differential privacy models by leveraging a numerically optimized truncated noise distribution. The proposed estimator is perturbed by weaker noise than the analytical approaches in the literature to guarantee the same level of privacy, therefore improving the estimation utility. Numerical and comparison experiments with truncated Laplace noise are presented to support our approach.
Title: DCUDF2: Improving Efficiency and Accuracy in Extracting Zero Level Sets from Unsigned Distance Fields
Copy Paste: [[2408.17284]] DCUDF2: Improving Efficiency and Accuracy in Extracting Zero Level Sets from Unsigned Distance Fields(https://arxiv.org/abs/2408.17284)
Keywords: robust, extraction
Abstract: Unsigned distance fields (UDFs) allow for the representation of models with complex topologies, but extracting accurate zero level sets from these fields poses significant challenges, particularly in preserving topological accuracy and capturing fine geometric details. To overcome these issues, we introduce DCUDF2, an enhancement over DCUDF--the current state-of-the-art method--for extracting zero level sets from UDFs. Our approach utilizes an accuracy-aware loss function, enhanced with self-adaptive weights, to improve geometric quality significantly. We also propose a topology correction strategy that reduces the dependence on hyper-parameter, increasing the robustness of our method. Furthermore, we develop new operations leveraging self-adaptive weights to boost runtime efficiency. Extensive experiments on surface extraction across diverse datasets demonstrate that DCUDF2 outperforms DCUDF and existing methods in both geometric fidelity and topological accuracy. We will make the source code publicly available.
Title: Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
Authors: Yixin Wu, Yun Shen, Michael Backes, Yang Zhang
Copy Paste: [[2408.17285]] Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution(https://arxiv.org/abs/2408.17285)
Keywords: diffusion
Abstract: Text-to-image models, such as Stable Diffusion (SD), undergo iterative updates to improve image quality and address concerns such as safety. Improvements in image quality are straightforward to assess. However, how model updates resolve existing concerns and whether they raise new questions remain unexplored. This study takes an initial step in investigating the evolution of text-to-image models from the perspectives of safety, bias, and authenticity. Our findings, centered on Stable Diffusion, indicate that model updates paint a mixed picture. While updates progressively reduce the generation of unsafe images, the bias issue, particularly in gender, intensifies. We also find that negative stereotypes either persist within the same Non-White race group or shift towards other Non-White race groups through SD updates, yet with minimal association of these traits with the White race group. Additionally, our evaluation reveals a new concern stemming from SD updates: State-of-the-art fake image detectors, initially trained for earlier SD versions, struggle to identify fake images generated by updated versions. We show that fine-tuning these detectors on fake images generated by updated versions achieves at least 96.6\% accuracy across various SD versions, addressing this issue. Our insights highlight the importance of continued efforts to mitigate biases and vulnerabilities in evolving text-to-image models.
Title: Hybridizing Base-Line 2D-CNN Model with Cat Swarm Optimization for Enhanced Advanced Persistent Threat Detection
Copy Paste: [[2408.17307]] Hybridizing Base-Line 2D-CNN Model with Cat Swarm Optimization for Enhanced Advanced Persistent Threat Detection(https://arxiv.org/abs/2408.17307)
Keywords: security, attack, steal
Abstract: In the realm of cyber-security, detecting Advanced Persistent Threats (APTs) remains a formidable challenge due to their stealthy and sophisticated nature. This research paper presents an innovative approach that leverages Convolutional Neural Networks (CNNs) with a 2D baseline model, enhanced by the cutting-edge Cat Swarm Optimization (CSO) algorithm, to significantly improve APT detection accuracy. By seamlessly integrating the 2D-CNN baseline model with CSO, we unlock the potential for unprecedented accuracy and efficiency in APT detection. The results unveil an impressive accuracy score of $98.4\%$, marking a significant enhancement in APT detection across various attack stages, illuminating a path forward in combating these relentless and sophisticated threats.
Title: Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations
Authors: Ahmed Hammam, Bharathwaj Krishnaswami Sreedhar, Nura Kawa, Tim Patzelt, Oliver De Candido
Copy Paste: [[2408.17311]] Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations(https://arxiv.org/abs/2408.17311)
Keywords: robust, segmentation
Abstract: Advancing Machine Learning (ML)-based perception models for autonomous systems necessitates addressing weak spots within the models, particularly in challenging Operational Design Domains (ODDs). These are environmental operating conditions of an autonomous vehicle which can contain difficult conditions, e.g., lens flare at night or objects reflected in a wet street. This report introduces a novel methodology for training with augmentations to enhance model robustness and performance in such conditions. The proposed approach leverages customized physics-based augmentation functions, to generate realistic training data that simulates diverse ODD scenarios. We present a comprehensive framework that includes identifying weak spots in ML models, selecting suitable augmentations, and devising effective training strategies. The methodology integrates hyperparameter optimization and latent space optimization to fine-tune augmentation parameters, ensuring they maximally improve the ML models' performance. Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets. Our findings emphasize that optimal training strategies are model- and data-specific and highlight the benefits of integrating augmentations into the training pipeline. By incorporating augmentations, we observe enhanced robustness of ML-based perception models, making them more resilient to edge cases encountered in real-world ODDs. This work underlines the importance of customized augmentations and offers an effective solution for improving the safety and reliability of autonomous driving functions.
Title: Fair Best Arm Identification with Fixed Confidence
Copy Paste: [[2408.17313]] Fair Best Arm Identification with Fixed Confidence(https://arxiv.org/abs/2408.17313)
Keywords: fair
Abstract: In this work, we present a novel framework for Best Arm Identification (BAI) under fairness constraints, a setting that we refer to as \textit{F-BAI} (fair BAI). Unlike traditional BAI, which solely focuses on identifying the optimal arm with minimal sample complexity, F-BAI also includes a set of fairness constraints. These constraints impose a lower limit on the selection rate of each arm and can be either model-agnostic or model-dependent. For this setting, we establish an instance-specific sample complexity lower bound and analyze the \textit{price of fairness}, quantifying how fairness impacts sample complexity. Based on the sample complexity lower bound, we propose F-TaS, an algorithm provably matching the sample complexity lower bound, while ensuring that the fairness constraints are satisfied. Numerical results, conducted using both a synthetic model and a practical wireless scheduling application, show the efficiency of F-TaS in minimizing the sample complexity while achieving low fairness violations.
Title: Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
Authors: Nicholas Pochinkov, Ben Pasero, Skylar Shibayama
Copy Paste: [[2408.17322]] Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering(https://arxiv.org/abs/2408.17322)
Keywords: interpretability, transformer
Abstract: The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at this https URL.
Title: Modularity in Transformers: Investigating Neuron Separability & Specialization
Authors: Nicholas Pochinkov, Thomas Jones, Mohammed Rashidur Rahman
Abstract: Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets. Our findings reveal evidence of task-specific neuron clusters, with varying degrees of overlap between related tasks. We observe that neuron importance patterns persist to some extent even in randomly initialized models, suggesting an inherent structure that training refines. Additionally, we find that neuron clusters identified through MoEfication correspond more strongly to task-specific neurons in earlier and later layers of the models. This work contributes to a more nuanced understanding of transformer internals and offers insights into potential avenues for improving model interpretability and efficiency.
Title: Impact of ChatGPT on the writing style of condensed matter physicists
Authors: Shaojun Xu, Xiaohui Ye, Mengqi Zhang, Pei Wang
Copy Paste: [[2408.17325]] Impact of ChatGPT on the writing style of condensed matter physicists(https://arxiv.org/abs/2408.17325)
Keywords: robust
Abstract: We apply a state-of-the-art difference-in-differences approach to estimate the impact of ChatGPT's release on the writing style of condensed matter papers on arXiv. Our analysis reveals a statistically significant improvement in the English quality of abstracts written by non-native English speakers. Importantly, this improvement remains robust even after accounting for other potential factors, confirming that it can be attributed to the release of ChatGPT. This indicates widespread adoption of the tool. Following the release of ChatGPT, there is a significant increase in the use of unique words, while the frequency of rare words decreases. Across language families, the changes in writing style are significant for authors from the Latin and Ural-Altaic groups, but not for those from the Germanic or other Indo-European groups.
Title: LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation
Copy Paste: [[2408.17347]] LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation(https://arxiv.org/abs/2408.17347)
Keywords: robust, segmentation
Abstract: Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1)~a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2)~a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms on all datasets with lower computational costs. The code and datasets will be released.
Title: Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage
Authors: Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Shagufta Mehnaz, Ye Wang
Copy Paste: [[2408.17354]] Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage(https://arxiv.org/abs/2408.17354)
Keywords: privacy, attack, extraction, membership infer, large language model
Abstract: Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.
Title: C-RADAR: A Centralized Deep Learning System for Intrusion Detection in Software Defined Networks
Copy Paste: [[2408.17356]] C-RADAR: A Centralized Deep Learning System for Intrusion Detection in Software Defined Networks(https://arxiv.org/abs/2408.17356)
Keywords: security, attack
Abstract: The popularity of Software Defined Networks (SDNs) has grown in recent years, mainly because of their ability to simplify network management and improve network flexibility. However, this also makes them vulnerable to various types of cyber attacks. SDNs work on a centralized control plane which makes them more prone to network attacks. Research has demonstrated that deep learning (DL) methods can be successful in identifying intrusions in conventional networks, but their application in SDNs is still an open research area. In this research, we propose the use of DL techniques for intrusion detection in SDNs. We measure the effectiveness of our method by experimentation on a dataset of network traffic and comparing it to existing techniques. Our results show that the DL-based approach outperforms traditional methods in terms of detection accuracy and computational efficiency. The deep learning architecture that has been used in this research is a Long Short Term Memory Network and Self-Attention based architecture i.e. LSTM-Attn which achieves an Fl-score of 0.9721. Furthermore, this technique can be trained to detect new attack patterns and improve the overall security of SDNs.
Title: Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain
Copy Paste: [[2408.17362]] Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain(https://arxiv.org/abs/2408.17362)
Keywords: transformer, generative, large language model
Abstract: This paper examines the performance of two Large Language Models (LLMs), GPT3.5 and Llama2 and one Small Language Model (SLM) Gemma, across three different classification tasks within the climate change (CC) and environmental domain. Employing BERT-based models as a baseline, we compare their efficacy against these transformer-based models. Additionally, we assess the models' self-evaluation capabilities by analyzing the calibration of verbalized confidence scores in these text classification tasks. Our findings reveal that while BERT-based models generally outperform both the LLMs and SLM, the performance of the large generative models is still noteworthy. Furthermore, our calibration analysis reveals that although Gemma is well-calibrated in initial tasks, it thereafter produces inconsistent results; Llama is reasonably calibrated, and GPT consistently exhibits strong calibration. Through this research, we aim to contribute to the ongoing discussion on the utility and effectiveness of generative LMs in addressing some of the planet's most urgent issues, highlighting their strengths and limitations in the context of ecology and CC.
Title: Look, Learn and Leverage (L$^3$): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment
Copy Paste: [[2408.17363]] Look, Learn and Leverage (L$^3$): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment(https://arxiv.org/abs/2408.17363)
Keywords: segmentation
Abstract: Modern deep learning models have demonstrated outstanding performance on discovering the underlying mechanisms when both visual appearance and intrinsic relations (e.g., causal structure) data are sufficient, such as Disentangled Representation Learning (DRL), Causal Representation Learning (CRL) and Visual Question Answering (VQA) methods. However, generalization ability of these models is challenged when the visual domain shifts and the relations data is absent during finetuning. To address this challenge, we propose a novel learning framework, Look, Learn and Leverage (L$^3$), which decomposes the learning process into three distinct phases and systematically utilize the class-agnostic segmentation masks as the common symbolic space to align visual domains. Thus, a relations discovery model can be trained on the source domain, and when the visual domain shifts and the intrinsic relations are absent, the pretrained relations discovery model can be directly reused and maintain a satisfactory performance. Extensive performance evaluations are conducted on three different tasks: DRL, CRL and VQA, and show outstanding results on all three tasks, which reveals the advantages of L$^3$.
Title: Leveraging Graph Neural Networks to Forecast Electricity Consumption
Abstract: Accurate electricity demand forecasting is essential for several reasons, especially as the integration of renewable energy sources and the transition to a decentralized network paradigm introduce greater complexity and uncertainty. The proposed methodology leverages graph-based representations to effectively capture the spatial distribution and relational intricacies inherent in this decentralized network structure. This research work offers a novel approach that extends beyond the conventional Generalized Additive Model framework by considering models like Graph Convolutional Networks or Graph SAGE. These graph-based models enable the incorporation of various levels of interconnectedness and information sharing among nodes, where each node corresponds to the combined load (i.e. consumption) of a subset of consumers (e.g. the regions of a country). More specifically, we introduce a range of methods for inferring graphs tailored to consumption forecasting, along with a framework for evaluating the developed models in terms of both performance and explainability. We conduct experiments on electricity forecasting, in both a synthetic and a real framework considering the French mainland regions, and the performance and merits of our approach are discussed.
Title: NDP: Next Distribution Prediction as a More Broad Target
Copy Paste: [[2408.17377]] NDP: Next Distribution Prediction as a More Broad Target(https://arxiv.org/abs/2408.17377)
Keywords: large language model
Abstract: Large language models (LLMs) trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the $n$-gram distribution and the one-hot distribution with LLMs, we observed that the $n$-gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general task, language transfer, and medical domain adaptation. Compared to NTP, NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem, pointing to a new direction for future work on improving NTP.
Title: Fairness-Aware Estimation of Graphical Models
Copy Paste: [[2408.17396]] Fairness-Aware Estimation of Graphical Models(https://arxiv.org/abs/2408.17396)
Keywords: protect, fair
Abstract: This paper examines the issue of fairness in the estimation of graphical models (GMs), particularly Gaussian, Covariance, and Ising models. These models play a vital role in understanding complex relationships in high-dimensional data. However, standard GMs can result in biased outcomes, especially when the underlying data involves sensitive characteristics or protected groups. To address this, we introduce a comprehensive framework designed to reduce bias in the estimation of GMs related to protected attributes. Our approach involves the integration of the pairwise graph disparity error and a tailored loss function into a nonsmooth multi-objective optimization problem, striving to achieve fairness across different sensitive groups while maintaining the effectiveness of the GMs. Experimental evaluations on synthetic and real-world datasets demonstrate that our framework effectively mitigates bias without undermining GMs' performance.
Title: How Knowledge Distillation Mitigates the Synthetic Gap in Fair Face Recognition
Authors: Pedro C. Neto, Ivona Colakovic, Sašo Karakatič, Ana F. Sequeira
Copy Paste: [[2408.17399]] How Knowledge Distillation Mitigates the Synthetic Gap in Fair Face Recognition(https://arxiv.org/abs/2408.17399)
Keywords: fair
Abstract: Leveraging the capabilities of Knowledge Distillation (KD) strategies, we devise a strategy to fight the recent retraction of face recognition datasets. Given a pretrained Teacher model trained on a real dataset, we show that carefully utilising synthetic datasets, or a mix between real and synthetic datasets to distil knowledge from this teacher to smaller students can yield surprising results. In this sense, we trained 33 different models with and without KD, on different datasets, with different architectures and losses. And our findings are consistent, using KD leads to performance gains across all ethnicities and decreased bias. In addition, it helps to mitigate the performance gap between real and synthetic datasets. This approach addresses the limitations of synthetic data training, improving both the accuracy and fairness of face recognition models.
Title: CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion
Authors: Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing Ma, Zeyu Wang, Hui Xiong, Bo Dai
Copy Paste: [[2408.17424]] CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion(https://arxiv.org/abs/2408.17424)
Keywords: diffusion, generative
Abstract: With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visual previsualization system enhanced with engine-powered diffusion. It features a novel camera and storyboard interface that offers dynamic control, from global to local camera adjustments. This is combined with a user-friendly AI rendering workflow, which aims to achieve consistent results through multi-masked IP-Adapter and engine simulation guidelines. In our comprehensive evaluation study, we demonstrate that our system reduces development viscosity (i.e., the complexity and challenges in the development process), meets users' needs for extensive control and iteration in the design process, and outperforms other AI video production workflows in cinematic camera movement, as shown by our experiments and a within-subjects user study. With its intuitive camera controls and realistic rendering of camera motion, CinePreGen shows great potential for improving video production for both individual creators and industry professionals.
Title: CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models
Copy Paste: [[2408.17428]] CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models(https://arxiv.org/abs/2408.17428)
Keywords: transformer
Abstract: The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
Title: DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model
Authors: Mona Sheikh Zeinoddin, Chiara Lena, Jiongqi Qu, Luca Carlini, Mattia Magro, Seunghoi Kim, Elena De Momi, Sophia Bano, Matthew Grech-Sollars, Evangelos Mazomenos, Daniel C. Alexander, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
Copy Paste: [[2408.17433]] DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model(https://arxiv.org/abs/2408.17433)
Keywords: robust
Abstract: Robotic-assisted surgery (RAS) relies on accurate depth estimation for 3D reconstruction and visualization. While foundation models like Depth Anything Models (DAM) show promise, directly applying them to surgery often yields suboptimal results. Fully fine-tuning on limited surgical data can cause overfitting and catastrophic forgetting, compromising model robustness and generalization. Although Low-Rank Adaptation (LoRA) addresses some adaptation issues, its uniform parameter distribution neglects the inherent feature hierarchy, where earlier layers, learning more general features, require more parameters than later ones. To tackle this issue, we introduce Depth Anything in Robotic Endoscopic Surgery (DARES), a novel approach that employs a new adaptation technique, Vector Low-Rank Adaptation (Vector-LoRA) on the DAM V2 to perform self-supervised monocular depth estimation in RAS scenes. To enhance learning efficiency, we introduce Vector-LoRA by integrating more parameters in earlier layers and gradually decreasing parameters in later layers. We also design a reprojection loss based on the multi-scale SSIM error to enhance depth perception by better tailoring the foundation model to the specific requirements of the surgical environment. The proposed method is validated on the SCARED dataset and demonstrates superior performance over recent state-of-the-art self-supervised monocular depth estimation techniques, achieving an improvement of 13.3% in the absolute relative error metric. The code and pre-trained weights are available at this https URL.
Title: SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Copy Paste: [[2408.17437]] SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists(https://arxiv.org/abs/2408.17437)
Keywords: large language model
Abstract: Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks. We share our code in this https URL.