secure

Title: A Comprehensive Analysis of Blockchain Applications for Securing Computer Vision Systems. (arXiv:2307.06659v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06659
Code URL: null
Copy Paste: [[2307.06659] A Comprehensive Analysis of Blockchain Applications for Securing Computer Vision Systems](http://arxiv.org/abs/2307.06659) #secure
Summary:
Blockchain (BC) and Computer Vision (CV) are the two emerging fields with the potential to transform various sectors.The ability of BC can help in offering decentralized and secure data storage, while CV allows machines to learn and understand visual data. This integration of the two technologies holds massive promise for developing innovative applications that can provide solutions to the challenges in various sectors such as supply chain management, healthcare, smart cities, and defense. This review explores a comprehensive analysis of the integration of BC and CV by examining their combination and potential applications. It also provides a detailed analysis of the fundamental concepts of both technologies, highlighting their strengths and limitations. This paper also explores current research efforts that make use of the benefits offered by this combination. The effort includes how BC can be used as an added layer of security in CV systems and also ensure data integrity, enabling decentralized image and video analytics using BC. The challenges and open issues associated with this integration are also identified, and appropriate potential future directions are also proposed.

Title: SecureFalcon: The Next Cyber Reasoning System for Cyber Security. (arXiv:2307.06616v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06616
Code URL: null
Copy Paste: [[2307.06616] SecureFalcon: The Next Cyber Reasoning System for Cyber Security](http://arxiv.org/abs/2307.06616) #secure
Summary:
Software vulnerabilities leading to various detriments such as crashes, data loss, and security breaches, significantly hinder the quality, affecting the market adoption of software applications and systems. Although traditional methods such as automated software testing, fault localization, and repair have been intensively studied, static analysis tools are most commonly used and have an inherent false positives rate, posing a solid challenge to developer productivity. Large Language Models (LLMs) offer a promising solution to these persistent issues. Among these, FalconLLM has shown substantial potential in identifying intricate patterns and complex vulnerabilities, hence crucial in software vulnerability detection. In this paper, for the first time, FalconLLM is being fine-tuned for cybersecurity applications, thus introducing SecureFalcon, an innovative model architecture built upon FalconLLM. SecureFalcon is trained to differentiate between vulnerable and non-vulnerable C code samples. We build a new training dataset, FormAI, constructed thanks to Generative Artificial Intelligence (AI) and formal verification to evaluate its performance. SecureFalcon achieved an impressive 94% accuracy rate in detecting software vulnerabilities, emphasizing its significant potential to redefine software vulnerability detection methods in cybersecurity.

security

Title: Maximizing Penetration Testing Success with Effective Reconnaissance Techniques using ChatGPT. (arXiv:2307.06391v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06391
Code URL: null
Copy Paste: [[2307.06391] Maximizing Penetration Testing Success with Effective Reconnaissance Techniques using ChatGPT](http://arxiv.org/abs/2307.06391) #security
Summary:
ChatGPT is a generative pretrained transformer language model created using artificial intelligence implemented as chatbot which can provide very detailed responses to a wide variety of questions. As a very contemporary phenomenon, this tool has a wide variety of potential use cases that have yet to be explored. With the significant extent of information on a broad assortment of potential topics, ChatGPT could add value to many information security uses cases both from an efficiency perspective as well as to offer another source of security information that could be used to assist with securing Internet accessible assets of organizations. One information security practice that could benefit from ChatGPT is the reconnaissance phase of penetration testing. This research uses a case study methodology to explore and investigate the uses of ChatGPT in obtaining valuable reconnaissance data. ChatGPT is able to provide many types of intel regarding targeted properties which includes Internet Protocol (IP) address ranges, domain names, network topology, vendor technologies, SSL/TLS ciphers, ports & services, and operating systems used by the target. The reconnaissance information can then be used during the planning phase of a penetration test to determine the tactics, tools, and techniques to guide the later phases of the penetration test in order to discover potential risks such as unpatched software components and security misconfiguration related issues. The study provides insights into how artificial intelligence language models can be used in cybersecurity and contributes to the advancement of penetration testing techniques.

Keywords: ChatGPT, Penetration Testing, Reconnaissance

Title: Benchmarking the Security Protocol and Data Model (SPDM) for component authentication. (arXiv:2307.06456v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06456
Code URL: null
Copy Paste: [[2307.06456] Benchmarking the Security Protocol and Data Model (SPDM) for component authentication](http://arxiv.org/abs/2307.06456) #security
Summary:
Efforts to secure computing systems via software traditionally focus on the operating system and application levels. In contrast, the Security Protocol and Data Model (SPDM) tackles firmware level security challenges, which are much harder (if at all possible) to detect with regular protection software. SPDM includes key features like enabling peripheral authentication, authenticated hardware measurements retrieval, and secure session establishment. Since SPDM is a relatively recent proposal, there is a lack of studies evaluating its performance impact on real-world applications. In this article, we address this gap by: (1) implementing the protocol on a simple virtual device, and then investigating the overhead introduced by each SDPM message; and (2) creating an SPDM-capable virtual hard drive based on VirtIO, and comparing the resulting read/write performance with a regular, unsecured implementation. Our results suggest that SPDM bootstrap time takes the order of tens of milliseconds, while the toll of introducing SPDM on hard drive communication highly depends on specific workload patterns. For example, for mixed random read/write operations, the slowdown is negligible in comparison to the baseline unsecured setup. Conversely, for sequential read or write operations, the data encryption process becomes the bottleneck, reducing the performance indicators by several orders of magnitude.

Title: Migrating to Post-Quantum Cryptography: a Framework Using Security Dependency Analysis. (arXiv:2307.06520v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06520
Code URL: null
Copy Paste: [[2307.06520] Migrating to Post-Quantum Cryptography: a Framework Using Security Dependency Analysis](http://arxiv.org/abs/2307.06520) #security
Summary:
Quantum computing is emerging as an unprecedented threat to the current state of widely used cryptographic systems. Cryptographic methods that have been considered secure for decades will likely be broken, with enormous impact on the security of sensitive data and communications in enterprises worldwide. A plan to migrate to quantum-resistant cryptographic systems is required. However, migrating an enterprise system to ensure a quantum-safe state is a complex process. Enterprises will require systematic guidance to perform this migration to remain resilient in a post-quantum era, as many organisations do not have staff with the expertise to manage this process unaided. This paper presents a comprehensive framework designed to aid enterprises in their migration. The framework articulates key steps and technical considerations in the cryptographic migration process. It makes use of existing organisational inventories and provides a roadmap for prioritising the replacement of cryptosystems in a post-quantum context. The framework enables the efficient identification of cryptographic objects, and can be integrated with other frameworks in enterprise settings to minimise operational disruption during migration. Practical case studies are included to demonstrate the utility and efficacy of the proposed framework using graph theoretic techniques to determine and evaluate cryptographic dependencies.

Title: DAXiot: A Decentralized Authentication and Authorization Scheme for Dynamic IoT Networks. (arXiv:2307.06919v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06919
Code URL: null
Copy Paste: [[2307.06919] DAXiot: A Decentralized Authentication and Authorization Scheme for Dynamic IoT Networks](http://arxiv.org/abs/2307.06919) #security
Summary:
Federated and decentralized networks supporting frequently changing system participants are a requirement for future Internet of Things (IoT) use cases. IoT devices and networks often lack adequate authentication and authorization mechanisms, resulting in insufficient privacy for entities in such systems. In this work we address both issues by designing a privacy preserving challenge-response style authentication and authorization scheme based on Decentralized Identifiers and Verifiable Credentials. Our solution allows a decentralized permission management of frequently changing network participants and supports authenticated encryption for data confidentiality. We demonstrate our solution in an MQTT 5.0 scenario and evaluate its security, privacy guarantees, and performance.

Title: PHOENI2X -- A European Cyber Resilience Framework With Artificial-Intelligence-Assisted Orchestration, Automation and Response Capabilities for Business Continuity and Recovery, Incident Response, and Information Exchange. (arXiv:2307.06932v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06932
Code URL: null
Copy Paste: [[2307.06932] PHOENI2X -- A European Cyber Resilience Framework With Artificial-Intelligence-Assisted Orchestration, Automation and Response Capabilities for Business Continuity and Recovery, Incident Response, and Information Exchange](http://arxiv.org/abs/2307.06932) #security
Summary:
As digital technologies become more pervasive in society and the economy, cybersecurity incidents become more frequent and impactful. According to the NIS and NIS2 Directives, EU Member States and their Operators of Essential Services must establish a minimum baseline set of cybersecurity capabilities and engage in cross-border coordination and cooperation. However, this is only a small step towards European cyber resilience. In this landscape, preparedness, shared situational awareness, and coordinated incident response are essential for effective cyber crisis management and resilience. Motivated by the above, this paper presents PHOENI2X, an EU-funded project aiming to design, develop, and deliver a Cyber Resilience Framework providing Artificial-Intelligence-assisted orchestration, automation and response capabilities for business continuity and recovery, incident response, and information exchange, tailored to the needs of Operators of Essential Services and the EU Member State authorities entrusted with cybersecurity.

Title: Online Distributed Learning with Quantized Finite-Time Coordination. (arXiv:2307.06620v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06620
Code URL: null
Copy Paste: [[2307.06620] Online Distributed Learning with Quantized Finite-Time Coordination](http://arxiv.org/abs/2307.06620) #security
Summary:
In this paper we consider online distributed learning problems. Online distributed learning refers to the process of training learning models on distributed data sources. In our setting a set of agents need to cooperatively train a learning model from streaming data. Differently from federated learning, the proposed approach does not rely on a central server but only on peer-to-peer communications among the agents. This approach is often used in scenarios where data cannot be moved to a centralized location due to privacy, security, or cost reasons. In order to overcome the absence of a central server, we propose a distributed algorithm that relies on a quantized, finite-time coordination protocol to aggregate the locally trained models. Furthermore, our algorithm allows for the use of stochastic gradients during local training. Stochastic gradients are computed using a randomly sampled subset of the local training data, which makes the proposed algorithm more efficient and scalable than traditional gradient descent. In our paper, we analyze the performance of the proposed algorithm in terms of the mean distance from the online solution. Finally, we present numerical results for a logistic regression task.

privacy

Title: To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems?. (arXiv:2307.06708v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06708
Code URL: null
Copy Paste: [[2307.06708] To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems?](http://arxiv.org/abs/2307.06708) #privacy
Summary:
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, privacy budget $\varepsilon$ that governs the strength of privacy protection, remains largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what $\varepsilon$ thresholds would lead lay people to be willing to share sensitive textual data - to our knowledge, the first study of its kind.

Title: Deploying ZKP Frameworks with Real-World Data: Challenges and Proposed Solutions. (arXiv:2307.06408v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06408
Code URL: null
Copy Paste: [[2307.06408] Deploying ZKP Frameworks with Real-World Data: Challenges and Proposed Solutions](http://arxiv.org/abs/2307.06408) #privacy
Summary:
Zero-knowledge proof (ZKP) frameworks have the potential to revolutionize the handling of sensitive data in various domains. However, deploying ZKP frameworks with real-world data presents several challenges, including scalability, usability, and interoperability. In this project, we present Fact Fortress, an end-to-end framework for designing and deploying zero-knowledge proofs of general statements. Our solution leverages proofs of data provenance and auditable data access policies to ensure the trustworthiness of how sensitive data is handled and provide assurance of the computations that have been performed on it. ZKP is mostly associated with blockchain technology, where it enhances transaction privacy and scalability through rollups, addressing the data inherent to the blockchain. Our approach focuses on safeguarding the privacy of data external to the blockchain, with the blockchain serving as publicly auditable infrastructure to verify the validity of ZK proofs and track how data access has been granted without revealing the data itself. Additionally, our framework provides high-level abstractions that enable developers to express complex computations without worrying about the underlying arithmetic circuits and facilitates the deployment of on-chain verifiers. Although our approach demonstrated fair scalability for large datasets, there is still room for improvement, and further work is needed to enhance its scalability. By enabling on-chain verification of computation and data provenance without revealing any information about the data itself, our solution ensures the integrity of the computations on the data while preserving its privacy.

Title: Privacy-Utility Trade-offs in Neural Networks for Medical Population Graphs: Insights from Differential Privacy and Graph Structure. (arXiv:2307.06760v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06760
Code URL: null
Copy Paste: [[2307.06760] Privacy-Utility Trade-offs in Neural Networks for Medical Population Graphs: Insights from Differential Privacy and Graph Structure](http://arxiv.org/abs/2307.06760) #privacy
Summary:
We initiate an empirical investigation into differentially private graph neural networks on population graphs from the medical domain by examining privacy-utility trade-offs at different privacy levels on both real-world and synthetic datasets and performing auditing through membership inference attacks. Our findings highlight the potential and the challenges of this specific DP application area. Moreover, we find evidence that the underlying graph structure constitutes a potential factor for larger performance gaps by showing a correlation between the degree of graph homophily and the accuracy of the trained model.

Title: Data Behind the Walls An Advanced Architecture for Data Privacy Management. (arXiv:2307.06779v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06779
Code URL: null
Copy Paste: [[2307.06779] Data Behind the Walls An Advanced Architecture for Data Privacy Management](http://arxiv.org/abs/2307.06779) #privacy
Summary:
In today's highly connected society, we are constantly asked to provide personal information to retailers, voter surveys, medical professionals, and other data collection efforts. The collected data is stored in large data warehouses. Organisations and statistical agencies share and use this data to facilitate research in public health, economics, sociology, etc. However, this data contains sensitive information about individuals, which can result in identity theft, financial loss, stress and depression, embarrassment, abuse, etc. Therefore, one must ensure rigorous management of individuals' privacy. We propose, an advanced data privacy management architecture composed of three layers. The data management layer consists of de-identification and anonymisation, the access management layer for re-enforcing data access based on the concepts of Role-Based Access Control and the Chinese Wall Security Policy, and the roles layer for regulating different users. The proposed system architecture is validated on healthcare datasets.

protect

Title: Differentially Private Decoupled Graph Convolutions for Multigranular Topology Protection. (arXiv:2307.06422v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06422
Code URL: null
Copy Paste: [[2307.06422] Differentially Private Decoupled Graph Convolutions for Multigranular Topology Protection](http://arxiv.org/abs/2307.06422) #protect
Summary:
Graph learning methods, such as Graph Neural Networks (GNNs) based on graph convolutions, are highly successful in solving real-world learning problems involving graph-structured data. However, graph learning methods expose sensitive user information and interactions not only through their model parameters but also through their model predictions. Consequently, standard Differential Privacy (DP) techniques that merely offer model weight privacy are inadequate. This is especially the case for node predictions that leverage neighboring node attributes directly via graph convolutions that create additional risks of privacy leakage. To address this problem, we introduce Graph Differential Privacy (GDP), a new formal DP framework tailored to graph learning settings that ensures both provably private model parameters and predictions. Furthermore, since there may be different privacy requirements for the node attributes and graph structure, we introduce a novel notion of relaxed node-level data adjacency. This relaxation can be used for establishing guarantees for different degrees of graph topology privacy while maintaining node attribute privacy. Importantly, this relaxation reveals a useful trade-off between utility and topology privacy for graph learning methods. In addition, our analysis of GDP reveals that existing DP-GNNs fail to exploit this trade-off due to the complex interplay between graph topology and attribute data in standard graph convolution designs. To mitigate this problem, we introduce the Differentially Private Decoupled Graph Convolution (DPDGC) model, which benefits from decoupled graph convolution while providing GDP guarantees. Extensive experiments on seven node classification benchmarking datasets demonstrate the superior privacy-utility trade-off of DPDGC over existing DP-GNNs based on standard graph convolution design.

defense

attack

Title: Single-Class Target-Specific Attack against Interpretable Deep Learning Systems. (arXiv:2307.06484v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06484
Code URL: null
Copy Paste: [[2307.06484] Single-Class Target-Specific Attack against Interpretable Deep Learning Systems](http://arxiv.org/abs/2307.06484) #attack
Summary:
In this paper, we present a novel Single-class target-specific Adversarial attack called SingleADV. The goal of SingleADV is to generate a universal perturbation that deceives the target model into confusing a specific category of objects with a target category while ensuring highly relevant and accurate interpretations. The universal perturbation is stochastically and iteratively optimized by minimizing the adversarial loss that is designed to consider both the classifier and interpreter costs in targeted and non-targeted categories. In this optimization framework, ruled by the first- and second-moment estimations, the desired loss surface promotes high confidence and interpretation score of adversarial samples. By avoiding unintended misclassification of samples from other categories, SingleADV enables more effective targeted attacks on interpretable deep learning systems in both white-box and black-box scenarios. To evaluate the effectiveness of SingleADV, we conduct experiments using four different model architectures (ResNet-50, VGG-16, DenseNet-169, and Inception-V3) coupled with three interpretation models (CAM, Grad, and MASK). Through extensive empirical evaluation, we demonstrate that SingleADV effectively deceives the target deep learning models and their associated interpreters under various conditions and settings. Our experimental results show that the performance of SingleADV is effective, with an average fooling ratio of 0.74 and an adversarial confidence level of 0.78 in generating deceptive adversarial samples. Furthermore, we discuss several countermeasures against SingleADV, including a transfer-based learning approach and existing preprocessing defenses.

Title: Microbial Genetic Algorithm-based Black-box Attack against Interpretable Deep Learning Systems. (arXiv:2307.06496v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06496
Code URL: null
Copy Paste: [[2307.06496] Microbial Genetic Algorithm-based Black-box Attack against Interpretable Deep Learning Systems](http://arxiv.org/abs/2307.06496) #attack
Summary:
Deep learning models are susceptible to adversarial samples in white and black-box environments. Although previous studies have shown high attack success rates, coupling DNN models with interpretation models could offer a sense of security when a human expert is involved, who can identify whether a given sample is benign or malicious. However, in white-box environments, interpretable deep learning systems (IDLSes) have been shown to be vulnerable to malicious manipulations. In black-box settings, as access to the components of IDLSes is limited, it becomes more challenging for the adversary to fool the system. In this work, we propose a Query-efficient Score-based black-box attack against IDLSes, QuScore, which requires no knowledge of the target model and its coupled interpretation model. QuScore is based on transfer-based and score-based methods by employing an effective microbial genetic algorithm. Our method is designed to reduce the number of queries necessary to carry out successful attacks, resulting in a more efficient process. By continuously refining the adversarial samples created based on feedback scores from the IDLS, our approach effectively navigates the search space to identify perturbations that can fool the system. We evaluate the attack's effectiveness on four CNN models (Inception, ResNet, VGG, DenseNet) and two interpretation models (CAM, Grad), using both ImageNet and CIFAR datasets. Our results show that the proposed approach is query-efficient with a high attack success rate that can reach between 95% and 100% and transferability with an average success rate of 69% in the ImageNet and CIFAR datasets. Our attack method generates adversarial examples with attribution maps that resemble benign samples. We have also demonstrated that our attack is resilient against various preprocessing defense techniques and can easily be transferred to different DNN models.

Title: Multi-objective Evolutionary Search of Variable-length Composite Semantic Perturbations. (arXiv:2307.06548v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06548
Code URL: null
Copy Paste: [[2307.06548] Multi-objective Evolutionary Search of Variable-length Composite Semantic Perturbations](http://arxiv.org/abs/2307.06548) #attack
Summary:
Deep neural networks have proven to be vulnerable to adversarial attacks in the form of adding specific perturbations on images to make wrong outputs. Designing stronger adversarial attack methods can help more reliably evaluate the robustness of DNN models. To release the harbor burden and improve the attack performance, auto machine learning (AutoML) has recently emerged as one successful technique to help automatically find the near-optimal adversarial attack strategy. However, existing works about AutoML for adversarial attacks only focus on $L_{\infty}$-norm-based perturbations. In fact, semantic perturbations attract increasing attention due to their naturalnesses and physical realizability. To bridge the gap between AutoML and semantic adversarial attacks, we propose a novel method called multi-objective evolutionary search of variable-length composite semantic perturbations (MES-VCSP). Specifically, we construct the mathematical model of variable-length composite semantic perturbations, which provides five gradient-based semantic attack methods. The same type of perturbation in an attack sequence is allowed to be performed multiple times. Besides, we introduce the multi-objective evolutionary search consisting of NSGA-II and neighborhood search to find near-optimal variable-length attack sequences. Experimental results on CIFAR10 and ImageNet datasets show that compared with existing methods, MES-VCSP can obtain adversarial examples with a higher attack success rate, more naturalness, and less time cost.

Title: Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success. (arXiv:2307.06865v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06865
Code URL: null
Copy Paste: [[2307.06865] Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success](http://arxiv.org/abs/2307.06865) #attack
Summary:
The generations of large language models are commonly controlled through prompting techniques, where a user's query to the model is prefixed with a prompt that aims to guide the model's behaviour on the query. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold. However, there has been anecdotal evidence showing that the prompts can be extracted by a user even when they are kept secret. In this paper, we present a framework for systematically measuring the success of prompt extraction attacks. In experiments with multiple sources of prompts and multiple underlying language models, we find that simple text-based attacks can in fact reveal prompts with high probability.

Title: Introducing Foundation Models as Surrogate Models: Advancing Towards More Practical Adversarial Attacks. (arXiv:2307.06608v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06608
Code URL: null
Copy Paste: [[2307.06608] Introducing Foundation Models as Surrogate Models: Advancing Towards More Practical Adversarial Attacks](http://arxiv.org/abs/2307.06608) #attack
Summary:
Recently, the no-box adversarial attack, in which the attacker lacks access to the model's architecture, weights, and training data, become the most practical and challenging attack setup. However, there is an unawareness of the potential and flexibility inherent in the surrogate model selection process on no-box setting. Inspired by the burgeoning interest in utilizing foundational models to address downstream tasks, this paper adopts an innovative idea that 1) recasting adversarial attack as a downstream task. Specifically, image noise generation to meet the emerging trend and 2) introducing foundational models as surrogate models. Harnessing the concept of non-robust features, we elaborate on two guiding principles for surrogate model selection to explain why the foundational model is an optimal choice for this role. However, paradoxically, we observe that these foundational models underperform. Analyzing this unexpected behavior within the feature space, we attribute the lackluster performance of foundational models (e.g., CLIP) to their significant representational capacity and, conversely, their lack of discriminative prowess. To mitigate this issue, we propose the use of a margin-based loss strategy for the fine-tuning of foundational models on target images. The experimental results verify that our approach, which employs the basic Fast Gradient Sign Method (FGSM) attack algorithm, outstrips the performance of other, more convoluted algorithms. We conclude by advocating for the research community to consider surrogate models as crucial determinants in the effectiveness of adversarial attacks in no-box settings. The implications of our work bear relevance for improving the efficacy of such adversarial attacks and the overall robustness of AI systems.

robust

Title: WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset and Benchmark for Autonomous Driving on Water Surfaces. (arXiv:2307.06505v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06505
Code URL: null
Copy Paste: [[2307.06505] WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset and Benchmark for Autonomous Driving on Water Surfaces](http://arxiv.org/abs/2307.06505) #robust
Summary:
Autonomous driving on water surfaces plays an essential role in executing hazardous and time-consuming missions, such as maritime surveillance, survivors rescue, environmental monitoring, hydrography mapping and waste cleaning. This work presents WaterScenes, the first multi-task 4D radar-camera fusion dataset for autonomous driving on water surfaces. Equipped with a 4D radar and a monocular camera, our Unmanned Surface Vehicle (USV) proffers all-weather solutions for discerning object-related information, including color, shape, texture, range, velocity, azimuth, and elevation. Focusing on typical static and dynamic objects on water surfaces, we label the camera images and radar point clouds at pixel-level and point-level, respectively. In addition to basic perception tasks, such as object detection, instance segmentation and semantic segmentation, we also provide annotations for free-space segmentation and waterline segmentation. Leveraging the multi-task and multi-modal data, we conduct numerous experiments on the single modality of radar and camera, as well as the fused modalities. Results demonstrate that 4D radar-camera fusion can considerably enhance the robustness of perception on water surfaces, especially in adverse lighting and weather conditions. WaterScenes dataset is public on https://waterscenes.github.io.

Title: MPR-Net:Multi-Scale Pattern Reproduction Guided Universality Time Series Interpretable Forecasting. (arXiv:2307.06736v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06736
Code URL: null
Copy Paste: [[2307.06736] MPR-Net:Multi-Scale Pattern Reproduction Guided Universality Time Series Interpretable Forecasting](http://arxiv.org/abs/2307.06736) #robust
Summary:
Time series forecasting has received wide interest from existing research due to its broad applications and inherent challenging. The research challenge lies in identifying effective patterns in historical series and applying them to future forecasting. Advanced models based on point-wise connected MLP and Transformer architectures have strong fitting power, but their secondary computational complexity limits practicality. Additionally, those structures inherently disrupt the temporal order, reducing the information utilization and making the forecasting process uninterpretable. To solve these problems, this paper proposes a forecasting model, MPR-Net. It first adaptively decomposes multi-scale historical series patterns using convolution operation, then constructs a pattern extension forecasting method based on the prior knowledge of pattern reproduction, and finally reconstructs future patterns into future series using deconvolution operation. By leveraging the temporal dependencies present in the time series, MPR-Net not only achieves linear time complexity, but also makes the forecasting process interpretable. By carrying out sufficient experiments on more than ten real data sets of both short and long term forecasting tasks, MPR-Net achieves the state of the art forecasting performance, as well as good generalization and robustness performance.

Title: Min-Max Optimization under Delays. (arXiv:2307.06886v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06886
Code URL: null
Copy Paste: [[2307.06886] Min-Max Optimization under Delays](http://arxiv.org/abs/2307.06886) #robust
Summary:
Delays and asynchrony are inevitable in large-scale machine-learning problems where communication plays a key role. As such, several works have extensively analyzed stochastic optimization with delayed gradients. However, as far as we are aware, no analogous theory is available for min-max optimization, a topic that has gained recent popularity due to applications in adversarial robustness, game theory, and reinforcement learning. Motivated by this gap, we examine the performance of standard min-max optimization algorithms with delayed gradient updates. First, we show (empirically) that even small delays can cause prominent algorithms like Extra-gradient (\texttt{EG}) to diverge on simple instances for which \texttt{EG} guarantees convergence in the absence of delays. Our empirical study thus suggests the need for a careful analysis of delayed versions of min-max optimization algorithms. Accordingly, under suitable technical assumptions, we prove that Gradient Descent-Ascent (\texttt{GDA}) and \texttt{EG} with delayed updates continue to guarantee convergence to saddle points for convex-concave and strongly convex-strongly concave settings. Our complexity bounds reveal, in a transparent manner, the slow-down in convergence caused by delays.

biometric

Title: Personalized Anomaly Detection in PPG Data using Representation Learning and Biometric Identification. (arXiv:2307.06380v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06380
Code URL: null
Copy Paste: [[2307.06380] Personalized Anomaly Detection in PPG Data using Representation Learning and Biometric Identification](http://arxiv.org/abs/2307.06380) #biometric
Summary:
Photoplethysmography (PPG) signals, typically acquired from wearable devices, hold significant potential for continuous fitness-health monitoring. In particular, heart conditions that manifest in rare and subtle deviating heart patterns may be interesting. However, robust and reliable anomaly detection within these data remains a challenge due to the scarcity of labeled data and high inter-subject variability. This paper introduces a two-stage framework leveraging representation learning and personalization to improve anomaly detection performance in PPG data. The proposed framework first employs representation learning to transform the original PPG signals into a more discriminative and compact representation. We then apply three different unsupervised anomaly detection methods for movement detection and biometric identification. We validate our approach using two different datasets in both generalized and personalized scenarios. The results show that representation learning significantly improves anomaly detection performance while reducing the high inter-subject variability. Personalized models further enhance anomaly detection performance, underscoring the role of personalization in PPG-based fitness-health monitoring systems. The results from biometric identification show that it's easier to distinguish a new user from one intended authorized user than from a group of users. Overall, this study provides evidence of the effectiveness of representation learning and personalization for anomaly detection in PPG data.

steal

extraction

Title: Introduction to Facial Micro Expressions Analysis Using Color and Depth Images: A Matlab Coding Approach (Second Edition, 2023). (arXiv:2307.06396v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06396
Code URL: null
Copy Paste: [[2307.06396] Introduction to Facial Micro Expressions Analysis Using Color and Depth Images: A Matlab Coding Approach (Second Edition, 2023)](http://arxiv.org/abs/2307.06396) #extraction
Summary:
The book attempts to introduce a gentle introduction to the field of Facial Micro Expressions Recognition (FMER) using Color and Depth images, with the aid of MATLAB programming environment. FMER is a subset of image processing and it is a multidisciplinary topic to analysis. So, it requires familiarity with other topics of Artifactual Intelligence (AI) such as machine learning, digital image processing, psychology and more. So, it is a great opportunity to write a book which covers all of these topics for beginner to professional readers in the field of AI and even without having background of AI. Our goal is to provide a standalone introduction in the field of MFER analysis in the form of theorical descriptions for readers with no background in image processing with reproducible Matlab practical examples. Also, we describe any basic definitions for FMER analysis and MATLAB library which is used in the text, that helps final reader to apply the experiments in the real-world applications. We believe that this book is suitable for students, researchers, and professionals alike, who need to develop practical skills, along with a basic understanding of the field. We expect that, after reading this book, the reader feels comfortable with different key stages such as color and depth image processing, color and depth image representation, classification, machine learning, facial micro-expressions recognition, feature extraction and dimensionality reduction. The book attempts to introduce a gentle introduction to the field of Facial Micro Expressions Recognition (FMER) using Color and Depth images, with the aid of MATLAB programming environment.

Title: A Study on Differentiable Logic and LLMs for EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2023. (arXiv:2307.06569v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06569
Code URL: null
Copy Paste: [[2307.06569] A Study on Differentiable Logic and LLMs for EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2023](http://arxiv.org/abs/2307.06569) #extraction
Summary:
In this technical report, we present our findings from a study conducted on the EPIC-KITCHENS-100 Unsupervised Domain Adaptation task for Action Recognition. Our research focuses on the innovative application of a differentiable logic loss in the training to leverage the co-occurrence relations between verb and noun, as well as the pre-trained Large Language Models (LLMs) to generate the logic rules for the adaptation to unseen action labels. Specifically, the model's predictions are treated as the truth assignment of a co-occurrence logic formula to compute the logic loss, which measures the consistency between the predictions and the logic constraints. By using the verb-noun co-occurrence matrix generated from the dataset, we observe a moderate improvement in model performance compared to our baseline framework. To further enhance the model's adaptability to novel action labels, we experiment with rules generated using GPT-3.5, which leads to a slight decrease in performance. These findings shed light on the potential and challenges of incorporating differentiable logic and LLMs for knowledge extraction in unsupervised domain adaptation for action recognition. Our final submission (entitled `NS-LLM') achieved the first place in terms of top-1 action recognition accuracy.

Title: DGCNet: An Efficient 3D-Densenet based on Dynamic Group Convolution for Hyperspectral Remote Sensing Image Classification. (arXiv:2307.06667v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06667
Code URL: null
Copy Paste: [[2307.06667] DGCNet: An Efficient 3D-Densenet based on Dynamic Group Convolution for Hyperspectral Remote Sensing Image Classification](http://arxiv.org/abs/2307.06667) #extraction
Summary:
Deep neural networks face many problems in the field of hyperspectral image classification, lack of effective utilization of spatial spectral information, gradient disappearance and overfitting as the model depth increases. In order to accelerate the deployment of the model on edge devices with strict latency requirements and limited computing power, we introduce a lightweight model based on the improved 3D-Densenet model and designs DGCNet. It improves the disadvantage of group convolution. Referring to the idea of dynamic network, dynamic group convolution(DGC) is designed on 3d convolution kernel. DGC introduces small feature selectors for each grouping to dynamically decide which part of the input channel to connect based on the activations of all input channels. Multiple groups can capture different and complementary visual and semantic features of input images, allowing convolution neural network(CNN) to learn rich features. 3D convolution extracts high-dimensional and redundant hyperspectral data, and there is also a lot of redundant information between convolution kernels. DGC module allows 3D-Densenet to select channel information with richer semantic features and discard inactive regions. The 3D-CNN passing through the DGC module can be regarded as a pruned network. DGC not only allows 3D-CNN to complete sufficient feature extraction, but also takes into account the requirements of speed and calculation amount. The inference speed and accuracy have been improved, with outstanding performance on the IN, Pavia and KSC datasets, ahead of the mainstream hyperspectral image classification methods.

Title: Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events. (arXiv:2307.06439v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06439
Code URL: null
Copy Paste: [[2307.06439] Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events](http://arxiv.org/abs/2307.06439) #extraction
Summary:
Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities across a wide range of tasks, including health applications. In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised learning, substantial gains can be attained over out-of-box LLMs, with additional advantages such as cost, efficiency, and white-box model access.

We conduct a case study on adverse drug event (ADE) extraction, which is an important area for improving care. On standard ADE extraction evaluation, a GPT-3.5 distilled PubMedBERT model attained comparable accuracy as supervised state-of-the-art models without using any labeled data. Despite being over 1,000 times smaller, the distilled model outperformed its teacher GPT-3.5 by over 6 absolute points in F1 and GPT-4 by over 5 absolute points.

Ablation studies on distillation model choice (e.g., PubMedBERT vs BioGPT) and ADE extraction architecture shed light on best practice for biomedical knowledge extraction. Similar gains were attained by distillation for other standard biomedical knowledge extraction tasks such as gene-disease associations and protected health information, further illustrating the promise of this approach.

Title: Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach. (arXiv:2307.06540v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06540
Code URL: null
Copy Paste: [[2307.06540] Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach](http://arxiv.org/abs/2307.06540) #extraction
Summary:
This study addressed the complex task of sentiment analysis on a dataset of 119,988 original tweets from Weibo using a Convolutional Neural Network (CNN), offering a new approach to Natural Language Processing (NLP). The data, sourced from Baidu's PaddlePaddle AI platform, were meticulously preprocessed, tokenized, and categorized based on sentiment labels. A CNN-based model was utilized, leveraging word embeddings for feature extraction, and trained to perform sentiment classification. The model achieved a macro-average F1-score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiments. The findings underscore the effectiveness of CNNs for sentiment analysis tasks, with implications for practical applications in social media analysis, market research, and policy studies. The complete experimental content and code have been made publicly available on the Kaggle data platform for further research and development. Future work may involve exploring different architectures, such as Recurrent Neural Networks (RNN) or transformers, or using more complex pre-trained models like BERT, to further improve the model's ability to understand linguistic nuances and context.

Title: Parmesan: mathematical concept extraction for education. (arXiv:2307.06699v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06699
Code URL: null
Copy Paste: [[2307.06699] Parmesan: mathematical concept extraction for education](http://arxiv.org/abs/2307.06699) #extraction
Summary:
Mathematics is a highly specialized domain with its own unique set of challenges that has seen limited study in natural language processing. However, mathematics is used in a wide variety of fields and multidisciplinary research in many different domains often relies on an understanding of mathematical concepts. To aid researchers coming from other fields, we develop a prototype system for searching for and defining mathematical concepts in context, focusing on the field of category theory. This system, Parmesan, depends on natural language processing components including concept extraction, relation extraction, definition extraction, and entity linking. In developing this system, we show that existing techniques cannot be applied directly to the category theory domain, and suggest hybrid techniques that do perform well, though we expect the system to evolve over time. We also provide two cleaned mathematical corpora that power the prototype system, which are based on journal articles and wiki pages, respectively. The corpora have been annotated with dependency trees, lemmas, and part-of-speech tags.

membership infer

federate

Title: TinyMetaFed: Efficient Federated Meta-Learning for TinyML. (arXiv:2307.06822v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06822
Code URL: null
Copy Paste: [[2307.06822] TinyMetaFed: Efficient Federated Meta-Learning for TinyML](http://arxiv.org/abs/2307.06822) #federate
Summary:
The field of Tiny Machine Learning (TinyML) has made substantial advancements in democratizing machine learning on low-footprint devices, such as microcontrollers. The prevalence of these miniature devices raises the question of whether aggregating their knowledge can benefit TinyML applications. Federated meta-learning is a promising answer to this question, as it addresses the scarcity of labeled data and heterogeneous data distribution across devices in the real world. However, deploying TinyML hardware faces unique resource constraints, making existing methods impractical due to energy, privacy, and communication limitations. We introduce TinyMetaFed, a model-agnostic meta-learning framework suitable for TinyML. TinyMetaFed facilitates collaborative training of a neural network initialization that can be quickly fine-tuned on new devices. It offers communication savings and privacy protection through partial local reconstruction and Top-P% selective communication, computational efficiency via online learning, and robustness to client heterogeneity through few-shot learning. The evaluations on three TinyML use cases demonstrate that TinyMetaFed can significantly reduce energy consumption and communication overhead, accelerate convergence, and stabilize the training process.

Title: FDAPT: Federated Domain-adaptive Pre-training for Language Models. (arXiv:2307.06933v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06933
Code URL: null
Copy Paste: [[2307.06933] FDAPT: Federated Domain-adaptive Pre-training for Language Models](http://arxiv.org/abs/2307.06933) #federate
Summary:
Combining Domain-adaptive Pre-training (DAPT) with Federated Learning (FL) can enhance model adaptation by leveraging more sensitive and distributed data while preserving data privacy. However, few studies have focused on this method. Therefore, we conduct the first comprehensive empirical study to evaluate the performance of Federated Domain-adaptive Pre-training (FDAPT). We demonstrate that FDAPT can maintain competitive downstream task performance to the centralized baseline in both IID and non-IID situations. Furthermore, we propose a novel algorithm, Frozen Federated Domain-adaptive Pre-training (FFDAPT). FFDAPT improves the computational efficiency by 12.1% on average and exhibits similar downstream task performance to standard FDAPT, with general performance fluctuations remaining less than 1%. Finally, through a critical evaluation of our work, we identify promising future research directions for this new research area.

fair

Title: Identifying Early Help Referrals For Local Authorities With Machine Learning And Bias Analysis. (arXiv:2307.06871v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06871
Code URL: null
Copy Paste: [[2307.06871] Identifying Early Help Referrals For Local Authorities With Machine Learning And Bias Analysis](http://arxiv.org/abs/2307.06871) #fair
Summary:
Local authorities in England, such as Leicestershire County Council (LCC), provide Early Help services that can be offered at any point in a young person's life when they experience difficulties that cannot be supported by universal services alone, such as schools. This paper investigates the utilisation of machine learning (ML) to assist experts in identifying families that may need to be referred for Early Help assessment and support. LCC provided an anonymised dataset comprising 14360 records of young people under the age of 18. The dataset was pre-processed, machine learning models were build, and experiments were conducted to validate and test the performance of the models. Bias mitigation techniques were applied to improve the fairness of these models. During testing, while the models demonstrated the capability to identify young people requiring intervention or early help, they also produced a significant number of false positives, especially when constructed with imbalanced data, incorrectly identifying individuals who most likely did not need an Early Help referral. This paper empirically explores the suitability of data-driven ML models for identifying young people who may require Early Help services and discusses their appropriateness and limitations for this task.

interpretability

Title: Weakly supervised marine animal detection from remote sensing images using vector-quantized variational autoencoder. (arXiv:2307.06720v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06720
Code URL: null
Copy Paste: [[2307.06720] Weakly supervised marine animal detection from remote sensing images using vector-quantized variational autoencoder](http://arxiv.org/abs/2307.06720) #interpretability
Summary:
This paper studies a reconstruction-based approach for weakly-supervised animal detection from aerial images in marine environments. Such an approach leverages an anomaly detection framework that computes metrics directly on the input space, enhancing interpretability and anomaly localization compared to feature embedding methods. Building upon the success of Vector-Quantized Variational Autoencoders in anomaly detection on computer vision datasets, we adapt them to the marine animal detection domain and address the challenge of handling noisy data. To evaluate our approach, we compare it with existing methods in the context of marine animal detection from aerial image data. Experiments conducted on two dedicated datasets demonstrate the superior performance of the proposed method over recent studies in the literature. Our framework offers improved interpretability and localization of anomalies, providing valuable insights for monitoring marine ecosystems and mitigating the impact of human activities on marine animals.

Title: Uncovering Unique Concept Vectors through Latent Space Decomposition. (arXiv:2307.06913v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06913
Code URL: null
Copy Paste: [[2307.06913] Uncovering Unique Concept Vectors through Latent Space Decomposition](http://arxiv.org/abs/2307.06913) #interpretability
Summary:
Interpreting the inner workings of deep learning models is crucial for establishing trust and ensuring model safety. Concept-based explanations have emerged as a superior approach that is more interpretable than feature attribution estimates such as pixel saliency. However, defining the concepts for the interpretability analysis biases the explanations by the user's expectations on the concepts. To address this, we propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training. By decomposing the latent space of a layer in singular vectors and refining them by unsupervised clustering, we uncover concept vectors aligned with directions of high variance that are relevant to the model prediction, and that point to semantically distinct concepts. Our extensive experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand. Moreover, we showcase the practical utility of our method in dataset exploration, where our concept vectors successfully identify outlier training samples affected by various confounding factors. This novel exploration technique has remarkable versatility to data types and model architectures and it will facilitate the identification of biases and the discovery of sources of error within training data.

Title: DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering. (arXiv:2307.06869v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06869
Code URL: null
Copy Paste: [[2307.06869] DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering](http://arxiv.org/abs/2307.06869) #interpretability
Summary:
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. Specifically, most of the well-performed metrics are required to train on evaluation datasets of specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance in untrained metrics for evaluating text summarization and dialogue generation, which also exhibits strong dimension-level / task-level generalization ability and interpretability.

Title: Trainability, Expressivity and Interpretability in Gated Neural ODEs. (arXiv:2307.06398v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06398
Code URL: null
Copy Paste: [[2307.06398] Trainability, Expressivity and Interpretability in Gated Neural ODEs](http://arxiv.org/abs/2307.06398) #interpretability
Summary:
Understanding how the dynamics in biological and artificial neural networks implement the computations required for a task is a salient open question in machine learning and neuroscience. In particular, computations requiring complex memory storage and retrieval pose a significant challenge for these networks to implement or learn. Recently, a family of models described by neural ordinary differential equations (nODEs) has emerged as powerful dynamical neural network models capable of capturing complex dynamics. Here, we extend nODEs by endowing them with adaptive timescales using gating interactions. We refer to these as gated neural ODEs (gnODEs). Using a task that requires memory of continuous quantities, we demonstrate the inductive bias of the gnODEs to learn (approximate) continuous attractors. We further show how reduced-dimensional gnODEs retain their modeling power while greatly improving interpretability, even allowing explicit visualization of the structure of learned attractors. We introduce a novel measure of expressivity which probes the capacity of a neural network to generate complex trajectories. Using this measure, we explore how the phase-space dimension of the nODEs and the complexity of the function modeling the flow field contribute to expressivity. We see that a more complex function for modeling the flow field allows a lower-dimensional nODE to capture a given target dynamics. Finally, we demonstrate the benefit of gating in nODEs on several real-world tasks.

Title: Cramer Type Distances for Learning Gaussian Mixture Models by Gradient Descent. (arXiv:2307.06753v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06753
Code URL: null
Copy Paste: [[2307.06753] Cramer Type Distances for Learning Gaussian Mixture Models by Gradient Descent](http://arxiv.org/abs/2307.06753) #interpretability
Summary:
The learning of Gaussian Mixture Models (also referred to simply as GMMs) plays an important role in machine learning. Known for their expressiveness and interpretability, Gaussian mixture models have a wide range of applications, from statistics, computer vision to distributional reinforcement learning. However, as of today, few known algorithms can fit or learn these models, some of which include Expectation-Maximization algorithms and Sliced Wasserstein Distance. Even fewer algorithms are compatible with gradient descent, the common learning process for neural networks.

In this paper, we derive a closed formula of two GMMs in the univariate, one-dimensional case, then propose a distance function called Sliced Cram\'er 2-distance for learning general multivariate GMMs. Our approach has several advantages over many previous methods. First, it has a closed-form expression for the univariate case and is easy to compute and implement using common machine learning libraries (e.g., PyTorch and TensorFlow). Second, it is compatible with gradient descent, which enables us to integrate GMMs with neural networks seamlessly. Third, it can fit a GMM not only to a set of data points, but also to another GMM directly, without sampling from the target model. And fourth, it has some theoretical guarantees like global gradient boundedness and unbiased sampling gradient. These features are especially useful for distributional reinforcement learning and Deep Q Networks, where the goal is to learn a distribution over future rewards. We will also construct a Gaussian Mixture Distributional Deep Q Network as a toy example to demonstrate its effectiveness. Compared with previous models, this model is parameter efficient in terms of representing a distribution and possesses better interpretability.

explainability

Title: Assessment of the suitability of degradation models for the planning of CCTV inspections of sewer pipes. (arXiv:2307.06341v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06341
Code URL: null
Copy Paste: [[2307.06341] Assessment of the suitability of degradation models for the planning of CCTV inspections of sewer pipes](http://arxiv.org/abs/2307.06341) #explainability
Summary:
The degradation of sewer pipes poses significant economical, environmental and health concerns. The maintenance of such assets requires structured plans to perform inspections, which are more efficient when structural and environmental features are considered along with the results of previous inspection reports. The development of such plans requires degradation models that can be based on statistical and machine learning methods. This work proposes a methodology to assess their suitability to plan inspections considering three dimensions: accuracy metrics, ability to produce long-term degradation curves and explainability. Results suggest that although ensemble models yield the highest accuracy, they are unable to infer the long-term degradation of the pipes, whereas the Logistic Regression offers a slightly less accurate model that is able to produce consistent degradation curves with a high explainability. A use case is presented to demonstrate this methodology and the efficiency of model-based planning compared to the current inspection plan.

watermark

Title: Towards Traitor Tracing in Black-and-White-Box DNN Watermarking with Tardos-based Codes. (arXiv:2307.06695v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.06695
Code URL: null
Copy Paste: [[2307.06695] Towards Traitor Tracing in Black-and-White-Box DNN Watermarking with Tardos-based Codes](http://arxiv.org/abs/2307.06695) #watermark
Summary:
The growing popularity of Deep Neural Networks, which often require computationally expensive training and access to a vast amount of data, calls for accurate authorship verification methods to deter unlawful dissemination of the models and identify the source of the leak. In DNN watermarking the owner may have access to the full network (white-box) or only be able to extract information from its output to queries (black-box), but a watermarked model may include both approaches in order to gather sufficient evidence to then gain access to the network. Although there has been limited research in white-box watermarking that considers traitor tracing, this problem is yet to be explored in the black-box scenario. In this paper, we propose a black-and-white-box watermarking method that opens the door to collusion-resistant traitor tracing in black-box, exploiting the properties of Tardos codes, and making it possible to identify the source of the leak before access to the model is granted. While experimental results show that the method can successfully identify traitors, even when further attacks have been performed, we also discuss its limitations and open problems for traitor tracing in black-box.

diffusion

Title: Improving Nonalcoholic Fatty Liver Disease Classification Performance With Latent Diffusion Models. (arXiv:2307.06507v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06507
Code URL: null
Copy Paste: [[2307.06507] Improving Nonalcoholic Fatty Liver Disease Classification Performance With Latent Diffusion Models](http://arxiv.org/abs/2307.06507) #diffusion
Summary:
Integrating deep learning with clinical expertise holds great potential for addressing healthcare challenges and empowering medical professionals with improved diagnostic tools. However, the need for annotated medical images is often an obstacle to leveraging the full power of machine learning models. Our research demonstrates that by combining synthetic images, generated using diffusion models, with real images, we can enhance nonalcoholic fatty liver disease (NAFLD) classification performance. We evaluate the quality of the synthetic images by comparing two metrics: Inception Score (IS) and Fr\'{e}chet Inception Distance (FID), computed on diffusion-generated images and generative adversarial networks (GANs)-generated images. Our results show superior performance for the diffusion-generated images, with a maximum IS score of $1.90$ compared to $1.67$ for GANs, and a minimum FID score of $69.45$ compared to $99.53$ for GANs. Utilizing a partially frozen CNN backbone (EfficientNet v1), our synthetic augmentation method achieves a maximum image-level ROC AUC of $0.904$ on a NAFLD prediction task.

Title: AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion. (arXiv:2307.06526v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06526
Code URL: null
Copy Paste: [[2307.06526] AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion](http://arxiv.org/abs/2307.06526) #diffusion
Summary:
Large-scale pre-trained vision-language models allow for the zero-shot text-based generation of 3D avatars. The previous state-of-the-art method utilized CLIP to supervise neural implicit models that reconstructed a human body mesh. However, this approach has two limitations. Firstly, the lack of avatar-specific models can cause facial distortion and unrealistic clothing in the generated avatars. Secondly, CLIP only provides optimization direction for the overall appearance, resulting in less impressive results. To address these limitations, we propose AvatarFusion, the first framework to use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars while simultaneously segmenting clothing from the avatar's body. AvatarFusion includes the first clothing-decoupled neural implicit avatar model that employs a novel Dual Volume Rendering strategy to render the decoupled skin and clothing sub-models in one space. We also introduce a novel optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes, and generates a variety of clothing styles. Moreover, we establish the first benchmark for zero-shot text-to-avatar generation. Our experimental results demonstrate that our framework outperforms previous approaches, with significant improvements observed in all metrics. Additionally, since our model is clothing-decoupled, we can exchange the clothes of avatars. Code will be available on Github.

Title: HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models. (arXiv:2307.06949v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06949
Code URL: null
Copy Paste: [[2307.06949] HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models](http://arxiv.org/abs/2307.06949) #diffusion
Summary:
Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles, while retaining high-fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth-a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject details while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Also our method yields a model that is 10000x smaller than a normal DreamBooth model. Project page: https://hyperdreambooth.github.io

noise learning

data-free

transformer

Title: ConvNeXt-ChARM: ConvNeXt-based Transform for Efficient Neural Image Compression. (arXiv:2307.06342v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06342
Code URL: null
Copy Paste: [[2307.06342] ConvNeXt-ChARM: ConvNeXt-based Transform for Efficient Neural Image Compression](http://arxiv.org/abs/2307.06342) #transformer
Summary:
Over the last few years, neural image compression has gained wide attention from research and industry, yielding promising end-to-end deep neural codecs outperforming their conventional counterparts in rate-distortion performance. Despite significant advancement, current methods, including attention-based transform coding, still need to be improved in reducing the coding rate while preserving the reconstruction fidelity, especially in non-homogeneous textured image areas. Those models also require more parameters and a higher decoding time. To tackle the above challenges, we propose ConvNeXt-ChARM, an efficient ConvNeXt-based transform coding framework, paired with a compute-efficient channel-wise auto-regressive prior to capturing both global and local contexts from the hyper and quantized latent representations. The proposed architecture can be optimized end-to-end to fully exploit the context information and extract compact latent representation while reconstructing higher-quality images. Experimental results on four widely-used datasets showed that ConvNeXt-ChARM brings consistent and significant BD-rate (PSNR) reductions estimated on average to 5.24% and 1.22% over the versatile video coding (VVC) reference encoder (VTM-18.0) and the state-of-the-art learned image compression method SwinT-ChARM, respectively. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the next generation ConvNet, namely ConvNeXt, and Swin Transformer.

Title: RaBiT: An Efficient Transformer using Bidirectional Feature Pyramid Network with Reverse Attention for Colon Polyp Segmentation. (arXiv:2307.06420v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06420
Code URL: null
Copy Paste: [[2307.06420] RaBiT: An Efficient Transformer using Bidirectional Feature Pyramid Network with Reverse Attention for Colon Polyp Segmentation](http://arxiv.org/abs/2307.06420) #transformer
Summary:
Automatic and accurate segmentation of colon polyps is essential for early diagnosis of colorectal cancer. Advanced deep learning models have shown promising results in polyp segmentation. However, they still have limitations in representing multi-scale features and generalization capability. To address these issues, this paper introduces RaBiT, an encoder-decoder model that incorporates a lightweight Transformer-based architecture in the encoder to model multiple-level global semantic relationships. The decoder consists of several bidirectional feature pyramid layers with reverse attention modules to better fuse feature maps at various levels and incrementally refine polyp boundaries. We also propose ideas to lighten the reverse attention module and make it more suitable for multi-class segmentation. Extensive experiments on several benchmark datasets show that our method outperforms existing methods across all datasets while maintaining low computational complexity. Moreover, our method demonstrates high generalization capability in cross-dataset experiments, even when the training and test sets have different characteristics.

Title: Efficient Convolution and Transformer-Based Network for Video Frame Interpolation. (arXiv:2307.06443v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06443
Code URL: null
Copy Paste: [[2307.06443] Efficient Convolution and Transformer-Based Network for Video Frame Interpolation](http://arxiv.org/abs/2307.06443) #transformer
Summary:
Video frame interpolation is an increasingly important research task with several key industrial applications in the video coding, broadcast and production sectors. Recently, transformers have been introduced to the field resulting in substantial performance gains. However, this comes at a cost of greatly increased memory usage, training and inference time. In this paper, a novel method integrating a transformer encoder and convolutional features is proposed. This network reduces the memory burden by close to 50% and runs up to four times faster during inference time compared to existing transformer-based interpolation methods. A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with those of the transformer for long-range dependencies. Quantitative evaluations are conducted on various benchmarks with complex motion to showcase the robustness of the proposed method, achieving competitive performance compared to state-of-the-art interpolation networks.

Title: Transformer-based end-to-end classification of variable-length volumetric data. (arXiv:2307.06666v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06666
Code URL: null
Copy Paste: [[2307.06666] Transformer-based end-to-end classification of variable-length volumetric data](http://arxiv.org/abs/2307.06666) #transformer
Summary:
The automatic classification of 3D medical data is memory-intensive. Also, variations in the number of slices between samples is common. Naive solutions such as subsampling can solve these problems, but at the cost of potentially eliminating relevant diagnosis information. Transformers have shown promising performance for sequential data analysis. However, their application for long-sequences is data, computationally, and memory demanding. In this paper, we propose an end-to-end Transformer-based framework that allows to classify volumetric data of variable length in an efficient fashion. Particularly, by randomizing the input slice-wise resolution during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the accumulated positional information in each positional embedding can be generalized to the neighbouring slices, even for high resolution volumes at the test time. By doing so, the model will be more robust to variable volume length and amenable to different computational budgets. We evaluated the proposed approach in retinal OCT volume classification and achieved 21.96% average improvement in balanced accuracy on a 9-class diagnostic task, compared to state-of-the-art video transformers. Our findings show that varying the slice-wise resolution of the input during training results in more informative volume representation as compared to training with fixed number of slices per volume. Our code is available at: https://github.com/marziehoghbaie/VLFAT.

Title: Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition. (arXiv:2307.06947v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06947
Code URL: null
Copy Paste: [[2307.06947] Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition](http://arxiv.org/abs/2307.06947) #transformer
Summary:
Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably well against the state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost. Our code/models are released at https://github.com/TalalWasim/Video-FocalNets.

Title: No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models. (arXiv:2307.06440v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06440
Code URL: null
Copy Paste: [[2307.06440] No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models](http://arxiv.org/abs/2307.06440) #transformer
Summary:
The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine which we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: https://github.com/JeanKaddour/NoTrainNoGain.

generative

Title: T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation. (arXiv:2307.06350v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06350
Code URL: null
Copy Paste: [[2307.06350] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation](http://arxiv.org/abs/2307.06350) #generative
Summary:
Despite the stunning ability to generate high-quality images by recent text-to-image models, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench, and to validate the effectiveness of our proposed evaluation metrics and GORS approach. Project page is available at https://karine-h.github.io/T2I-CompBench/.

Title: Tensor Decompositions Meet Control Theory: Learning General Mixtures of Linear Dynamical Systems. (arXiv:2307.06538v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06538
Code URL: null
Copy Paste: [[2307.06538] Tensor Decompositions Meet Control Theory: Learning General Mixtures of Linear Dynamical Systems](http://arxiv.org/abs/2307.06538) #generative
Summary:
Recently Chen and Poor initiated the study of learning mixtures of linear dynamical systems. While linear dynamical systems already have wide-ranging applications in modeling time-series data, using mixture models can lead to a better fit or even a richer understanding of underlying subpopulations represented in the data. In this work we give a new approach to learning mixtures of linear dynamical systems that is based on tensor decompositions. As a result, our algorithm succeeds without strong separation conditions on the components, and can be used to compete with the Bayes optimal clustering of the trajectories. Moreover our algorithm works in the challenging partially-observed setting. Our starting point is the simple but powerful observation that the classic Ho-Kalman algorithm is a close relative of modern tensor decomposition methods for learning latent variable models. This gives us a playbook for how to extend it to work with more complicated generative models.

Title: GRAN is superior to GraphRNN: node orderings, kernel- and graph embeddings-based metrics for graph generators. (arXiv:2307.06709v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.06709
Code URL: null
Copy Paste: [[2307.06709] GRAN is superior to GraphRNN: node orderings, kernel- and graph embeddings-based metrics for graph generators](http://arxiv.org/abs/2307.06709) #generative
Summary:
A wide variety of generative models for graphs have been proposed. They are used in drug discovery, road networks, neural architecture search, and program synthesis. Generating graphs has theoretical challenges, such as isomorphic representations -- evaluating how well a generative model performs is difficult. Which model to choose depending on the application domain?

We extensively study kernel-based metrics on distributions of graph invariants and manifold-based and kernel-based metrics in graph embedding space. Manifold-based metrics outperform kernel-based metrics in embedding space. We use these metrics to compare GraphRNN and GRAN, two well-known generative models for graphs, and unveil the influence of node orderings. It shows the superiority of GRAN over GraphRNN - further, our proposed adaptation of GraphRNN with a depth-first search ordering is effective for small-sized graphs.

A guideline on good practices regarding dataset selection and node feature initialization is provided. Our work is accompanied by open-source code and reproducible experiments.

large language model

Title: Garbage in, garbage out: Zero-shot detection of crime using Large Language Models. (arXiv:2307.06844v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06844
Code URL: null
Copy Paste: [[2307.06844] Garbage in, garbage out: Zero-shot detection of crime using Large Language Models](http://arxiv.org/abs/2307.06844) #large language model
Summary:
This paper proposes exploiting the common sense knowledge learned by large language models to perform zero-shot reasoning about crimes given textual descriptions of surveillance videos. We show that when video is (manually) converted to high quality textual descriptions, large language models are capable of detecting and classifying crimes with state-of-the-art performance using only zero-shot reasoning. However, existing automated video-to-text approaches are unable to generate video descriptions of sufficient quality to support reasoning (garbage video descriptions into the large language model, garbage out).

Title: mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs. (arXiv:2307.06930v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06930
Code URL: null
Copy Paste: [[2307.06930] mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs](http://arxiv.org/abs/2307.06930) #large language model
Summary:
Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to `understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM -- for this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on magnitudes less data. We release our model and code at \url{https://github.com/gregor-ge/mBLIP}.

Title: InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. (arXiv:2307.06942v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06942
Code URL: https://github.com/opengvlab/internvideo
Copy Paste: [[2307.06942] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation](http://arxiv.org/abs/2307.06942) #large language model
Summary:
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.

Title: A Comprehensive Overview of Large Language Models. (arXiv:2307.06435v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06435
Code URL: null
Copy Paste: [[2307.06435] A Comprehensive Overview of Large Language Models](http://arxiv.org/abs/2307.06435) #large language model
Summary:
Large Language Models (LLMs) have shown excellent generalization capabilities that have led to the development of numerous models. These models propose various new architectures, tweaking existing architectures with refined training strategies, increasing context length, using high-quality training data, and increasing training time to outperform baselines. Analyzing new developments is crucial for identifying changes that enhance training stability and improve generalization in LLMs. This survey paper comprehensively analyses the LLMs architectures and their categorization, training strategies, training datasets, and performance evaluations and discusses future research directions. Moreover, the paper also discusses the basic building blocks and concepts behind LLMs, followed by a complete overview of LLMs, including their important features and functions. Finally, the paper summarizes significant findings from LLM research and consolidates essential architectural and training strategies for developing advanced LLMs. Given the continuous advancements in LLMs, we intend to regularly update this paper by incorporating new sections and featuring the latest LLM models.

Title: Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study. (arXiv:2307.06530v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06530
Code URL: null
Copy Paste: [[2307.06530] Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study](http://arxiv.org/abs/2307.06530) #large language model
Summary:
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLM's in-context learning for ASR applications. Despite further exploration with varied settings and models, the corrected sentences from the LLMs frequently resulted in higher Word Error Rates (WER), demonstrating the limitations of LLMs in speech applications. This paper provides a detailed overview of these experiments, their results, and implications, establishing that using LLMs' in-context learning capabilities to correct potential errors in speech recognition transcriptions is still a challenging task at the current stage.

Title: Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models. (arXiv:2307.06713v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06713
Code URL: null
Copy Paste: [[2307.06713] Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models](http://arxiv.org/abs/2307.06713) #large language model
Summary:
A wide variety of natural language tasks are currently being addressed with large-scale language models (LLMs). These models are usually trained with a very large amount of unsupervised text data and adapted to perform a downstream natural language task using methods like fine-tuning, calibration or in-context learning. In this work, we propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples and only few in-domain sample queries. The proposed approach treats the LLM as a black box, adding a stage where the model posteriors are calibrated to the task. Results show that these methods outperform the un-adapted model for different number of training shots in the prompt and a previous approach were calibration is performed without using any adaptation data.

Title: Negated Complementary Commonsense using Large Language Models. (arXiv:2307.06794v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06794
Code URL: null
Copy Paste: [[2307.06794] Negated Complementary Commonsense using Large Language Models](http://arxiv.org/abs/2307.06794) #large language model
Summary:
Larger language models, such as GPT-3, have shown to be excellent in many tasks. However, we demonstrate that out-of-ordinary questions can throw the model off guard. This work focuses on finding answers to negated complementary questions in commonsense scenarios. We illustrate how such questions adversely affect the model responses. We propose a model-agnostic methodology to improve the performance in negated complementary scenarios. Our method outperforms few-shot generation from GPT-3 (by more than 11 points) and, more importantly, highlights the significance of studying the response of large language models in negated complementary questions. The code, data, and experiments are available under: https://github.com/navidre/negated_complementary_commonsense.

Title: In-context Autoencoder for Context Compression in a Large Language Model. (arXiv:2307.06945v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.06945
Code URL: null
Copy Paste: [[2307.06945] In-context Autoencoder for Context Compression in a Large Language Model](http://arxiv.org/abs/2307.06945) #large language model
Summary:
We propose the In-context Autoencoder (ICAE) for context compression in a large language model (LLM). The ICAE has two modules: a learnable encoder adapted with LoRA from an LLM for compressing a long context into a limited number of memory slots, and a fixed decoder which is the target LLM that can condition on the memory slots for various purposes. We first pretrain the ICAE using both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context. Then, we fine-tune the pretrained ICAE on a small amount of instruct data to enhance its interaction with various prompts for producing desirable responses. Our experimental results demonstrate that the ICAE learned with our proposed pretraining and fine-tuning paradigm can effectively produce memory slots with $4\times$ context compression, which can be well conditioned on by the target LLM to respond to various prompts. The promising results demonstrate significant implications of the ICAE for its novel approach to the long context problem and its potential to reduce computation and memory overheads for LLM inference in practice, suggesting further research effort in context management for an LLM. Our code and data will be released shortly.

segmentation

Title: RVD: A Handheld Device-Based Fundus Video Dataset for Retinal Vessel Segmentation. (arXiv:2307.06577v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06577
Code URL: null
Copy Paste: [[2307.06577] RVD: A Handheld Device-Based Fundus Video Dataset for Retinal Vessel Segmentation](http://arxiv.org/abs/2307.06577) #segmentation
Summary:
Retinal vessel segmentation is generally grounded in image-based datasets collected with bench-top devices. The static images naturally lose the dynamic characteristics of retina fluctuation, resulting in diminished dataset richness, and the usage of bench-top devices further restricts dataset scalability due to its limited accessibility. Considering these limitations, we introduce the first video-based retinal dataset by employing handheld devices for data acquisition. The dataset comprises 635 smartphone-based fundus videos collected from four different clinics, involving 415 patients from 50 to 75 years old. It delivers comprehensive and precise annotations of retinal structures in both spatial and temporal dimensions, aiming to advance the landscape of vasculature segmentation. Specifically, the dataset provides three levels of spatial annotations: binary vessel masks for overall retinal structure delineation, general vein-artery masks for distinguishing the vein and artery, and fine-grained vein-artery masks for further characterizing the granularities of each artery and vein. In addition, the dataset offers temporal annotations that capture the vessel pulsation characteristics, assisting in detecting ocular diseases that require fine-grained recognition of hemodynamic fluctuation. In application, our dataset exhibits a significant domain shift with respect to data captured by bench-top devices, thus posing great challenges to existing methods. In the experiments, we provide evaluation metrics and benchmark results on our dataset, reflecting both the potential and challenges it offers for vessel segmentation tasks. We hope this challenging dataset would significantly contribute to the development of eye disease diagnosis and early prevention.

Title: YOLIC: An Efficient Method for Object Localization and Classification on Edge Devices. (arXiv:2307.06689v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.06689
Code URL: null
Copy Paste: [[2307.06689] YOLIC: An Efficient Method for Object Localization and Classification on Edge Devices](http://arxiv.org/abs/2307.06689) #segmentation
Summary:
In the realm of Tiny AI, we introduce "You Only Look at Interested Cells" (YOLIC), an efficient method for object localization and classification on edge devices. Seamlessly blending the strengths of semantic segmentation and object detection, YOLIC offers superior computational efficiency and precision. By adopting Cells of Interest for classification instead of individual pixels, YOLIC encapsulates relevant information, reduces computational load, and enables rough object shape inference. Importantly, the need for bounding box regression is obviated, as YOLIC capitalizes on the predetermined cell configuration that provides information about potential object location, size, and shape. To tackle the issue of single-label classification limitations, a multi-label classification approach is applied to each cell, effectively recognizing overlapping or closely situated objects. This paper presents extensive experiments on multiple datasets, demonstrating that YOLIC achieves detection performance comparable to the state-of-the-art YOLO algorithms while surpassing in speed, exceeding 30fps on a Raspberry Pi 4B CPU. All resources related to this study, including datasets, cell designer, image annotation tool, and source code, have been made publicly available on our project website at https://kai3316.github.io/yolic.github.io