secure

Title: PRO-Face S: Privacy-preserving Reversible Obfuscation of Face Images via Secure Flow. (arXiv:2307.09146v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09146
Code URL: null
Copy Paste: [[2307.09146] PRO-Face S: Privacy-preserving Reversible Obfuscation of Face Images via Secure Flow](http://arxiv.org/abs/2307.09146) #secure
Summary:
This paper proposes a novel paradigm for facial privacy protection that unifies multiple characteristics including anonymity, diversity, reversibility and security within a single lightweight framework. We name it PRO-Face S, short for Privacy-preserving Reversible Obfuscation of Face images via Secure flow-based model. In the framework, an Invertible Neural Network (INN) is utilized to process the input image along with its pre-obfuscated form, and generate the privacy protected image that visually approximates to the pre-obfuscated one, thus ensuring privacy. The pre-obfuscation applied can be in diversified form with different strengths and styles specified by users. Along protection, a secret key is injected into the network such that the original image can only be recovered from the protection image via the same model given the correct key provided. Two modes of image recovery are devised to deal with malicious recovery attempts in different scenarios. Finally, extensive experiments conducted on three public image datasets demonstrate the superiority of the proposed framework over multiple state-of-the-art approaches.

Title: Measuring the Leakage and Exploitability of Authentication Secrets in Super-apps: The WeChat Case. (arXiv:2307.09317v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09317
Code URL: null
Copy Paste: [[2307.09317] Measuring the Leakage and Exploitability of Authentication Secrets in Super-apps: The WeChat Case](http://arxiv.org/abs/2307.09317) #secure
Summary:
We conduct a large-scale measurement of developers' insecure practices leading to mini-app to super-app authentication bypass, among which hard-coding developer secrets for such authentication is a major contributor. We also analyze the exploitability and security consequences of developer secret leakage in mini-apps by examining individual super-app server-side APIs. We develop an analysis framework for measuring such secret leakage, and primarily analyze 110,993 WeChat mini-apps, and 10,000 Baidu mini-apps (two of the most prominent super-app platforms), along with a few more datasets to test the evolution of developer practices and platform security enforcement over time. We found a large number of WeChat mini-apps (36,425, 32.8%) and a few Baidu mini-apps (112) leak their developer secrets, which can cause severe security and privacy problems for the users and developers of mini-apps. A network attacker who does not even have an account on the super-app platform, can effectively take down a mini-app, send malicious and phishing links to users, and access sensitive information of the mini-app developer and its users. We responsibly disclosed our findings and also put forward potential directions that could be considered to alleviate/eliminate the root causes of developers hard-coding the app secrets in the mini-app's front-end code.

Title: A New Hybrid Cryptosystem Involving DNA,Rabin, One Time Pad and Fiestel. (arXiv:2307.09322v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09322
Code URL: null
Copy Paste: [[2307.09322] A New Hybrid Cryptosystem Involving DNA,Rabin, One Time Pad and Fiestel](http://arxiv.org/abs/2307.09322) #secure
Summary:
Information security is a crucial need in the modern world. Data security is a real concern, and many customers and organizations need to protect their sensitive information from unauthorized parties and attackers. In previous years, numerous cryptographic schemes have been proposed. DNA cryptography is a new and developing field that combines the computational and biological worlds. DNA cryptography is intriguing due to its high storage capacity, secure data transport, and massive parallel computing. In this paper, a new combination is proposed that offers good security by combining DNA, the Rabin algorithm, one time pad, and a structure inspired by Fiestel. This algorithm employs two keys. The first key is a DNA OTP key which is used for only one secure communication session. The second key, which combines the public and private keys, is a Rabin key. Additionally, by using a Feistel inspired scheme and randomness provided by DNA, the ciphertext is made harder to obtain without the private key.

security

Title: Experimental Security Analysis of DNN-based Adaptive Cruise Control under Context-Aware Perception Attacks. (arXiv:2307.08939v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.08939
Code URL: null
Copy Paste: [[2307.08939] Experimental Security Analysis of DNN-based Adaptive Cruise Control under Context-Aware Perception Attacks](http://arxiv.org/abs/2307.08939) #security
Summary:
Adaptive Cruise Control (ACC) is a widely used driver assistance feature for maintaining desired speed and safe distance to the leading vehicles. This paper evaluates the security of the deep neural network (DNN) based ACC systems under stealthy perception attacks that strategically inject perturbations into camera data to cause forward collisions. We present a combined knowledge-and-data-driven approach to design a context-aware strategy for the selection of the most critical times for triggering the attacks and a novel optimization-based method for the adaptive generation of image perturbations at run-time. We evaluate the effectiveness of the proposed attack using an actual driving dataset and a realistic simulation platform with the control software from a production ACC system and a physical-world driving simulator while considering interventions by the driver and safety features such as Automatic Emergency Braking (AEB) and Forward Collision Warning (FCW). Experimental results show that the proposed attack achieves 142.9x higher success rate in causing accidents than random attacks and is mitigated 89.6% less by the safety features while being stealthy and robust to real-world factors and dynamic changes in the environment. This study provides insights into the role of human operators and basic safety interventions in preventing attacks.

Title: A Note on the Security of ITS: Car Crash Analysis in Cruise Control Scenarios. (arXiv:2307.08899v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.08899
Code URL: null
Copy Paste: [[2307.08899] A Note on the Security of ITS: Car Crash Analysis in Cruise Control Scenarios](http://arxiv.org/abs/2307.08899) #security
Summary:
Security of Intelligent Transportation Systems (ITS) heavily depends on the security of the underlying components that create such a smart ecosystem. Adaptive Cruise Control (ACC) is embedded into most modern vehicles. In this report, we study the situations that the two vehicles involved in a cruise control scenario create. More precisely, after breaking down the phases the two vehicle go through (especially the ego one), we show how a simple formula can be used to predict collisions in hard brake cruise control scenarios.

Title: EsaNet: Environment Semantics Enabled Physical Layer Authentication. (arXiv:2307.08946v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.08946
Code URL: null
Copy Paste: [[2307.08946] EsaNet: Environment Semantics Enabled Physical Layer Authentication](http://arxiv.org/abs/2307.08946) #security
Summary:
Wireless networks are vulnerable to physical layer spoofing attacks due to the wireless broadcast nature, thus, integrating communications and security (ICAS) is urgently needed for 6G endogenous security. In this letter, we propose an environment semantics enabled physical layer authentication network based on deep learning, namely EsaNet, to authenticate the spoofing from the underlying wireless protocol. Specifically, the frequency independent wireless channel fingerprint (FiFP) is extracted from the channel state information (CSI) of a massive multi-input multi-output (MIMO) system based on environment semantics knowledge. Then, we transform the received signal into a two-dimensional red green blue (RGB) image and apply the you only look once (YOLO), a single-stage object detection network, to quickly capture the FiFP. Next, a lightweight classification network is designed to distinguish the legitimate from the illegitimate users. Finally, the experimental results show that the proposed EsaNet can effectively detect physical layer spoofing attacks and is robust in time-varying wireless environments.

Title: Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection. (arXiv:2307.08782v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08782
Code URL: null
Copy Paste: [[2307.08782] Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection](http://arxiv.org/abs/2307.08782) #security
Summary:
Exfiltration of data via email is a serious cybersecurity threat for many organizations. Detecting data exfiltration (anomaly) patterns typically requires labeling, most often done by a human annotator, to reduce the high number of false alarms. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are to be labeled, and there are uncertainties as to what scoring procedure should be used to prioritize cases for labeling, especially when detecting rare cases of interest is crucial. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the classifier benefits from a batch of representative and informative instances of both normal and anomalous examples, (2) unsupervised anomaly detection plays a useful role in building the classifier in the early stages of training when relatively little labeling has been done thus far. Our approach to AL for anomaly detection outperformed existing AL approaches on three highly unbalanced UCI benchmarks and on one real-world redacted email data set.

Title: Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War. (arXiv:2307.08840v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08840
Code URL: null
Copy Paste: [[2307.08840] Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War](http://arxiv.org/abs/2307.08840) #security
Summary:
Algorithmic and data-driven decisions and recommendations are commonly used in high-stakes decision-making settings such as criminal justice, medicine, and public policy. We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War, using outcomes measured immediately after its introduction in late 1969. This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making. First, before implementing a new algorithm, it is essential to characterize and control the risk of yielding worse outcomes than the existing algorithm. Second, the existing algorithm is deterministic, and learning a new algorithm requires transparent extrapolation. Third, the existing algorithm involves discrete decision tables that are common but difficult to optimize over.

To address these challenges, we introduce the Average Conditional Risk (ACRisk), which first quantifies the risk that a new algorithmic policy leads to worse outcomes for subgroups of individual units and then averages this over the distribution of subgroups. We also propose a Bayesian policy learning framework that maximizes the posterior expected value while controlling the posterior expected ACRisk. This framework separates the estimation of heterogeneous treatment effects from policy optimization, enabling flexible estimation of effects and optimization over complex policy classes. We characterize the resulting chance-constrained optimization problem as a constrained linear programming problem. Our analysis shows that compared to the actual algorithm used during the Vietnam War, the learned algorithm assesses most regions as more secure and emphasizes economic and political factors over military factors.

Title: Application of BERT in Wind Power Forecasting-Teletraan's Solution in Baidu KDD Cup 2022. (arXiv:2307.09248v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09248
Code URL: https://github.com/longxingtan/kdd2022-baidu
Copy Paste: [[2307.09248] Application of BERT in Wind Power Forecasting-Teletraan's Solution in Baidu KDD Cup 2022](http://arxiv.org/abs/2307.09248) #security
Summary:
Nowadays, wind energy has drawn increasing attention as its important role in carbon neutrality and sustainable development. When wind power is integrated into the power grid, precise forecasting is necessary for the sustainability and security of the system. However, the unpredictable nature and long sequence prediction make it especially challenging. In this technical report, we introduce the BERT model applied for Baidu KDD Cup 2022, and the daily fluctuation is added by post-processing to make the predicted results in line with daily periodicity. Our solution achieves 3rd place of 2490 teams. The code is released athttps://github.com/LongxingTan/KDD2022-Baidu

privacy

Title: A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning. (arXiv:2307.09218v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09218
Code URL: null
Copy Paste: [[2307.09218] A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning](http://arxiv.org/abs/2307.09218) #privacy
Summary:
Forgetting refers to the loss or deterioration of previously acquired information or knowledge. While the existing surveys on forgetting have primarily focused on continual learning, forgetting is a prevalent phenomenon observed in various other research domains within deep learning. Forgetting manifests in research fields such as generative models due to generator shifts, and federated learning due to heterogeneous data distributions across clients. Addressing forgetting encompasses several challenges, including balancing the retention of old task knowledge with fast learning of new tasks, managing task interference with conflicting goals, and preventing privacy leakage, etc. Moreover, most existing surveys on continual learning implicitly assume that forgetting is always harmful. In contrast, our survey argues that forgetting is a double-edged sword and can be beneficial and desirable in certain cases, such as privacy-preserving scenarios. By exploring forgetting in a broader context, we aim to present a more nuanced understanding of this phenomenon and highlight its potential advantages. Through this comprehensive survey, we aspire to uncover potential solutions by drawing upon ideas and approaches from various fields that have dealt with forgetting. By examining forgetting beyond its conventional boundaries, in future work, we hope to encourage the development of novel strategies for mitigating, harnessing, or even embracing forgetting in real applications. A comprehensive list of papers about forgetting in various research fields is available at \url{https://github.com/EnnengYang/Awesome-Forgetting-in-Deep-Learning}.

Title: Zone-Based Privacy-Preserving Billing for Local Energy Market Based on Multiparty Computation. (arXiv:2307.08778v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.08778
Code URL: null
Copy Paste: [[2307.08778] Zone-Based Privacy-Preserving Billing for Local Energy Market Based on Multiparty Computation](http://arxiv.org/abs/2307.08778) #privacy
Summary:
This paper proposes a zone-based privacy-preserving billing protocol for local energy markets that takes into account energy volume deviations of market participants from their bids. Our protocol incorporates participants' locations on the grid for splitting the deviations cost. The proposed billing model employs multiparty computation so that the accurate calculation of individual bills is performed in a decentralised and privacy-preserving manner. We also present a security analysis as well as performance evaluations for different security settings. The results show superiority of the honest-majority model to the dishonest majority in terms of computational efficiency. They also show that the billing can be executed for 5000 users in less than nine seconds in the online phase for all security settings, demonstrating its feasibility to be deployed in real local energy markets.

Title: Privacy-preserving patient clustering for personalized federated learning. (arXiv:2307.08847v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08847
Code URL: https://github.com/g2lab/pcfbl
Copy Paste: [[2307.08847] Privacy-preserving patient clustering for personalized federated learning](http://arxiv.org/abs/2307.08847) #privacy
Summary:
Federated Learning (FL) is a machine learning framework that enables multiple organizations to train a model without sharing their data with a central server. However, it experiences significant performance degradation if the data is non-identically independently distributed (non-IID). This is a problem in medical settings, where variations in the patient population contribute significantly to distribution differences across hospitals. Personalized FL addresses this issue by accounting for site-specific distribution differences. Clustered FL, a Personalized FL variant, was used to address this problem by clustering patients into groups across hospitals and training separate models on each group. However, privacy concerns remained as a challenge as the clustering process requires exchange of patient-level information. This was previously solved by forming clusters using aggregated data, which led to inaccurate groups and performance degradation. In this study, we propose Privacy-preserving Community-Based Federated machine Learning (PCBFL), a novel Clustered FL framework that can cluster patients using patient-level data while protecting privacy. PCBFL uses Secure Multiparty Computation, a cryptographic technique, to securely calculate patient-level similarity scores across hospitals. We then evaluate PCBFL by training a federated mortality prediction model using 20 sites from the eICU dataset. We compare the performance gain from PCBFL against traditional and existing Clustered FL frameworks. Our results show that PCBFL successfully forms clinically meaningful cohorts of low, medium, and high-risk patients. PCBFL outperforms traditional and existing Clustered FL frameworks with an average AUC improvement of 4.3% and AUPRC improvement of 7.8%.

protect

Title: On Borrowed Time -- Preventing Static Power Side-Channel Analysis. (arXiv:2307.09001v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09001
Code URL: null
Copy Paste: [[2307.09001] On Borrowed Time -- Preventing Static Power Side-Channel Analysis](http://arxiv.org/abs/2307.09001) #protect
Summary:
In recent years, static power side-channel analysis attacks have emerged as a serious threat to cryptographic implementations, overcoming state-of-the-art countermeasures against side-channel attacks. The continued down-scaling of semiconductor process technology, which results in an increase of the relative weight of static power in the total power budget of circuits, will only improve the viability of static power side-channel analysis attacks. Yet, despite the threat posed, limited work has been invested into mitigating this class of attack. In this work we address this gap. We observe that static power side-channel analysis relies on stopping the target circuit's clock over a prolonged period, during which the circuit holds secret information in its registers. We propose Borrowed Time, a countermeasure that hinders an attacker's ability to leverage such clock control. Borrowed Time detects a stopped clock and triggers a reset that wipes any registers containing sensitive intermediates, whose leakages would otherwise be exploitable. We demonstrate the effectiveness of our countermeasure by performing practical Correlation Power Analysis attacks under optimal conditions against an AES implementation on an FPGA target with and without our countermeasure in place. In the unprotected case, we can recover the entire secret key using traces from 1,500 encryptions. Under the same conditions, the protected implementation successfully prevents key recovery even with traces from 1,000,000 encryptions.

defense

Title: In Defense of Clip-based Video Relation Detection. (arXiv:2307.08984v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08984
Code URL: null
Copy Paste: [[2307.08984] In Defense of Clip-based Video Relation Detection](http://arxiv.org/abs/2307.08984) #defense
Summary:
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.

attack

Title: CBSeq: A Channel-level Behavior Sequence For Encrypted Malware Traffic Detection. (arXiv:2307.09002v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09002
Code URL: null
Copy Paste: [[2307.09002] CBSeq: A Channel-level Behavior Sequence For Encrypted Malware Traffic Detection](http://arxiv.org/abs/2307.09002) #attack
Summary:
Machine learning and neural networks have become increasingly popular solutions for encrypted malware traffic detection. They mine and learn complex traffic patterns, enabling detection by fitting boundaries between malware traffic and benign traffic. Compared with signature-based methods, they have higher scalability and flexibility. However, affected by the frequent variants and updates of malware, current methods suffer from a high false positive rate and do not work well for unknown malware traffic detection. It remains a critical task to achieve effective malware traffic detection. In this paper, we introduce CBSeq to address the above problems. CBSeq is a method that constructs a stable traffic representation, behavior sequence, to characterize attacking intent and achieve malware traffic detection. We novelly propose the channels with similar behavior as the detection object and extract side-channel content to construct behavior sequence. Unlike benign activities, the behavior sequences of malware and its variant's traffic exhibit solid internal correlations. Moreover, we design the MSFormer, a powerful Transformer-based multi-sequence fusion classifier. It captures the internal similarity of behavior sequence, thereby distinguishing malware traffic from benign traffic. Our evaluations demonstrate that CBSeq performs effectively in various known malware traffic detection and exhibits superior performance in unknown malware traffic detection, outperforming state-of-the-art methods.

Title: FedDefender: Client-Side Attack-Tolerant Federated Learning. (arXiv:2307.09048v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09048
Code URL: https://github.com/deu30303/feddefender
Copy Paste: [[2307.09048] FedDefender: Client-Side Attack-Tolerant Federated Learning](http://arxiv.org/abs/2307.09048) #attack
Summary:
Federated learning enables learning from decentralized data sources without compromising privacy, which makes it a crucial technique. However, it is vulnerable to model poisoning attacks, where malicious clients interfere with the training process. Previous defense mechanisms have focused on the server-side by using careful model aggregation, but this may not be effective when the data is not identically distributed or when attackers can access the information of benign clients. In this paper, we propose a new defense mechanism that focuses on the client-side, called FedDefender, to help benign clients train robust local models and avoid the adverse impact of malicious model updates from attackers, even when a server-side defense cannot identify or remove adversaries. Our method consists of two main components: (1) attack-tolerant local meta update and (2) attack-tolerant global knowledge distillation. These components are used to find noise-resilient model parameters while accurately extracting knowledge from a potentially corrupted global model. Our client-side defense strategy has a flexible structure and can work in conjunction with any existing server-side strategies. Evaluations of real-world scenarios across multiple datasets show that the proposed method enhances the robustness of federated learning against model poisoning attacks.

Title: Mitigating Intersection Attacks in Anonymous Microblogging. (arXiv:2307.09069v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09069
Code URL: null
Copy Paste: [[2307.09069] Mitigating Intersection Attacks in Anonymous Microblogging](http://arxiv.org/abs/2307.09069) #attack
Summary:
Anonymous microblogging systems are known to be vulnerable to intersection attacks due to network churn. An adversary that monitors all communications can leverage the churn to learn who is publishing what with increasing confidence over time. In this paper, we propose a protocol for mitigating intersection attacks in anonymous microblogging systems by grouping users into anonymity sets based on similarities in their publishing behavior. The protocol provides a configurable communication schedule for users in each set to manage the inevitable trade-off between latency and bandwidth overhead. In our evaluation, we use real-world datasets from two popular microblogging platforms, Twitter and Reddit, to simulate user publishing behavior. The results demonstrate that the protocol can protect users against intersection attacks at low bandwidth overhead when the users adhere to communication schedules. In addition, the protocol can sustain a slow degradation in the size of the anonymity set over time under various churn rates.

Title: The Hitchhiker's Guide to Malicious Third-Party Dependencies. (arXiv:2307.09087v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09087
Code URL: null
Copy Paste: [[2307.09087] The Hitchhiker's Guide to Malicious Third-Party Dependencies](http://arxiv.org/abs/2307.09087) #attack
Summary:
The increasing popularity of certain programming languages has spurred the creation of ecosystem-specific package repositories and package managers. Such repositories (e.g., NPM, PyPI) serve as public databases that users can query to retrieve packages for various functionalities, whereas package managers automatically handle dependency resolution and package installation on the client side. These mechanisms enhance software modularization and accelerate implementation. However, they have become a target for malicious actors seeking to propagate malware on a large scale.

In this work, we show how attackers can leverage capabilities of popular package managers and languages to achieve arbitrary code execution on victim machines, thereby realizing open-source software supply chain attacks. Based on the analysis of 7 ecosystems, we identify 3 install-time and 5 runtime techniques, and we provide recommendations describing how to reduce the risk when consuming third-party dependencies. We will provide proof-of-concepts that demonstrate the identified techniques. Furthermore, we describe evasion strategies employed by attackers to circumvent detection mechanisms.

Title: From Dragondoom to Dragonstar: Side-channel Attacks and Formally Verified Implementation of WPA3 Dragonfly Handshake. (arXiv:2307.09243v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.09243
Code URL: null
Copy Paste: [[2307.09243] From Dragondoom to Dragonstar: Side-channel Attacks and Formally Verified Implementation of WPA3 Dragonfly Handshake](http://arxiv.org/abs/2307.09243) #attack
Summary:
It is universally acknowledged that Wi-Fi communications are important to secure. Thus, the Wi-Fi Alliance published WPA3 in 2018 with a distinctive security feature: it leverages a Password-Authenticated Key Exchange (PAKE) protocol to protect users' passwords from offline dictionary attacks. Unfortunately, soon after its release, several attacks were reported against its implementations, in response to which the protocol was updated in a best-effort manner.

In this paper, we show that the proposed mitigations are not enough, especially for a complex protocol to implement even for savvy developers. Indeed, we present **Dragondoom**, a collection of side-channel vulnerabilities of varying strength allowing attackers to recover users' passwords in widely deployed Wi-Fi daemons, such as hostap in its default settings. Our findings target both password conversion methods, namely the default probabilistic hunting-and-pecking and its newly standardized deterministic alternative based on SSWU. We successfully exploit our leakage in practice through microarchitectural mechanisms, and overcome the limited spatial resolution of Flush+Reload. Our attacks outperform previous works in terms of required measurements.

Then, driven by the need to end the spiral of patch-and-hack in Dragonfly implementations, we propose **Dragonstar**, an implementation of Dragonfly leveraging a formally verified implementation of the underlying mathematical operations, thereby removing all the related leakage vector. Our implementation relies on HACL*, a formally verified crypto library guaranteeing secret-independence. We design Dragonstar, so that its integration within hostap requires minimal modifications to the existing project. Our experiments show that the performance of HACL*-based hostap is comparable to OpenSSL-based, implying that Dragonstar is both efficient and proved to be leakage-free.

Title: DeepMem: ML Models as storage channels and their (mis-)applications. (arXiv:2307.08811v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08811
Code URL: null
Copy Paste: [[2307.08811] DeepMem: ML Models as storage channels and their (mis-)applications](http://arxiv.org/abs/2307.08811) #attack
Summary:
Machine learning (ML) models are overparameterized to support generality and avoid overfitting. Prior works have shown that these additional parameters can be used for both malicious (e.g., hiding a model covertly within a trained model) and beneficial purposes (e.g., watermarking a model). In this paper, we propose a novel information theoretic perspective of the problem; we consider the ML model as a storage channel with a capacity that increases with overparameterization. Specifically, we consider a sender that embeds arbitrary information in the model at training time, which can be extracted by a receiver with a black-box access to the deployed model. We derive an upper bound on the capacity of the channel based on the number of available parameters. We then explore black-box write and read primitives that allow the attacker to: (i) store data in an optimized way within the model by augmenting the training data at the transmitter side, and (ii) to read it by querying the model after it is deployed. We also analyze the detectability of the writing primitive and consider a new version of the problem which takes information storage covertness into account. Specifically, to obtain storage covertness, we introduce a new constraint such that the data augmentation used for the write primitives minimizes the distribution shift with the initial (baseline task) distribution. This constraint introduces a level of "interference" with the initial task, thereby limiting the channel's effective capacity. Therefore, we develop optimizations to improve the capacity in this case, including a novel ML-specific substitution based error correction protocol. We believe that the proposed modeling of the problem offers new tools to better understand and mitigate potential vulnerabilities of ML, especially in the context of increasingly large models.

robust

Title: Revisiting Scene Text Recognition: A Data Perspective. (arXiv:2307.08723v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08723
Code URL: null
Copy Paste: [[2307.08723] Revisiting Scene Text Recognition: A Data Perspective](http://arxiv.org/abs/2307.08723) #robust
Summary:
This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used benchmarks in STR and observe a trend of performance saturation, whereby only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, however, we argue that this is primarily due to the less challenging nature of the common benchmarks, thus concealing the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models can only achieve an average accuracy of 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR model in real-world scenarios and leads to state-of-the-art performance.

Title: LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving. (arXiv:2307.08850v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08850
Code URL: null
Copy Paste: [[2307.08850] LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving](http://arxiv.org/abs/2307.08850) #robust
Summary:
LiDAR is crucial for robust 3D scene perception in autonomous driving. LiDAR perception has the largest body of literature after camera perception. However, multi-task learning across tasks like detection, segmentation, and motion estimation using LiDAR remains relatively unexplored, especially on automotive-grade embedded platforms. We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantics, and motion segmentation. The unified architecture comprises a shared encoder and task-specific decoders, enabling joint representation learning. We propose a novel Semantic Weighting and Guidance (SWAG) module to transfer semantic features for improved object detection selectively. Our heterogeneous training scheme combines diverse datasets and exploits complementary cues between tasks. The work provides the first embedded implementation unifying these key perception tasks from LiDAR point clouds achieving 3ms latency on the embedded NVIDIA Xavier platform. We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection. By maximizing hardware efficiency and leveraging multi-task synergies, our method delivers an accurate and efficient solution tailored for real-world automated driving deployment. Qualitative results can be seen at https://youtu.be/H-hWRzv2lIY.

Title: EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps. (arXiv:2307.08991v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08991
Code URL: null
Copy Paste: [[2307.08991] EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps](http://arxiv.org/abs/2307.08991) #robust
Summary:
Accurate and reliable ego-localization is critical for autonomous driving. In this paper, we present EgoVM, an end-to-end localization network that achieves comparable localization accuracy to prior state-of-the-art methods, but uses lightweight vectorized maps instead of heavy point-based maps. To begin with, we extract BEV features from online multi-view images and LiDAR point cloud. Then, we employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation, to make their feature representation consistent with BEV features. After that, we feed map queries, composed of learnable semantic embeddings and coordinates of map elements, into a transformer decoder to perform cross-modality matching with BEV features. Finally, we adopt a robust histogram-based pose solver to estimate the optimal pose by searching exhaustively over candidate poses. We comprehensively validate the effectiveness of our method using both the nuScenes dataset and a newly collected dataset. The experimental results show that our method achieves centimeter-level localization accuracy, and outperforms existing methods using vectorized maps by a large margin. Furthermore, our model has been extensively tested in a large fleet of autonomous vehicles under various challenging urban scenes.

Title: TractCloud: Registration-free tractography parcellation with a novel local-global streamline point cloud representation. (arXiv:2307.09000v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09000
Code URL: null
Copy Paste: [[2307.09000] TractCloud: Registration-free tractography parcellation with a novel local-global streamline point cloud representation](http://arxiv.org/abs/2307.09000) #robust
Summary:
Diffusion MRI tractography parcellation classifies streamlines into anatomical fiber tracts to enable quantification and visualization for clinical and scientific applications. Current tractography parcellation methods rely heavily on registration, but registration inaccuracies can affect parcellation and the computational cost of registration is high for large-scale datasets. Recently, deep-learning-based methods have been proposed for tractography parcellation using various types of representations for streamlines. However, these methods only focus on the information from a single streamline, ignoring geometric relationships between the streamlines in the brain. We propose TractCloud, a registration-free framework that performs whole-brain tractography parcellation directly in individual subject space. We propose a novel, learnable, local-global streamline representation that leverages information from neighboring and whole-brain streamlines to describe the local anatomy and global pose of the brain. We train our framework on a large-scale labeled tractography dataset, which we augment by applying synthetic transforms including rotation, scaling, and translations. We test our framework on five independently acquired datasets across populations and health conditions. TractCloud significantly outperforms several state-of-the-art methods on all testing datasets. TractCloud achieves efficient and consistent whole-brain white matter parcellation across the lifespan (from neonates to elderly subjects, including brain tumor patients) without the need for registration. The robustness and high inference speed of TractCloud make it suitable for large-scale tractography data analysis. Our project page is available at https://tractcloud.github.io/.

Title: PottsMGNet: A Mathematical Explanation of Encoder-Decoder Based Neural Networks. (arXiv:2307.09039v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09039
Code URL: null
Copy Paste: [[2307.09039] PottsMGNet: A Mathematical Explanation of Encoder-Decoder Based Neural Networks](http://arxiv.org/abs/2307.09039) #robust
Summary:
For problems in image processing and many other fields, a large class of effective neural networks has encoder-decoder-based architectures. Although these networks have made impressive performances, mathematical explanations of their architectures are still underdeveloped. In this paper, we study the encoder-decoder-based network architecture from the algorithmic perspective and provide a mathematical explanation. We use the two-phase Potts model for image segmentation as an example for our explanations. We associate the segmentation problem with a control problem in the continuous setting. Then, multigrid method and operator splitting scheme, the PottsMGNet, are used to discretize the continuous control model. We show that the resulting discrete PottsMGNet is equivalent to an encoder-decoder-based network. With minor modifications, it is shown that a number of the popular encoder-decoder-based neural networks are just instances of the proposed PottsMGNet. By incorporating the Soft-Threshold-Dynamics into the PottsMGNet as a regularizer, the PottsMGNet has shown to be robust with the network parameters such as network width and depth and achieved remarkable performance on datasets with very large noise. In nearly all our experiments, the new network always performs better or as good on accuracy and dice score than existing networks for image segmentation.

Title: Pixel-wise Graph Attention Networks for Person Re-identification. (arXiv:2307.09183v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09183
Code URL: https://github.com/wenyu1009/pganet
Copy Paste: [[2307.09183] Pixel-wise Graph Attention Networks for Person Re-identification](http://arxiv.org/abs/2307.09183) #robust
Summary:
Graph convolutional networks (GCN) is widely used to handle irregular data since it updates node features by using the structure information of graph. With the help of iterated GCN, high-order information can be obtained to further enhance the representation of nodes. However, how to apply GCN to structured data (such as pictures) has not been deeply studied. In this paper, we explore the application of graph attention networks (GAT) in image feature extraction. First of all, we propose a novel graph generation algorithm to convert images into graphs through matrix transformation. It is one magnitude faster than the algorithm based on K Nearest Neighbors (KNN). Then, GAT is used on the generated graph to update the node features. Thus, a more robust representation is obtained. These two steps are combined into a module called pixel-wise graph attention module (PGA). Since the graph obtained by our graph generation algorithm can still be transformed into a picture after processing, PGA can be well combined with CNN. Based on these two modules, we consulted the ResNet and design a pixel-wise graph attention network (PGANet). The PGANet is applied to the task of person re-identification in the datasets Market1501, DukeMTMC-reID and Occluded-DukeMTMC (outperforms state-of-the-art by 0.8\%, 1.1\% and 11\% respectively, in mAP scores). Experiment results show that it achieves the state-of-the-art performance. \href{https://github.com/wenyu1009/PGANet}{The code is available here}.

Title: Generation of High Spatial Resolution Terrestrial Surface from Low Spatial Resolution Elevation Contour Maps via Hierarchical Computation of Median Elevation Regions. (arXiv:2307.09239v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09239
Code URL: null
Copy Paste: [[2307.09239] Generation of High Spatial Resolution Terrestrial Surface from Low Spatial Resolution Elevation Contour Maps via Hierarchical Computation of Median Elevation Regions](http://arxiv.org/abs/2307.09239) #robust
Summary:
We proposed a simple yet effective morphological approach to convert a sparse Digital Elevation Model (DEM) to a dense Digital Elevation Model. The conversion is similar to that of the generation of high-resolution DEM from its low-resolution DEM. The approach involves the generation of median contours to achieve the purpose. It is a sequential step of the I) decomposition of the existing sparse Contour map into the maximum possible Threshold Elevation Region (TERs). II) Computing all possible non-negative and non-weighted Median Elevation Region (MER) hierarchically between the successive TER decomposed from a sparse contour map. III) Computing the gradient of all TER, and MER computed from previous steps would yield the predicted intermediate elevation contour at a higher spatial resolution. We presented this approach initially with some self-made synthetic data to show how the contour prediction works and then experimented with the available contour map of Washington, NH to justify its usefulness. This approach considers the geometric information of existing contours and interpolates the elevation contour at a new spatial region of a topographic surface until no elevation contours are necessary to generate. This novel approach is also very low-cost and robust as it uses elevation contours.

Title: SphereNet: Learning a Noise-Robust and General Descriptor for Point Cloud Registration. (arXiv:2307.09351v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09351
Code URL: null
Copy Paste: [[2307.09351] SphereNet: Learning a Noise-Robust and General Descriptor for Point Cloud Registration](http://arxiv.org/abs/2307.09351) #robust
Summary:
Point cloud registration is to estimate a transformation to align point clouds collected in different perspectives. In learning-based point cloud registration, a robust descriptor is vital for high-accuracy registration. However, most methods are susceptible to noise and have poor generalization ability on unseen datasets. Motivated by this, we introduce SphereNet to learn a noise-robust and unseen-general descriptor for point cloud registration. In our method, first, the spheroid generator builds a geometric domain based on spherical voxelization to encode initial features. Then, the spherical interpolation of the sphere is introduced to realize robustness against noise. Finally, a new spherical convolutional neural network with spherical integrity padding completes the extraction of descriptors, which reduces the loss of features and fully captures the geometric features. To evaluate our methods, a new benchmark 3DMatch-noise with strong noise is introduced. Extensive experiments are carried out on both indoor and outdoor datasets. Under high-intensity noise, SphereNet increases the feature matching recall by more than 25 percentage points on 3DMatch-noise. In addition, it sets a new state-of-the-art performance for the 3DMatch and 3DLoMatch benchmarks with 93.5\% and 75.6\% registration recall and also has the best generalization ability on unseen datasets.

Title: An Evaluation of Zero-Cost Proxies -- from Neural Architecture Performance to Model Robustness. (arXiv:2307.09365v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09365
Code URL: null
Copy Paste: [[2307.09365] An Evaluation of Zero-Cost Proxies -- from Neural Architecture Performance to Model Robustness](http://arxiv.org/abs/2307.09365) #robust
Summary:
Zero-cost proxies are nowadays frequently studied and used to search for neural architectures. They show an impressive ability to predict the performance of architectures by making use of their untrained weights. These techniques allow for immense search speed-ups. So far the joint search for well-performing and robust architectures has received much less attention in the field of NAS. Therefore, the main focus of zero-cost proxies is the clean accuracy of architectures, whereas the model robustness should play an evenly important part. In this paper, we analyze the ability of common zero-cost proxies to serve as performance predictors for robustness in the popular NAS-Bench-201 search space. We are interested in the single prediction task for robustness and the joint multi-objective of clean and robust accuracy. We further analyze the feature importance of the proxies and show that predicting the robustness makes the prediction task from existing zero-cost proxies more challenging. As a result, the joint consideration of several proxies becomes necessary to predict a model's robustness while the clean accuracy can be regressed from a single such feature.

Title: A comparative analysis of SR-GAN models. (arXiv:2307.09456v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09456
Code URL: null
Copy Paste: [[2307.09456] A comparative analysis of SR-GAN models](http://arxiv.org/abs/2307.09456) #robust
Summary:
In this study, we evaluate the performance of multiple state-of-the-art SR GAN (Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN and EDSR, on a benchmark dataset of real-world images which undergo degradation using a pipeline. Our results show that some models seem to significantly increase the resolution of the input images while preserving their visual quality, this is assessed using Tesseract OCR engine. We observe that EDSR-BASE model from huggingface outperforms the remaining candidate models in terms of both quantitative metrics and subjective visual quality assessments with least compute overhead. Specifically, EDSR generates images with higher peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and are seen to return high quality OCR results with Tesseract OCR engine. These findings suggest that EDSR is a robust and effective approach for single-image super-resolution and may be particularly well-suited for applications where high-quality visual fidelity is critical and optimized compute.

Title: AnyDoor: Zero-shot Object-level Image Customization. (arXiv:2307.09481v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09481
Code URL: null
Copy Paste: [[2307.09481] AnyDoor: Zero-shot Object-level Image Customization](http://arxiv.org/abs/2307.09481) #robust
Summary:
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/.

Title: Discretization-based ensemble model for robust learning in IoT. (arXiv:2307.08955v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08955
Code URL: null
Copy Paste: [[2307.08955] Discretization-based ensemble model for robust learning in IoT](http://arxiv.org/abs/2307.08955) #robust
Summary:
IoT device identification is the process of recognizing and verifying connected IoT devices to the network. This is an essential process for ensuring that only authorized devices can access the network, and it is necessary for network management and maintenance. In recent years, machine learning models have been used widely for automating the process of identifying devices in the network. However, these models are vulnerable to adversarial attacks that can compromise their accuracy and effectiveness. To better secure device identification models, discretization techniques enable reduction in the sensitivity of machine learning models to adversarial attacks contributing to the stability and reliability of the model. On the other hand, Ensemble methods combine multiple heterogeneous models to reduce the impact of remaining noise or errors in the model. Therefore, in this paper, we integrate discretization techniques and ensemble methods and examine it on model robustness against adversarial attacks. In other words, we propose a discretization-based ensemble stacking technique to improve the security of our ML models. We evaluate the performance of different ML-based IoT device identification models against white box and black box attacks using a real-world dataset comprised of network traffic from 28 IoT devices. We demonstrate that the proposed method enables robustness to the models for IoT device identification.

Title: Intuitionistic Fuzzy Broad Learning System: Enhancing Robustness Against Noise and Outliers. (arXiv:2307.08713v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08713
Code URL: null
Copy Paste: [[2307.08713] Intuitionistic Fuzzy Broad Learning System: Enhancing Robustness Against Noise and Outliers](http://arxiv.org/abs/2307.08713) #robust
Summary:
In the realm of data classification, broad learning system (BLS) has proven to be a potent tool that utilizes a layer-by-layer feed-forward neural network. It consists of feature learning and enhancement segments, working together to extract intricate features from input data. The traditional BLS treats all samples as equally significant, which makes it less robust and less effective for real-world datasets with noises and outliers. To address this issue, we propose the fuzzy BLS (F-BLS) model, which assigns a fuzzy membership value to each training point to reduce the influence of noises and outliers. In assigning the membership value, the F-BLS model solely considers the distance from samples to the class center in the original feature space without incorporating the extent of non-belongingness to a class. We further propose a novel BLS based on intuitionistic fuzzy theory (IF-BLS). The proposed IF-BLS utilizes intuitionistic fuzzy numbers based on fuzzy membership and non-membership values to assign scores to training points in the high-dimensional feature space by using a kernel function. We evaluate the performance of proposed F-BLS and IF-BLS models on 44 UCI benchmark datasets across diverse domains. Furthermore, Gaussian noise is added to some UCI datasets to assess the robustness of the proposed F-BLS and IF-BLS models. Experimental results demonstrate superior generalization performance of the proposed F-BLS and IF-BLS models compared to baseline models, both with and without Gaussian noise. Additionally, we implement the proposed F-BLS and IF-BLS models on the Alzheimers Disease Neuroimaging Initiative (ADNI) dataset, and promising results showcase the models effectiveness in real-world applications. The proposed methods offer a promising solution to enhance the BLS frameworks ability to handle noise and outliers.

Title: Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation. (arXiv:2307.08875v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08875
Code URL: null
Copy Paste: [[2307.08875] Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation](http://arxiv.org/abs/2307.08875) #robust
Summary:
We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but are no longer tractable when the number of states scales up. To this end, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.

biometric

steal

extraction

Title: Occlusion Aware Student Emotion Recognition based on Facial Action Unit Detection. (arXiv:2307.09465v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09465
Code URL: null
Copy Paste: [[2307.09465] Occlusion Aware Student Emotion Recognition based on Facial Action Unit Detection](http://arxiv.org/abs/2307.09465) #extraction
Summary:
Given that approximately half of science, technology, engineering, and mathematics (STEM) undergraduate students in U.S. colleges and universities leave by the end of the first year [15], it is crucial to improve the quality of classroom environments. This study focuses on monitoring students' emotions in the classroom as an indicator of their engagement and proposes an approach to address this issue. The impact of different facial parts on the performance of an emotional recognition model is evaluated through experimentation. To test the proposed model under partial occlusion, an artificially occluded dataset is introduced. The novelty of this work lies in the proposal of an occlusion-aware architecture for facial action units (AUs) extraction, which employs attention mechanism and adaptive feature learning. The AUs can be used later to classify facial expressions in classroom settings.

This research paper's findings provide valuable insights into handling occlusion in analyzing facial images for emotional engagement analysis. The proposed experiments demonstrate the significance of considering occlusion and enhancing the reliability of facial analysis models in classroom environments. These findings can also be extended to other settings where occlusions are prevalent.

Title: Improving Text Semantic Similarity Modeling through a 3D Siamese Network. (arXiv:2307.09274v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09274
Code URL: null
Copy Paste: [[2307.09274] Improving Text Semantic Similarity Modeling through a 3D Siamese Network](http://arxiv.org/abs/2307.09274) #extraction
Summary:
Siamese networks have gained popularity as a method for modeling text semantic similarity. Traditional methods rely on pooling operation to compress the semantic representations from Transformer blocks in encoding, resulting in two-dimensional semantic vectors and the loss of hierarchical semantic information from Transformer blocks. Moreover, this limited structure of semantic vectors is akin to a flattened landscape, which restricts the methods that can be applied in downstream modeling, as they can only navigate this flat terrain. To address this issue, we propose a novel 3D Siamese network for text semantic similarity modeling, which maps semantic information to a higher-dimensional space. The three-dimensional semantic tensors not only retains more precise spatial and feature domain information but also provides the necessary structural condition for comprehensive downstream modeling strategies to capture them. Leveraging this structural advantage, we introduce several modules to reinforce this 3D framework, focusing on three aspects: feature extraction, attention, and feature fusion. Our extensive experiments on four text semantic similarity benchmarks demonstrate the effectiveness and efficiency of our 3D Siamese Network.

membership infer

federate

Title: Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels. (arXiv:2307.08809v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08809
Code URL: null
Copy Paste: [[2307.08809] Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels](http://arxiv.org/abs/2307.08809) #federate
Summary:
Many existing FL methods assume clients with fully-labeled data, while in realistic settings, clients have limited labels due to the expensive and laborious process of labeling. Limited labeled local data of the clients often leads to their local model having poor generalization abilities to their larger unlabeled local data, such as having class-distribution mismatch with the unlabeled data. As a result, clients may instead look to benefit from the global model trained across clients to leverage their unlabeled data, but this also becomes difficult due to data heterogeneity across clients. In our work, we propose FedLabel where clients selectively choose the local or global model to pseudo-label their unlabeled data depending on which is more of an expert of the data. We further utilize both the local and global models' knowledge via global-local consistency regularization which minimizes the divergence between the two models' outputs when they have identical pseudo-labels for the unlabeled data. Unlike other semi-supervised FL baselines, our method does not require additional experts other than the local or global model, nor require additional parameters to be communicated. We also do not assume any server-labeled data or fully labeled clients. For both cross-device and cross-silo settings, we show that FedLabel outperforms other semi-supervised FL baselines by $8$-$24\%$, and even outperforms standard fully supervised FL baselines ($100\%$ labeled data) with only $5$-$20\%$ of labeled data.

Title: Federated Large Language Model: A Position Paper. (arXiv:2307.08925v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08925
Code URL: null
Copy Paste: [[2307.08925] Federated Large Language Model: A Position Paper](http://arxiv.org/abs/2307.08925) #federate
Summary:
Large scale language models (LLM) have received significant attention and found diverse applications across various domains, but their development encounters challenges in real-world scenarios. These challenges arise due to the scarcity of public domain data availability and the need to maintain privacy with respect to private domain data. To address these issues, federated learning (FL) has emerged as a promising technology that enables collaborative training of shared models while preserving decentralized data. We propose the concept of federated LLM, which comprises three key components, i.e., federated LLM pre-training, federated LLM fine-tuning, and federated LLM prompt engineering. For each component, we discuss its advantage over traditional LLM training methods and propose specific engineering strategies for implementation. Furthermore, we explore the novel challenges introduced by the integration of FL and LLM. We analyze existing solutions and identify potential obstacles faced by these solutions within the context of federated LLM.

Title: A Federated learning model for Electric Energy management using Blockchain Technology. (arXiv:2307.09080v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09080
Code URL: null
Copy Paste: [[2307.09080] A Federated learning model for Electric Energy management using Blockchain Technology](http://arxiv.org/abs/2307.09080) #federate
Summary:
Energy shortfall and electricity load shedding are the main problems for developing countries. The main causes are lack of management in the energy sector and the use of non-renewable energy sources. The improved energy management and use of renewable sources can be significant to resolve energy crisis. It is necessary to increase the use of renewable energy sources (RESs) to meet the increasing energy demand due to high prices of fossil-fuel based energy. Federated learning (FL) is the most emerging technique in the field of artificial intelligence. Federated learning helps to generate global model at server side by ensemble locally trained models at remote edges sites while preserving data privacy. The global model used to predict energy demand to satisfy the needs of consumers. In this article, we have proposed Blockchain based safe distributed ledger technology for transaction of data between prosumer and consumer to ensure their transparency, traceability and security. Furthermore, we have also proposed a Federated learning model to forecast the energy requirements of consumer and prosumer. Moreover, Blockchain has been used to store excess energy data from prosumer for better management of energy between prosumer and grid. Lastly, the experiment results revealed that renewable energy sources have produced better and comparable results to other non-renewable energy resources.

Title: Federated Learning for Computationally-Constrained Heterogeneous Devices: A Survey. (arXiv:2307.09182v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09182
Code URL: null
Copy Paste: [[2307.09182] Federated Learning for Computationally-Constrained Heterogeneous Devices: A Survey](http://arxiv.org/abs/2307.09182) #federate
Summary:
With an increasing number of smart devices like internet of things (IoT) devices deployed in the field, offloadingtraining of neural networks (NNs) to a central server becomes more and more infeasible. Recent efforts toimprove users' privacy have led to on-device learning emerging as an alternative. However, a model trainedonly on a single device, using only local data, is unlikely to reach a high accuracy. Federated learning (FL)has been introduced as a solution, offering a privacy-preserving trade-off between communication overheadand model accuracy by sharing knowledge between devices but disclosing the devices' private data. Theapplicability and the benefit of applying baseline FL are, however, limited in many relevant use cases dueto the heterogeneity present in such environments. In this survey, we outline the heterogeneity challengesFL has to overcome to be widely applicable in real-world applications. We especially focus on the aspect ofcomputation heterogeneity among the participating devices and provide a comprehensive overview of recentworks on heterogeneity-aware FL. We discuss two groups: works that adapt the NN architecture and worksthat approach heterogeneity on a system level, covering Federated Averaging (FedAvg), distillation, and splitlearning-based approaches, as well as synchronous and asynchronous aggregation schemes.

fair

Title: Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation. (arXiv:2307.08779v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08779
Code URL: null
Copy Paste: [[2307.08779] Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation](http://arxiv.org/abs/2307.08779) #fair
Summary:
Low-light conditions not only hamper human visual experience but also degrade the model's performance on downstream vision tasks. While existing works make remarkable progress on day-night domain adaptation, they rely heavily on domain knowledge derived from the task-specific nighttime dataset. This paper challenges a more complicated scenario with border applicability, i.e., zero-shot day-night domain adaptation, which eliminates reliance on any nighttime data. Unlike prior zero-shot adaptation approaches emphasizing either image-level translation or model-level adaptation, we propose a similarity min-max paradigm that considers them under a unified framework. On the image level, we darken images towards minimum feature similarity to enlarge the domain gap. Then on the model level, we maximize the feature similarity between the darkened images and their normal-light counterparts for better model adaptation. To the best of our knowledge, this work represents the pioneering effort in jointly optimizing both aspects, resulting in a significant improvement of model generalizability. Extensive experiments demonstrate our method's effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Code and pre-trained models are available at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.

Title: Mitigating Label Bias via Decoupled Confident Learning. (arXiv:2307.08945v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08945
Code URL: null
Copy Paste: [[2307.08945] Mitigating Label Bias via Decoupled Confident Learning](http://arxiv.org/abs/2307.08945) #fair
Summary:
Growing concerns regarding algorithmic fairness have led to a surge in methodologies to mitigate algorithmic bias. However, such methodologies largely assume that observed labels in training data are correct. This is problematic because bias in labels is pervasive across important domains, including healthcare, hiring, and content moderation. In particular, human-generated labels are prone to encoding societal biases. While the presence of labeling bias has been discussed conceptually, there is a lack of methodologies to address this problem. We propose a pruning method -- Decoupled Confident Learning (DeCoLe) -- specifically designed to mitigate label bias. After illustrating its performance on a synthetic dataset, we apply DeCoLe in the context of hate speech detection, where label bias has been recognized as an important challenge, and show that it successfully identifies biased labels and outperforms competing approaches.

Title: Certifying the Fairness of KNN in the Presence of Dataset Bias. (arXiv:2307.08722v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08722
Code URL: null
Copy Paste: [[2307.08722] Certifying the Fairness of KNN in the Presence of Dataset Bias](http://arxiv.org/abs/2307.08722) #fair
Summary:
We propose a method for certifying the fairness of the classification result of a widely used supervised learning algorithm, the k-nearest neighbors (KNN), under the assumption that the training data may have historical bias caused by systematic mislabeling of samples from a protected minority group. To the best of our knowledge, this is the first certification method for KNN based on three variants of the fairness definition: individual fairness, $\epsilon$-fairness, and label-flipping fairness. We first define the fairness certification problem for KNN and then propose sound approximations of the complex arithmetic computations used in the state-of-the-art KNN algorithm. This is meant to lift the computation results from the concrete domain to an abstract domain, to reduce the computational cost. We show effectiveness of this abstract interpretation based technique through experimental evaluation on six datasets widely used in the fairness research literature. We also show that the method is accurate enough to obtain fairness certifications for a large number of test inputs, despite the presence of historical bias in the datasets.

Title: Oracle Efficient Online Multicalibration and Omniprediction. (arXiv:2307.08999v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08999
Code URL: null
Copy Paste: [[2307.08999] Oracle Efficient Online Multicalibration and Omniprediction](http://arxiv.org/abs/2307.08999) #fair
Summary:
A recent line of work has shown a surprising connection between multicalibration, a multi-group fairness notion, and omniprediction, a learning paradigm that provides simultaneous loss minimization guarantees for a large family of loss functions. Prior work studies omniprediction in the batch setting. We initiate the study of omniprediction in the online adversarial setting. Although there exist algorithms for obtaining notions of multicalibration in the online adversarial setting, unlike batch algorithms, they work only for small finite classes of benchmark functions $F$, because they require enumerating every function $f \in F$ at every round. In contrast, omniprediction is most interesting for learning theoretic hypothesis classes $F$, which are generally continuously large.

We develop a new online multicalibration algorithm that is well defined for infinite benchmark classes $F$, and is oracle efficient (i.e. for any class $F$, the algorithm has the form of an efficient reduction to a no-regret learning algorithm for $F$). The result is the first efficient online omnipredictor -- an oracle efficient prediction algorithm that can be used to simultaneously obtain no regret guarantees to all Lipschitz convex loss functions. For the class $F$ of linear functions, we show how to make our algorithm efficient in the worst case. Also, we show upper and lower bounds on the extent to which our rates can be improved: our oracle efficient algorithm actually promises a stronger guarantee called swap-omniprediction, and we prove a lower bound showing that obtaining $O(\sqrt{T})$ bounds for swap-omniprediction is impossible in the online setting. On the other hand, we give a (non-oracle efficient) algorithm which can obtain the optimal $O(\sqrt{T})$ omniprediction bounds without going through multicalibration, giving an information theoretic separation between these two solution concepts.

interpretability

Title: Modular Neural Network Approaches for Surgical Image Recognition. (arXiv:2307.08880v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08880
Code URL: null
Copy Paste: [[2307.08880] Modular Neural Network Approaches for Surgical Image Recognition](http://arxiv.org/abs/2307.08880) #interpretability
Summary:
Deep learning-based applications have seen a lot of success in recent years. Text, audio, image, and video have all been explored with great success using deep learning approaches. The use of convolutional neural networks (CNN) in computer vision, in particular, has yielded reliable results. In order to achieve these results, a large amount of data is required. However, the dataset cannot always be accessible. Moreover, annotating data can be difficult and time-consuming. Self-training is a semi-supervised approach that managed to alleviate this problem and achieve state-of-the-art performances. Theoretical analysis even proved that it may result in a better generalization than a normal classifier. Another problem neural networks can face is the increasing complexity of modern problems, requiring a high computational and storage cost. One way to mitigate this issue, a strategy that has been inspired by human cognition known as modular learning, can be employed. The principle of the approach is to decompose a complex problem into simpler sub-tasks. This approach has several advantages, including faster learning, better generalization, and enables interpretability.

In the first part of this paper, we introduce and evaluate different architectures of modular learning for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification. Our experiments have shown that modular learning improves performances compared to non-modular systems. Moreover, we found that weighted modular, that is to weight the output using the probabilities from the gating module, achieved an almost perfect classification. In the second part, we present our approach for data labeling and segmentation with self-training applied on shoulder arthroscopy images.

Title: An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient. (arXiv:2307.08873v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08873
Code URL: null
Copy Paste: [[2307.08873] An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient](http://arxiv.org/abs/2307.08873) #interpretability
Summary:
Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindering of policy learning, and propose to use an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined, shows that our algorithm can mitigate the limitations of variance-based risk measures and achieves high return with low risk in terms of variance and Gini deviation when others fail to learn a reasonable policy.

Title: Knowledge-infused Deep Learning Enables Interpretable Landslide Forecasting. (arXiv:2307.08951v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08951
Code URL: null
Copy Paste: [[2307.08951] Knowledge-infused Deep Learning Enables Interpretable Landslide Forecasting](http://arxiv.org/abs/2307.08951) #interpretability
Summary:
Forecasting how landslides will evolve over time or whether they will fail is a challenging task due to a variety of factors, both internal and external. Despite their considerable potential to address these challenges, deep learning techniques lack interpretability, undermining the credibility of the forecasts they produce. The recent development of transformer-based deep learning offers untapped possibilities for forecasting landslides with unprecedented interpretability and nonlinear feature learning capabilities. Here, we present a deep learning pipeline that is capable of predicting landslide behavior holistically, which employs a transformer-based network called LFIT to learn complex nonlinear relationships from prior knowledge and multiple source data, identifying the most relevant variables, and demonstrating a comprehensive understanding of landslide evolution and temporal patterns. By integrating prior knowledge, we provide improvement in holistic landslide forecasting, enabling us to capture diverse responses to various influencing factors in different local landslide areas. Using deformation observations as proxies for measuring the kinetics of landslides, we validate our approach by training models to forecast reservoir landslides in the Three Gorges Reservoir and creeping landslides on the Tibetan Plateau. When prior knowledge is incorporated, we show that interpretable landslide forecasting effectively identifies influential factors across various landslides. It further elucidates how local areas respond to these factors, making landslide behavior and trends more interpretable and predictable. The findings from this study will contribute to understanding landslide behavior in a new way and make the proposed approach applicable to other complex disasters influenced by internal and external factors in the future.

Title: Neural Network Pruning as Spectrum Preserving Process. (arXiv:2307.08982v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08982
Code URL: null
Copy Paste: [[2307.08982] Neural Network Pruning as Spectrum Preserving Process](http://arxiv.org/abs/2307.08982) #interpretability
Summary:
Neural networks have achieved remarkable performance in various application domains. Nevertheless, a large number of weights in pre-trained deep neural networks prohibit them from being deployed on smartphones and embedded systems. It is highly desirable to obtain lightweight versions of neural networks for inference in edge devices. Many cost-effective approaches were proposed to prune dense and convolutional layers that are common in deep neural networks and dominant in the parameter space. However, a unified theoretical foundation for the problem mostly is missing. In this paper, we identify the close connection between matrix spectrum learning and neural network training for dense and convolutional layers and argue that weight pruning is essentially a matrix sparsification process to preserve the spectrum. Based on the analysis, we also propose a matrix sparsification algorithm tailored for neural network pruning that yields better pruning result. We carefully design and conduct experiments to support our arguments. Hence we provide a consolidated viewpoint for neural network pruning and enhance the interpretability of deep neural networks by identifying and preserving the critical neural weights.

Title: Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. (arXiv:2307.09458v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09458
Code URL: null
Copy Paste: [[2307.09458] Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla](http://arxiv.org/abs/2307.09458) #interpretability
Summary:
\emph{Circuit analysis} is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer \emph{label} given knowledge of the correct answer \emph{text}. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of `output nodes' (attention heads and MLPs).

We further study the `correct letter' category of attention heads aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answers, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels for multiple-choice questions, and we show that the query and key subspaces represent an `Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of `correct letter' heads on multiple choice question answering.

explainability

Title: R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut. (arXiv:2307.09050v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09050
Code URL: null
Copy Paste: [[2307.09050] R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut](http://arxiv.org/abs/2307.09050) #explainability
Summary:
Transformer-based models have gained popularity in the field of natural language processing (NLP) and are extensively utilized in computer vision tasks and multi-modal models such as GPT4. This paper presents a novel method to enhance the explainability of Transformer-based image classification models. Our method aims to improve trust in classification results and empower users to gain a deeper understanding of the model for downstream tasks by providing visualizations of class-specific maps. We introduce two modules: the Relationship Weighted Out" and theCut" modules. The Relationship Weighted Out" module focuses on extracting class-specific information from intermediate layers, enabling us to highlight relevant features. Additionally, theCut" module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color. By integrating these modules, we generate dense class-specific visual explainability maps. We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset. Furthermore, we conduct a large number of experiments on the LRN dataset, specifically designed for automatic driving danger alerts, to evaluate the explainability of our method in complex backgrounds. The results demonstrate a significant improvement over previous methods. Moreover, we conduct ablation experiments to validate the effectiveness of each module. Through these experiments, we are able to confirm the respective contributions of each module, thus solidifying the overall effectiveness of our proposed approach.

watermark

diffusion

Title: Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond. (arXiv:2307.08996v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08996
Code URL: null
Copy Paste: [[2307.08996] Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond](http://arxiv.org/abs/2307.08996) #diffusion
Summary:
An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portrait. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose $\textbf{IDM}$, an $\textbf{I}$teratively learned face restoration system based on denoising $\textbf{D}$iffusion $\textbf{M}$odels (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models.

Title: Augmenting CLIP with Improved Visio-Linguistic Reasoning. (arXiv:2307.09233v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09233
Code URL: null
Copy Paste: [[2307.09233] Augmenting CLIP with Improved Visio-Linguistic Reasoning](http://arxiv.org/abs/2307.09233) #diffusion
Summary:
Image-text contrastive models such as CLIP are useful for a variety of downstream applications including zero-shot classification, image-text retrieval and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground with performance equivalent to random chance. In our paper, we address this issue and propose a sample-efficient light-weight method called SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable-Diffusion which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, while on the ARO dataset, our method improves the visio-linguistic performance by upto 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that the zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.

Title: DiTTO: Diffusion-inspired Temporal Transformer Operator. (arXiv:2307.09072v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09072
Code URL: null
Copy Paste: [[2307.09072] DiTTO: Diffusion-inspired Temporal Transformer Operator](http://arxiv.org/abs/2307.09072) #diffusion
Summary:
Solving partial differential equations (PDEs) using a data-driven approach has become increasingly common. The recent development of the operator learning paradigm has enabled the solution of a broader range of PDE-related problems. We propose an operator learning method to solve time-dependent PDEs continuously in time without needing any temporal discretization. The proposed approach, named DiTTO, is inspired by latent diffusion models. While diffusion models are usually used in generative artificial intelligence tasks, their time-conditioning mechanism is extremely useful for PDEs. The diffusion-inspired framework is combined with elements from the Transformer architecture to improve its capabilities.

We demonstrate the effectiveness of the new approach on a wide variety of PDEs in multiple dimensions, namely the 1-D Burgers' equation, 2-D Navier-Stokes equations, and the acoustic wave equation in 2-D and 3-D. DiTTO achieves state-of-the-art results in terms of accuracy for these problems. We also present a method to improve the performance of DiTTO by using fast sampling concepts from diffusion models. Finally, we show that DiTTO can accurately perform zero-shot super-resolution in time.

noise learning

data-free

transformer

Title: DARTS: Double Attention Reference-based Transformer for Super-resolution. (arXiv:2307.08837v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08837
Code URL: null
Copy Paste: [[2307.08837] DARTS: Double Attention Reference-based Transformer for Super-resolution](http://arxiv.org/abs/2307.08837) #transformer
Summary:
We present DARTS, a transformer model for reference-based image super-resolution. DARTS learns joint representations of two image distributions to enhance the content of low-resolution input images through matching correspondences learned from high-resolution reference images. Current state-of-the-art techniques in reference-based image super-resolution are based on a multi-network, multi-stage architecture. In this work, we adapt the double attention block from the GAN literature, processing the two visual streams separately and combining self-attention and cross-attention blocks through a gating attention strategy. Our work demonstrates how the attention mechanism can be adapted for the particular requirements of reference-based image super-resolution, significantly simplifying the architecture and training pipeline. We show that our transformer-based model performs competitively with state-of-the-art models, while maintaining a simpler overall architecture and training process. In particular, we obtain state-of-the-art on the SUN80 dataset, with a PSNR/SSIM of 29.83 / .809. These results show that attention alone is sufficient for the RSR task, without multiple purpose-built subnetworks, knowledge distillation, or multi-stage training.

Title: Human Action Recognition in Still Images Using ConViT. (arXiv:2307.08994v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08994
Code URL: null
Copy Paste: [[2307.08994] Human Action Recognition in Still Images Using ConViT](http://arxiv.org/abs/2307.08994) #transformer
Summary:
Understanding the relationship between different parts of the image plays a crucial role in many visual recognition tasks. Despite the fact that Convolutional Neural Networks (CNNs) have demonstrated impressive results in detecting single objects, they lack the capability to extract the relationship between various regions of an image, which is a crucial factor in human action recognition. To address this problem, this paper proposes a new module that functions like a convolutional layer using Vision Transformer (ViT). The proposed action recognition model comprises two components: the first part is a deep convolutional network that extracts high-level spatial features from the image, and the second component of the model utilizes a Vision Transformer that extracts the relationship between various regions of the image using the feature map generated by the CNN output. The proposed model has been evaluated on the Stanford40 and PASCAL VOC 2012 action datasets and has achieved 95.5% mAP and 91.5% mAP results, respectively, which are promising compared to other state-of-the-art methods.

Title: U-shaped Transformer: Retain High Frequency Context in Time Series Analysis. (arXiv:2307.09019v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09019
Code URL: null
Copy Paste: [[2307.09019] U-shaped Transformer: Retain High Frequency Context in Time Series Analysis](http://arxiv.org/abs/2307.09019) #transformer
Summary:
Time series prediction plays a crucial role in various industrial fields. In recent years, neural networks with a transformer backbone have achieved remarkable success in many domains, including computer vision and NLP. In time series analysis domain, some studies have suggested that even the simplest MLP networks outperform advanced transformer-based networks on time series forecast tasks. However, we believe these findings indicate there to be low-rank properties in time series sequences. In this paper, we consider the low-pass characteristics of transformers and try to incorporate the advantages of MLP. We adopt skip-layer connections inspired by Unet into traditional transformer backbone, thus preserving high-frequency context from input to output, namely U-shaped Transformer. We introduce patch merge and split operation to extract features with different scales and use larger datasets to fully make use of the transformer backbone. Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.

Title: NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF. (arXiv:2307.09112v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09112
Code URL: null
Copy Paste: [[2307.09112] NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF](http://arxiv.org/abs/2307.09112) #transformer
Summary:
Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC is the current state-of-the-art method in this field, which achieves unprecedented success by combining vision Transformers with large-scale training. However, we identified two key limitations of MCC: 1) The Transformer decoder is inefficient in handling large number of query points; 2) The 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy of input visual features, allowing each query point to only attend to a small neighborhood. This design not only results in much faster inference speed but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs that suffer from holes in results, our proposed Repulsive UDF can achieve more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. Particularly, it outperforms MCC by 9.7% in terms of the F1-score on the CO3D-v2 dataset with more than 5x faster running speed.

Title: Light-Weight Vision Transformer with Parallel Local and Global Self-Attention. (arXiv:2307.09120v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09120
Code URL: null
Copy Paste: [[2307.09120] Light-Weight Vision Transformer with Parallel Local and Global Self-Attention](http://arxiv.org/abs/2307.09120) #transformer
Summary:
While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time-performance. Their computational complexity and memory requirements limits their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT to a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-off between parameter count to runtime as well as parameter count to accuracy. With only 5 million parameters, we achieve 79.5$\%$ top-1 accuracy on the ImageNet-1K classification benchmark. Our networks demonstrate great performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments, demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.

Title: Fusing Hand and Body Skeletons for Human Action Recognition in Assembly. (arXiv:2307.09238v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09238
Code URL: null
Copy Paste: [[2307.09238] Fusing Hand and Body Skeletons for Human Action Recognition in Assembly](http://arxiv.org/abs/2307.09238) #transformer
Summary:
As collaborative robots (cobots) continue to gain popularity in industrial manufacturing, effective human-robot collaboration becomes crucial. Cobots should be able to recognize human actions to assist with assembly tasks and act autonomously. To achieve this, skeleton-based approaches are often used due to their ability to generalize across various people and environments. Although body skeleton approaches are widely used for action recognition, they may not be accurate enough for assembly actions where the worker's fingers and hands play a significant role. To address this limitation, we propose a method in which less detailed body skeletons are combined with highly detailed hand skeletons. We investigate CNNs and transformers, the latter of which are particularly adept at extracting and combining important information from both skeleton types using attention. This paper demonstrates the effectiveness of our proposed approach in enhancing action recognition in assembly scenarios.

Title: RepViT: Revisiting Mobile CNN From ViT Perspective. (arXiv:2307.09283v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09283
Code URL: https://github.com/jameslahm/RepViT
Copy Paste: [[2307.09283] RepViT: Revisiting Mobile CNN From ViT Perspective](http://arxiv.org/abs/2307.09283) #transformer
Summary:
Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with nearly 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M3, obtains 81.4\% accuracy with only 1.3ms latency. The code and trained models are available at \url{https://github.com/jameslahm/RepViT}.

Title: Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving. (arXiv:2307.09329v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09329
Code URL: https://github.com/kaavyarekanar/towards-a-performance-analysis-on-pre-trained-vqa-models-for-autonomous-driving
Copy Paste: [[2307.09329] Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving](http://arxiv.org/abs/2307.09329) #transformer
Summary:
This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.

Title: MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments. (arXiv:2307.09361v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09361
Code URL: null
Copy Paste: [[2307.09361] MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments](http://arxiv.org/abs/2307.09361) #transformer
Summary:
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods.

Title: LEST: Large-scale LiDAR Semantic Segmentation with Transformer. (arXiv:2307.09367v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09367
Code URL: null
Copy Paste: [[2307.09367] LEST: Large-scale LiDAR Semantic Segmentation with Transformer](http://arxiv.org/abs/2307.09367) #transformer
Summary:
Large-scale LiDAR-based point cloud semantic segmentation is a critical task in autonomous driving perception. Almost all of the previous state-of-the-art LiDAR semantic segmentation methods are variants of sparse 3D convolution. Although the Transformer architecture is becoming popular in the field of natural language processing and 2D computer vision, its application to large-scale point cloud semantic segmentation is still limited. In this paper, we propose a LiDAR sEmantic Segmentation architecture with pure Transformer, LEST. LEST comprises two novel components: a Space Filling Curve (SFC) Grouping strategy and a Distance-based Cosine Linear Transformer, DISCO. On the public nuScenes semantic segmentation validation set and SemanticKITTI test set, our model outperforms all the other state-of-the-art methods.

Title: Attention over pre-trained Sentence Embeddings for Long Document Classification. (arXiv:2307.09084v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09084
Code URL: null
Copy Paste: [[2307.09084] Attention over pre-trained Sentence Embeddings for Long Document Classification](http://arxiv.org/abs/2307.09084) #transformer
Summary:
Despite being the current de-facto models in most NLP tasks, transformers are often limited to short sequences due to their quadratic attention complexity on the number of tokens. Several attempts to address this issue were studied, either by reducing the cost of the self-attention computation or by modeling smaller sequences and combining them through a recurrence mechanism or using a new transformer model. In this paper, we suggest to take advantage of pre-trained sentence transformers to start from semantically meaningful embeddings of the individual sentences, and then combine them through a small attention layer that scales linearly with the document length. We report the results obtained by this simple architecture on three standard document classification datasets. When compared with the current state-of-the-art models using standard fine-tuning, the studied method obtains competitive results (even if there is no clear best model in this configuration). We also showcase that the studied architecture obtains better results when freezing the underlying transformers. A configuration that is useful when we need to avoid complete fine-tuning (e.g. when the same frozen transformer is shared by different applications). Finally, two additional experiments are provided to further evaluate the relevancy of the studied architecture over simpler baselines.

Title: UniTabE: Pretraining a Unified Tabular Encoder for Heterogeneous Tabular Data. (arXiv:2307.09249v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09249
Code URL: null
Copy Paste: [[2307.09249] UniTabE: Pretraining a Unified Tabular Encoder for Heterogeneous Tabular Data](http://arxiv.org/abs/2307.09249) #transformer
Summary:
Recent advancements in Natural Language Processing (NLP) have witnessed the groundbreaking impact of pretrained models, yielding impressive outcomes across various tasks. This study seeks to extend the power of pretraining methodologies to tabular data, a domain traditionally overlooked, yet inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work revolve around the adaptation to heterogeneous table structures, the establishment of a universal pretraining protocol for tabular data, the generalizability and transferability of learned knowledge across tasks, the adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a pioneering method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures. UniTabE's core concept relies on representing each basic table element with a module, termed TabUnit. This is subsequently followed by a Transformer encoder to refine the representation. Moreover, our model is designed to facilitate pretraining and finetuning through the utilization of free-form prompts. In order to implement the pretraining phase, we curated an expansive tabular dataset comprising approximately 13 billion samples, meticulously gathered from the Kaggle platform. Rigorous experimental testing and analyses were performed under a myriad of scenarios to validate the effectiveness of our methodology. The experimental results demonstrate UniTabE's superior performance against several baseline models across a multitude of benchmark datasets. This, therefore, underscores UniTabE's potential to significantly enhance the semantic representation of tabular data, thereby marking a significant stride in the field of tabular data analysis.

Title: Text vectorization via transformer-based language models and n-gram perplexities. (arXiv:2307.09255v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09255
Code URL: null
Copy Paste: [[2307.09255] Text vectorization via transformer-based language models and n-gram perplexities](http://arxiv.org/abs/2307.09255) #transformer
Summary:
As the probability (and thus perplexity) of a text is calculated based on the product of the probabilities of individual tokens, it may happen that one unlikely token significantly reduces the probability (i.e., increase the perplexity) of some otherwise highly probable input, while potentially representing a simple typographical error. Also, given that perplexity is a scalar value that refers to the entire input, information about the probability distribution within it is lost in the calculation (a relatively good text that has one unlikely token and another text in which each token is equally likely they can have the same perplexity value), especially for longer texts. As an alternative to scalar perplexity this research proposes a simple algorithm used to calculate vector values based on n-gram perplexities within the input. Such representations consider the previously mentioned aspects, and instead of a unique value, the relative perplexity of each text token is calculated, and these values are combined into a single vector representing the input.

Title: Linearized Relative Positional Encoding. (arXiv:2307.09270v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09270
Code URL: null
Copy Paste: [[2307.09270] Linearized Relative Positional Encoding](http://arxiv.org/abs/2307.09270) #transformer
Summary:
Relative positional encoding is widely used in vanilla and linear transformers to represent positional information. However, existing encoding methods of a vanilla transformer are not always directly applicable to a linear transformer, because the latter requires a decomposition of the query and key representations into separate kernel functions. Nevertheless, principles for designing encoding methods suitable for linear transformers remain understudied. In this work, we put together a variety of existing linear relative positional encoding approaches under a canonical form and further propose a family of linear relative positional encoding algorithms via unitary transformation. Our formulation leads to a principled framework that can be used to develop new relative positional encoding methods that preserve linear space-time complexity. Equipped with different models, the proposed linearized relative positional encoding (LRPE) family derives effective encoding for various applications. Experiments show that compared with existing methods, LRPE achieves state-of-the-art performance in language modeling, text classification, and image classification. Meanwhile, it emphasizes a general paradigm for designing broadly more relative positional encoding methods that are applicable to linear transformers. The code is available at https://github.com/OpenNLPLab/Lrpe.

Title: Multi-Modal Discussion Transformer: Integrating Text, Images and Graph Transformers to Detect Hate Speech on Social Media. (arXiv:2307.09312v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09312
Code URL: https://github.com/liamhebert/multimodaldiscussiontransformer
Copy Paste: [[2307.09312] Multi-Modal Discussion Transformer: Integrating Text, Images and Graph Transformers to Detect Hate Speech on Social Media](http://arxiv.org/abs/2307.09312) #transformer
Summary:
We present the Multi-Modal Discussion Transformer (mDT), a novel multi-modal graph-based transformer model for detecting hate speech in online social networks. In contrast to traditional text-only methods, our approach to labelling a comment as hate speech centers around the holistic analysis of text and images. This is done by leveraging graph transformers to capture the contextual relationships in the entire discussion that surrounds a comment, with interwoven fusion layers to combine text and image embeddings instead of processing different modalities separately. We compare the performance of our model to baselines that only process text; we also conduct extensive ablation studies. We conclude with future work for multimodal solutions to deliver social value in online contexts, arguing that capturing a holistic view of a conversation greatly advances the effort to detect anti-social behavior.

Title: Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers. (arXiv:2307.09455v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09455
Code URL: null
Copy Paste: [[2307.09455] Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers](http://arxiv.org/abs/2307.09455) #transformer
Summary:
For real-world language applications, detecting an out-of-distribution (OOD) sample is helpful to alert users or reject such unreliable samples. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models suffer from OOD samples with a similar semantic representation to ID samples since these OOD samples lie near the ID manifold. A rejection network can be trained with ID and diverse outlier samples to detect test OOD samples, but explicitly collecting auxiliary OOD datasets brings an additional burden for data collection. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD sample introduced by POE shows a similar representation to ID data, which is most effective in training a rejection network. Our method does not require any external OOD data and can be easily implemented within off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE's competitiveness on several text classification benchmarks.

generative

Title: Harnessing the Power of AI based Image Generation Model DALLE 2 in Agricultural Settings. (arXiv:2307.08789v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08789
Code URL: null
Copy Paste: [[2307.08789] Harnessing the Power of AI based Image Generation Model DALLE 2 in Agricultural Settings](http://arxiv.org/abs/2307.08789) #generative
Summary:
This study investigates the potential impact of artificial intelligence (AI) on the enhancement of visualization processes in the agricultural sector, using the advanced AI image generator, DALLE 2, developed by OpenAI. By synergistically utilizing the natural language processing proficiency of chatGPT and the generative prowess of the DALLE 2 model, which employs a Generative Adversarial Networks (GANs) framework, our research offers an innovative method to transform textual descriptors into realistic visual content. Our rigorously assembled datasets include a broad spectrum of agricultural elements such as fruits, plants, and scenarios differentiating crops from weeds, maintained for AI-generated versus original images. The quality and accuracy of the AI-generated images were evaluated via established metrics including mean squared error (MSE), peak signal-to-noise ratio (PSNR), and feature similarity index (FSIM). The results underline the significant role of the DALLE 2 model in enhancing visualization processes in agriculture, aiding in more informed decision-making, and improving resource distribution. The outcomes of this research highlight the imminent rise of an AI-led transformation in the realm of precision agriculture.

Title: Arbitrary point cloud upsampling via Dual Back-Projection Network. (arXiv:2307.08992v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08992
Code URL: null
Copy Paste: [[2307.08992] Arbitrary point cloud upsampling via Dual Back-Projection Network](http://arxiv.org/abs/2307.08992) #generative
Summary:
Point clouds acquired from 3D sensors are usually sparse and noisy. Point cloud upsampling is an approach to increase the density of the point cloud so that detailed geometric information can be restored. In this paper, we propose a Dual Back-Projection network for point cloud upsampling (DBPnet). A Dual Back-Projection is formulated in an up-down-up manner for point cloud upsampling. It not only back projects feature residues but also coordinates residues so that the network better captures the point correlations in the feature and space domains, achieving lower reconstruction errors on both uniform and non-uniform sparse point clouds. Our proposed method is also generalizable for arbitrary upsampling tasks (e.g. 4x, 5.5x). Experimental results show that the proposed method achieves the lowest point set matching losses with respect to the benchmark. In addition, the success of our approach demonstrates that generative networks are not necessarily needed for non-uniform point clouds.

Title: Face-PAST: Facial Pose Awareness and Style Transfer Networks. (arXiv:2307.09020v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09020
Code URL: null
Copy Paste: [[2307.09020] Face-PAST: Facial Pose Awareness and Style Transfer Networks](http://arxiv.org/abs/2307.09020) #generative
Summary:
Facial style transfer has been quite popular among researchers due to the rise of emerging technologies such as eXtended Reality (XR), Metaverse, and Non-Fungible Tokens (NFTs). Furthermore, StyleGAN methods along with transfer-learning strategies have reduced the problem of limited data to some extent. However, most of the StyleGAN methods overfit the styles while adding artifacts to facial images. In this paper, we propose a facial pose awareness and style transfer (Face-PAST) network that preserves facial details and structures while generating high-quality stylized images. Dual StyleGAN inspires our work, but in contrast, our work uses a pre-trained style generation network in an external style pass with a residual modulation block instead of a transform coding block. Furthermore, we use the gated mapping unit and facial structure, identity, and segmentation losses to preserve the facial structure and details. This enables us to train the network with a very limited amount of data while generating high-quality stylized images. Our training process adapts curriculum learning strategy to perform efficient and flexible style mixing in the generative space. We perform extensive experiments to show the superiority of Face-PAST in comparison to existing state-of-the-art methods.

Title: Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation. (arXiv:2307.09416v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09416
Code URL: null
Copy Paste: [[2307.09416] Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation](http://arxiv.org/abs/2307.09416) #generative
Summary:
Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content and the adherence with the prompted requests: so far, only human-based evaluations have been adopted for quality satisfaction and for comparing different generative methods. We introduce a novel automated method for Visual Concept Evaluation (ViCE), i.e. to assess consistency between a generated/edited image and the corresponding prompt/instructions, with a process inspired by the human cognitive behaviour. ViCE combines the strengths of Large Language Models (LLMs) and Visual Question Answering (VQA) into a unified pipeline, aiming to replicate the human cognitive process in quality assessment. This method outlines visual concepts, formulates image-specific verification questions, utilizes the Q&A system to investigate the image, and scores the combined outcome. Although this brave new hypothesis of mimicking humans in the image evaluation process is in its preliminary assessment stage, results are promising and open the door to a new form of automatic evaluation which could have significant impact as the image generation or the image target editing tasks become more and more sophisticated.

Title: PAC Neural Prediction Set Learning to Quantify the Uncertainty of Generative Language Models. (arXiv:2307.09254v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09254
Code URL: null
Copy Paste: [[2307.09254] PAC Neural Prediction Set Learning to Quantify the Uncertainty of Generative Language Models](http://arxiv.org/abs/2307.09254) #generative
Summary:
Uncertainty learning and quantification of models are crucial tasks to enhance the trustworthiness of the models. Importantly, the recent surge of generative language models (GLMs) emphasizes the need for reliable uncertainty quantification due to the concerns on generating hallucinated facts. In this paper, we propose to learn neural prediction set models that comes with the probably approximately correct (PAC) guarantee for quantifying the uncertainty of GLMs. Unlike existing prediction set models, which are parameterized by a scalar value, we propose to parameterize prediction sets via neural networks, which achieves more precise uncertainty quantification but still satisfies the PAC guarantee. We demonstrate the efficacy of our method on four types of language datasets and six types of models by showing that our method improves the quantified uncertainty by $63\%$ on average, compared to a standard baseline method.

Title: Disentangling Node Attributes from Graph Topology for Improved Generalizability in Link Prediction. (arXiv:2307.08877v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08877
Code URL: https://github.com/chatterjeeayan/upna
Copy Paste: [[2307.08877] Disentangling Node Attributes from Graph Topology for Improved Generalizability in Link Prediction](http://arxiv.org/abs/2307.08877) #generative
Summary:
Link prediction is a crucial task in graph machine learning with diverse applications. We explore the interplay between node attributes and graph topology and demonstrate that incorporating pre-trained node attributes improves the generalization power of link prediction models. Our proposed method, UPNA (Unsupervised Pre-training of Node Attributes), solves the inductive link prediction problem by learning a function that takes a pair of node attributes and predicts the probability of an edge, as opposed to Graph Neural Networks (GNN), which can be prone to topological shortcuts in graphs with power-law degree distribution. In this manner, UPNA learns a significant part of the latent graph generation mechanism since the learned function can be used to add incoming nodes to a growing graph. By leveraging pre-trained node attributes, we overcome observational bias and make meaningful predictions about unobserved nodes, surpassing state-of-the-art performance (3X to 34X improvement on benchmark datasets). UPNA can be applied to various pairwise learning tasks and integrated with existing link prediction models to enhance their generalizability and bolster graph generative models.

large language model

Title: ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning. (arXiv:2307.09474v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09474
Code URL: null
Copy Paste: [[2307.09474] ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning](http://arxiv.org/abs/2307.09474) #large language model
Summary:
Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, leading to the limitation of the interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generating. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot's promising performance.

Title: Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages. (arXiv:2307.08714v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.08714
Code URL: null
Copy Paste: [[2307.08714] Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages](http://arxiv.org/abs/2307.08714) #large language model
Summary:
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. Our approach relies on both knowledge distillation and consistency training. The modeling framework leverages knowledge from a large language model (XLMRoBERTa) pre-trained on the source language, with a student-teacher relationship (knowledge distillation). The student model incorporates unsupervised consistency training (with KL divergence loss) on the low-resource target language.

We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information, and focus on exhibiting the transfer of knowledge from English to Arabic. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic. We show that our modeling approach, while efficient, performs best overall when compared to state-of-the-art approaches like DistilBERT pre-trained on the target language or a supervised model directly trained on labeled data in the target language.

Our experiments show that it is enough to learn to recognize entities in English to reach reasonable performance in a low-resource language in the presence of a few labeled samples of semi-structured data. The proposed framework has implications for developing multi-lingual applications, especially in geographies where digital endeavors rely on both English and one or more low-resource language(s), sometimes mixed with English or employed singly.

Title: Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge. (arXiv:2307.08813v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.08813
Code URL: null
Copy Paste: [[2307.08813] Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge](http://arxiv.org/abs/2307.08813) #large language model
Summary:
Understanding protein interactions and pathway knowledge is crucial for unraveling the complexities of living systems and investigating the underlying mechanisms of biological functions and complex diseases. While existing databases provide curated biological data from literature and other sources, they are often incomplete and their maintenance is labor-intensive, necessitating alternative approaches. In this study, we propose to harness the capabilities of large language models to address these issues by automatically extracting such knowledge from the relevant scientific literature. Toward this goal, in this work, we investigate the effectiveness of different large language models in tasks that involve recognizing protein interactions, pathways, and gene regulatory relations. We thoroughly evaluate the performance of various models, highlight the significant findings, and discuss both the future opportunities and the remaining challenges associated with this approach. The code and data are available at: https://github.com/boxorange/BioIE-LLM

Title: Large Language Models Perform Diagnostic Reasoning. (arXiv:2307.08922v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.08922
Code URL: null
Copy Paste: [[2307.08922] Large Language Models Perform Diagnostic Reasoning](http://arxiv.org/abs/2307.08922) #large language model
Summary:
We explore the extension of chain-of-thought (CoT) prompting to medical reasoning for the task of automatic diagnosis. Motivated by doctors' underlying reasoning process, we present Diagnostic-Reasoning CoT (DR-CoT). Empirical results demonstrate that by simply prompting large language models trained only on general text corpus with two DR-CoT exemplars, the diagnostic accuracy improves by 15% comparing to standard prompting. Moreover, the gap reaches a pronounced 18% in out-domain settings. Our findings suggest expert-knowledge reasoning in large language models can be elicited through proper promptings.

Title: On the (In)Effectiveness of Large Language Models for Chinese Text Correction. (arXiv:2307.09007v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09007
Code URL: null
Copy Paste: [[2307.09007] On the (In)Effectiveness of Large Language Models for Chinese Text Correction](http://arxiv.org/abs/2307.09007) #large language model
Summary:
Recently, the development and progress of Large Language Models (LLMs) have amazed the entire Artificial Intelligence community. As an outstanding representative of LLMs and the foundation model that set off this wave of research on LLMs, ChatGPT has attracted more and more researchers to study its capabilities and performance on various downstream Natural Language Processing (NLP) tasks. While marveling at ChatGPT's incredible performance on kinds of tasks, we notice that ChatGPT also has excellent multilingual processing capabilities, such as Chinese. To explore the Chinese processing ability of ChatGPT, we focus on Chinese Text Correction, a fundamental and challenging Chinese NLP task. Specifically, we evaluate ChatGPT on the Chinese Grammatical Error Correction (CGEC) and Chinese Spelling Check (CSC) tasks, which are two main Chinese Text Correction scenarios. From extensive analyses and comparisons with previous state-of-the-art fine-tuned models, we empirically find that the ChatGPT currently has both amazing performance and unsatisfactory behavior for Chinese Text Correction. We believe our findings will promote the landing and application of LLMs in the Chinese NLP community.

Title: How is ChatGPT's behavior changing over time?. (arXiv:2307.09009v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09009
Code URL: https://github.com/lchen001/llmdrift
Copy Paste: [[2307.09009] How is ChatGPT's behavior changing over time?](http://arxiv.org/abs/2307.09009) #large language model
Summary:
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings shows that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.

Title: Towards a Neural Era in Dialogue Management for Collaboration: A Literature Survey. (arXiv:2307.09021v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09021
Code URL: null
Copy Paste: [[2307.09021] Towards a Neural Era in Dialogue Management for Collaboration: A Literature Survey](http://arxiv.org/abs/2307.09021) #large language model
Summary:
Dialogue-based human-AI collaboration can revolutionize collaborative problem-solving, creative exploration, and social support. To realize this goal, the development of automated agents proficient in skills such as negotiating, following instructions, establishing common ground, and progressing shared tasks is essential. This survey begins by reviewing the evolution of dialogue management paradigms in collaborative dialogue systems, from traditional handcrafted and information-state based methods to AI planning-inspired approaches. It then shifts focus to contemporary data-driven dialogue management techniques, which seek to transfer deep learning successes from form-filling and open-domain settings to collaborative contexts. The paper proceeds to analyze a selected set of recent works that apply neural approaches to collaborative dialogue management, spotlighting prevailing trends in the field. This survey hopes to provide foundational background for future advancements in collaborative dialogue management, particularly as the dialogue systems community continues to embrace the potential of large language models.

Title: Unveiling Gender Bias in Terms of Profession Across LLMs: Analyzing and Addressing Sociological Implications. (arXiv:2307.09162v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09162
Code URL: null
Copy Paste: [[2307.09162] Unveiling Gender Bias in Terms of Profession Across LLMs: Analyzing and Addressing Sociological Implications](http://arxiv.org/abs/2307.09162) #large language model
Summary:
Gender bias in artificial intelligence (AI) and natural language processing has garnered significant attention due to its potential impact on societal perceptions and biases. This research paper aims to analyze gender bias in Large Language Models (LLMs) with a focus on multiple comparisons between GPT-2 and GPT-3.5, some prominent language models, to better understand its implications. Through a comprehensive literature review, the study examines existing research on gender bias in AI language models and identifies gaps in the current knowledge. The methodology involves collecting and preprocessing data from GPT-2 and GPT-3.5, and employing in-depth quantitative analysis techniques to evaluate gender bias in the generated text. The findings shed light on gendered word associations, language usage, and biased narratives present in the outputs of these Large Language Models. The discussion explores the ethical implications of gender bias and its potential consequences on social perceptions and marginalized communities. Additionally, the paper presents strategies for reducing gender bias in LLMs, including algorithmic approaches and data augmentation techniques. The research highlights the importance of interdisciplinary collaborations and the role of sociological studies in mitigating gender bias in AI models. By addressing these issues, we can pave the way for more inclusive and unbiased AI systems that have a positive impact on society.

Title: Llama 2: Open Foundation and Fine-Tuned Chat Models. (arXiv:2307.09288v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.09288
Code URL: https://github.com/facebookresearch/llama
Copy Paste: [[2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models](http://arxiv.org/abs/2307.09288) #large language model
Summary:
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

Title: Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. (arXiv:2307.08715v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.08715
Code URL: null
Copy Paste: [[2307.08715] Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots](http://arxiv.org/abs/2307.08715) #large language model
Summary:
Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) services due to their exceptional proficiency in understanding and generating human-like text. LLM chatbots, in particular, have seen widespread adoption, transforming human-machine interactions. However, these LLM chatbots are susceptible to "jailbreak" attacks, where malicious users manipulate prompts to elicit inappropriate or sensitive responses, contravening service policies. Despite existing attempts to mitigate such threats, our research reveals a substantial gap in our understanding of these vulnerabilities, largely due to the undisclosed defensive measures implemented by LLM service providers.

In this paper, we present Jailbreaker, a comprehensive framework that offers an in-depth understanding of jailbreak attacks and countermeasures. Our work makes a dual contribution. First, we propose an innovative methodology inspired by time-based SQL injection techniques to reverse-engineer the defensive strategies of prominent LLM chatbots, such as ChatGPT, Bard, and Bing Chat. This time-sensitive approach uncovers intricate details about these services' defenses, facilitating a proof-of-concept attack that successfully bypasses their mechanisms. Second, we introduce an automatic generation method for jailbreak prompts. Leveraging a fine-tuned LLM, we validate the potential of automated jailbreak generation across various commercial LLM chatbots. Our method achieves a promising average success rate of 21.58%, significantly outperforming the effectiveness of existing techniques. We have responsibly disclosed our findings to the concerned service providers, underscoring the urgent need for more robust defenses. Jailbreaker thus marks a significant step towards understanding and mitigating jailbreak threats in the realm of LLM chatbots.

segmentation

Title: Semantic Counting from Self-Collages. (arXiv:2307.08727v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08727
Code URL: https://github.com/lukasknobel/selfcollages
Copy Paste: [[2307.08727] Semantic Counting from Self-Collages](http://arxiv.org/abs/2307.08727) #segmentation
Summary:
While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose Unsupervised Counter (UnCo), a model that can learn this task without requiring any manual annotations. To this end, we construct "SelfCollages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate the ability to count objects without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN, but also matches the performance of supervised counting models in some domains.

Title: EVIL: Evidential Inference Learning for Trustworthy Semi-supervised Medical Image Segmentation. (arXiv:2307.08988v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08988
Code URL: null
Copy Paste: [[2307.08988] EVIL: Evidential Inference Learning for Trustworthy Semi-supervised Medical Image Segmentation](http://arxiv.org/abs/2307.08988) #segmentation
Summary:
Recently, uncertainty-aware methods have attracted increasing attention in semi-supervised medical image segmentation. However, current methods usually suffer from the drawback that it is difficult to balance the computational cost, estimation accuracy, and theoretical support in a unified framework. To alleviate this problem, we introduce the Dempster-Shafer Theory of Evidence (DST) into semi-supervised medical image segmentation, dubbed Evidential Inference Learning (EVIL). EVIL provides a theoretically guaranteed solution to infer accurate uncertainty quantification in a single forward pass. Trustworthy pseudo labels on unlabeled data are generated after uncertainty estimation. The recently proposed consistency regularization-based training paradigm is adopted in our framework, which enforces the consistency on the perturbed predictions to enhance the generalization with few labeled data. Experimental results show that EVIL achieves competitive performance in comparison with several state-of-the-art methods on the public dataset.

Title: Online Self-Supervised Thermal Water Segmentation for Aerial Vehicles. (arXiv:2307.09027v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09027
Code URL: https://github.com/connorlee77/uav-thermal-water-segmentation
Copy Paste: [[2307.09027] Online Self-Supervised Thermal Water Segmentation for Aerial Vehicles](http://arxiv.org/abs/2307.09027) #segmentation
Summary:
We present a new method to adapt an RGB-trained water segmentation network to target-domain aerial thermal imagery using online self-supervision by leveraging texture and motion cues as supervisory signals. This new thermal capability enables current autonomous aerial robots operating in near-shore environments to perform tasks such as visual navigation, bathymetry, and flow tracking at night. Our method overcomes the problem of scarce and difficult-to-obtain near-shore thermal data that prevents the application of conventional supervised and unsupervised methods. In this work, we curate the first aerial thermal near-shore dataset, show that our approach outperforms fully-supervised segmentation models trained on limited target-domain thermal data, and demonstrate real-time capabilities onboard an Nvidia Jetson embedded computing platform. Code and datasets used in this work will be available at: https://github.com/connorlee77/uav-thermal-water-segmentation.

Title: Connections between Operator-splitting Methods and Deep Neural Networks with Applications in Image Segmentation. (arXiv:2307.09052v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09052
Code URL: null
Copy Paste: [[2307.09052] Connections between Operator-splitting Methods and Deep Neural Networks with Applications in Image Segmentation](http://arxiv.org/abs/2307.09052) #segmentation
Summary:
Deep neural network is a powerful tool for many tasks. Understanding why it is so successful and providing a mathematical explanation is an important problem and has been one popular research direction in past years. In the literature of mathematical analysis of deep deep neural networks, a lot of works are dedicated to establishing representation theories. How to make connections between deep neural networks and mathematical algorithms is still under development. In this paper, we give an algorithmic explanation for deep neural networks, especially in their connection with operator splitting and multigrid methods. We show that with certain splitting strategies, operator-splitting methods have the same structure as networks. Utilizing this connection and the Potts model for image segmentation, two networks inspired by operator-splitting methods are proposed. The two networks are essentially two operator-splitting algorithms solving the Potts model. Numerical experiments are presented to demonstrate the effectiveness of the proposed networks.

Title: Mining of Single-Class by Active Learning for Semantic Segmentation. (arXiv:2307.09109v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09109
Code URL: null
Copy Paste: [[2307.09109] Mining of Single-Class by Active Learning for Semantic Segmentation](http://arxiv.org/abs/2307.09109) #segmentation
Summary:
Several Active Learning (AL) policies require retraining a target model several times in order to identify the most informative samples and rarely offer the option to focus on the acquisition of samples from underrepresented classes. Here the Mining of Single-Class by Active Learning (MiSiCAL) paradigm is introduced where an AL policy is constructed through deep reinforcement learning and exploits quantity-accuracy correlations to build datasets on which high-performance models can be trained with regards to specific classes. MiSiCAL is especially helpful in the case of very large batch sizes since it does not require repeated model training sessions as is common in other AL methods. This is thanks to its ability to exploit fixed representations of the candidate data points. We find that MiSiCAL is able to outperform a random policy on 150 out of 171 COCO10k classes, while the strongest baseline only outperforms random on 101 classes.

Title: CG-fusion CAM: Online segmentation of laser-induced damage on large-aperture optics. (arXiv:2307.09161v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09161
Code URL: null
Copy Paste: [[2307.09161] CG-fusion CAM: Online segmentation of laser-induced damage on large-aperture optics](http://arxiv.org/abs/2307.09161) #segmentation
Summary:
Online segmentation of laser-induced damage on large-aperture optics in high-power laser facilities is challenged by complicated damage morphology, uneven illumination and stray light interference. Fully supervised semantic segmentation algorithms have achieved state-of-the-art performance, but rely on plenty of pixel-level labels, which are time-consuming and labor-consuming to produce. LayerCAM, an advanced weakly supervised semantic segmentation algorithm, can generate pixel-accurate results using only image-level labels, but its scattered and partially under-activated class activation regions degrade segmentation performance. In this paper, we propose a weakly supervised semantic segmentation method with Continuous Gradient CAM and its nonlinear multi-scale fusion (CG-fusion CAM). The method redesigns the way of back-propagating gradients and non-linearly activates the multi-scale fused heatmaps to generate more fine-grained class activation maps with appropriate activation degree for different sizes of damage sites. Experiments on our dataset show that the proposed method can achieve segmentation performance comparable to that of fully supervised algorithms.

Title: A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. (arXiv:2307.09220v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09220
Code URL: null
Copy Paste: [[2307.09220] A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future](http://arxiv.org/abs/2307.09220) #segmentation
Summary:
As the most fundamental tasks of computer vision, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art detectors and segmentors fail to generalize beyond the closed-vocabulary. To resolve this limitation, the last few years have witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we provide a comprehensive review on the past and recent development of OVD and OVS. To this end, we develop a taxonomy according to the type of task and methodology. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation-based, and transfer learning-based. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D scene and video understanding. In each category, its main principles, key challenges, development routes, strengths, and weaknesses are thoroughly discussed. In addition, we benchmark each task along with the vital components of each method. Finally, several promising directions are provided to stimulate future research.

Title: MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds. (arXiv:2307.09316v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09316
Code URL: null
Copy Paste: [[2307.09316] MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds](http://arxiv.org/abs/2307.09316) #segmentation
Summary:
3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lacking of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D.

Title: OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation. (arXiv:2307.09356v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09356
Code URL: https://github.com/wudongming97/onlinerefer
Copy Paste: [[2307.09356] OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation](http://arxiv.org/abs/2307.09356) #segmentation
Summary:
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present that the offline pattern is necessary for RVOS, yet model limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, \ie, Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.

Title: Disentangle then Parse:Night-time Semantic Segmentation with Illumination Disentanglement. (arXiv:2307.09362v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.09362
Code URL: null
Copy Paste: [[2307.09362] Disentangle then Parse:Night-time Semantic Segmentation with Illumination Disentanglement](http://arxiv.org/abs/2307.09362) #segmentation
Summary:
Most prior semantic segmentation methods have been developed for day-time scenes, while typically underperforming in night-time scenes due to insufficient and complicated lighting conditions. In this work, we tackle this challenge by proposing a novel night-time semantic segmentation paradigm, i.e., disentangle then parse (DTP). DTP explicitly disentangles night-time images into light-invariant reflectance and light-specific illumination components and then recognizes semantics based on their adaptive fusion. Concretely, the proposed DTP comprises two key components: 1) Instead of processing lighting-entangled features as in prior works, our Semantic-Oriented Disentanglement (SOD) framework enables the extraction of reflectance component without being impeded by lighting, allowing the network to consistently recognize the semantics under cover of varying and complicated lighting conditions. 2) Based on the observation that the illumination component can serve as a cue for some semantically confused regions, we further introduce an Illumination-Aware Parser (IAParser) to explicitly learn the correlation between semantics and lighting, and aggregate the illumination features to yield more precise predictions. Extensive experiments on the night-time segmentation task with various settings demonstrate that DTP significantly outperforms state-of-the-art methods. Furthermore, with negligible additional parameters, DTP can be directly used to benefit existing day-time methods for night-time segmentation.

Title: Data Cross-Segmentation for Improved Generalization in Reinforcement Learning Based Algorithmic Trading. (arXiv:2307.09377v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.09377
Code URL: null
Copy Paste: [[2307.09377] Data Cross-Segmentation for Improved Generalization in Reinforcement Learning Based Algorithmic Trading](http://arxiv.org/abs/2307.09377) #segmentation
Summary:
The use of machine learning in algorithmic trading systems is increasingly common. In a typical set-up, supervised learning is used to predict the future prices of assets, and those predictions drive a simple trading and execution strategy. This is quite effective when the predictions have sufficient signal, markets are liquid, and transaction costs are low. However, those conditions often do not hold in thinly traded financial markets and markets for differentiated assets such as real estate or vehicles. In these markets, the trading strategy must consider the long-term effects of taking positions that are relatively more difficult to change. In this work, we propose a Reinforcement Learning (RL) algorithm that trades based on signals from a learned predictive model and addresses these challenges. We test our algorithm on 20+ years of equity data from Bursa Malaysia.