secure

Title: Unraveling Latch Locking Using Machine Learning, Boolean Analysis, and ILP. (arXiv:2305.00107v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00107
Code URL: null
Copy Paste: [[2305.00107] Unraveling Latch Locking Using Machine Learning, Boolean Analysis, and ILP](http://arxiv.org/abs/2305.00107) #secure
Summary:
Logic locking has become a promising approach to provide hardware security in the face of a possibly insecure fabrication supply chain. While many techniques have focused on locking combinational logic (CL), an alternative latch-locking approach in which the sequential elements are locked has also gained significant attention. Latch (LAT) locking duplicates a subset of the flip-flops (FF) of a design, retimes these FFs and replaces them with latches, and adds two types of decoy latches to obfuscate the netlist. It then adds control circuitry (CC) such that all latches must be correctly keyed for the circuit to function correctly. This paper presents a two-phase attack on latch-locked circuits that uses a novel combination of deep learning, Boolean analysis, and integer linear programming (ILP). The attack requires access to the reverse-engineered netlist but, unlike SAT attacks, is oracle-less, not needing access to the unlocked circuit or correct input/output pairs. We trained and evaluated the attack using the ISCAS'89 and ITC'99 benchmark circuits. The attack successfully identifies a key that is, on average, 96.9% accurate and fully discloses the correct functionality in 8 of the tested 19 circuits and leads to low function corruptibility (less than 4%) in 3 additional circuits. The attack run-times are manageable.

Title: ZIRCON: Zero-watermarking-based approach for data integrity and secure provenance in IoT networks. (arXiv:2305.00266v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00266
Code URL: null
Copy Paste: [[2305.00266] ZIRCON: Zero-watermarking-based approach for data integrity and secure provenance in IoT networks](http://arxiv.org/abs/2305.00266) #secure
Summary:
The Internet of Things (IoT) is integrating the Internet and smart devices in almost every domain such as home automation, e-healthcare systems, vehicular networks, industrial control and military applications. In these sectors, sensory data, which is collected from multiple sources and managed through intermediate processing by multiple nodes, is used for decision-making processes. Ensuring data integrity and keeping track of data provenance is a core requirement in such a highly dynamic context, since data provenance is an important tool for the assurance of data trustworthiness. Dealing with such requirements is challenging due to the limited computational and energy resources in IoT networks. This requires addressing several challenges such as processing overhead, secure provenance, bandwidth consumption and storage efficiency. In this paper, we propose ZIRCON, a novel zero-watermarking approach to establish end-to-end data trustworthiness in an IoT network. In ZIRCON, provenance information is stored in a tamper-proof centralized network database through watermarks, generated at source node before transmission. We provide an extensive security analysis showing the resilience of our scheme against passive and active attacks. We also compare our scheme with existing works based on performance metrics such as computational time, energy utilization and cost analysis. The results show that ZIRCON is robust against several attacks, lightweight, storage efficient, and better in energy utilization and bandwidth consumption, compared to prior art.

Title: Montsalvat: Intel SGX Shielding for GraalVM Native Images. (arXiv:2305.00766v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00766
Code URL: null
Copy Paste: [[2305.00766] Montsalvat: Intel SGX Shielding for GraalVM Native Images](http://arxiv.org/abs/2305.00766) #secure
Summary:
The popularity of the Java programming language has led to its wide adoption in cloud computing infrastructures. However, Java applications running in untrusted clouds are vulnerable to various forms of privileged attacks. The emergence of trusted execution environments (TEEs) such as Intel SGX mitigates this problem. TEEs protect code and data in secure enclaves inaccessible to untrusted software, including the kernel and hypervisors. To efficiently use TEEs, developers must manually partition their applications into trusted and untrusted parts, in order to reduce the size of the trusted computing base (TCB) and minimise the risks of security vulnerabilities. However, partitioning applications poses two important challenges: (i) ensuring efficient object communication between the partitioned components, and (ii) ensuring the consistency of garbage collection between the parts, especially with memory-managed languages such as Java. We present Montsalvat, a tool which provides a practical and intuitive annotation-based partitioning approach for Java applications destined for secure enclaves. Montsalvat provides an RMI-like mechanism to ensure inter-object communication, as well as consistent garbage collection across the partitioned components. We implement Montsalvat with GraalVM native-image, a tool for compiling Java applications ahead-of-time into standalone native executables that do not require a JVM at runtime. Our extensive evaluation with micro- and macro-benchmarks shows that our partitioning approach boosts performance in real-world applications.

security

Title: Constructing a Knowledge Graph from Textual Descriptions of Software Vulnerabilities in the National Vulnerability Database. (arXiv:2305.00382v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00382
Code URL: null
Copy Paste: [[2305.00382] Constructing a Knowledge Graph from Textual Descriptions of Software Vulnerabilities in the National Vulnerability Database](http://arxiv.org/abs/2305.00382) #security
Summary:
Knowledge graphs have shown promise for several cybersecurity tasks, such as vulnerability assessment and threat analysis. In this work, we present a new method for constructing a vulnerability knowledge graph from information in the National Vulnerability Database (NVD). Our approach combines named entity recognition (NER), relation extraction (RE), and entity prediction using a combination of neural models, heuristic rules, and knowledge graph embeddings. We demonstrate how our method helps to fix missing entities in knowledge graphs used for cybersecurity and evaluate the performance.

Title: Decentralised Identity Federations using Blockchain. (arXiv:2305.00315v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00315
Code URL: null
Copy Paste: [[2305.00315] Decentralised Identity Federations using Blockchain](http://arxiv.org/abs/2305.00315) #security
Summary:
Federated Identity Management has proven its worth by offering economic benefits and convenience to Service Providers and users alike. In such federations, the Identity Provider (IdP) is the sole entity responsible for managing user credentials and generating assertions for users who request access to a service provider's resource. This makes the IdP a centralised single point of failure, leaving the federation prone to catastrophic damage. The paper presents our effort in designing and implementing a decentralised system for establishing an identity federation. To decentralise the IdP in the federation, the proposed system relies on blockchain technology, thereby mitigating the single-point-of-failure shortcoming of existing identity federations. The system is designed using a set of requirements. In this article, we explore different aspects of designing and developing the system, present its protocol flow, analyse its performance, and evaluate its security using ProVerif, a state-of-the-art formal protocol verification tool.

Title: MetaShard: A Novel Sharding Blockchain Platform for Metaverse Applications. (arXiv:2305.00367v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00367
Code URL: null
Copy Paste: [[2305.00367] MetaShard: A Novel Sharding Blockchain Platform for Metaverse Applications](http://arxiv.org/abs/2305.00367) #security
Summary:
Due to its security, transparency, and flexibility in verifying virtual assets, blockchain has been identified as one of the key technologies for the Metaverse. Unfortunately, a blockchain-based Metaverse faces serious challenges such as massive resource demands, scalability, and security concerns. To address these issues, this paper proposes a novel sharding-based blockchain framework, namely MetaShard, for Metaverse applications. In particular, we first develop an effective consensus mechanism, namely Proof-of-Engagement, that can incentivize Metaverse users' (MUs') data and computing resource contribution. Moreover, to improve the scalability of MetaShard, we propose an innovative sharding management scheme to maximize the network's throughput while protecting the shards from 51% attacks. Since the optimization problem is NP-complete, we develop a hybrid approach that decomposes the problem (using the binary search method) into sub-problems that can be solved effectively by the Lagrangian method. As a result, the proposed approach can obtain solutions in polynomial time, thereby enabling flexible shard reconfiguration and reducing the risk of corruption from the adversary. Extensive numerical experiments show that, compared to the state-of-the-art commercial solvers, our proposed approach can achieve up to 66.6% higher throughput in less than 1/30 of the running time. Moreover, the proposed approach can achieve global optimal solutions in most experiments.

Title: SoK: Pragmatic Assessment of Machine Learning for Network Intrusion Detection. (arXiv:2305.00550v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00550
Code URL: https://github.com/hihey54/pragmaticassessment
Copy Paste: [[2305.00550] SoK: Pragmatic Assessment of Machine Learning for Network Intrusion Detection](http://arxiv.org/abs/2305.00550) #security
Summary:
Machine Learning (ML) has become a valuable asset to solve many real-world tasks. For Network Intrusion Detection (NID), however, scientific advances in ML are still seen with skepticism by practitioners. This disconnection is due to the intrinsically limited scope of research papers, many of which primarily aim to demonstrate new methods "outperforming" prior work, oftentimes overlooking the practical implications for deploying the proposed solutions in real systems. Unfortunately, the value of ML for NID depends on a plethora of factors, such as hardware, that are often neglected in scientific literature.

This paper aims to reduce the practitioners' skepticism towards ML for NID by "changing" the evaluation methodology adopted in research. After elucidating which "factors" influence the operational deployment of ML in NID, we propose the notion of "pragmatic assessment", which enables practitioners to gauge the real value of ML methods for NID. Then, we show that the state of research hardly allows one to estimate the value of ML for NID. As a constructive step forward, we carry out a pragmatic assessment. We re-assess existing ML methods for NID, focusing on the classification of malicious network traffic, and consider: hundreds of configuration settings; diverse adversarial scenarios; and four hardware platforms. Our large and reproducible evaluations enable estimating the quality of ML for NID. We also validate our claims through a user study with security practitioners.

Title: MAMBO-V: Dynamic Side-Channel Leakage Analysis on RISC-V. (arXiv:2305.00584v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00584
Code URL: null
Copy Paste: [[2305.00584] MAMBO-V: Dynamic Side-Channel Leakage Analysis on RISC-V](http://arxiv.org/abs/2305.00584) #security
Summary:
RISC-V is an emerging technology, with applications ranging from embedded devices to high-performance servers. Therefore, more and more security-critical workloads will be conducted with code that is compiled for RISC-V. Well-known microarchitectural side-channel attacks against established platforms like x86 apply to RISC-V CPUs as well. As RISC-V does not mandate any hardware-based side-channel countermeasures, a piece of code compiled for a generic RISC-V CPU in a cloud server cannot make safe assumptions about the microarchitecture on which it is running. Existing tools for aiding software-level precautions by checking side-channel vulnerabilities on source code or x86 binaries are not compatible with RISC-V machine code.

In this work, we study the requirements and goals of architecture-specific leakage analysis for RISC-V and illustrate how to achieve these goals with the help of fast and precise dynamic binary analysis. We implement all necessary building blocks for finding side-channel leakages on RISC-V, while relying on existing mature solutions when possible. Our leakage analysis builds upon the modular side-channel analysis framework Microwalk, which examines execution traces for leakage through secret-dependent memory accesses or branches. To provide suitable traces, we port the ARM dynamic binary instrumentation tool MAMBO to RISC-V. Our port, named MAMBO-V, can instrument arbitrary binaries which use the 64-bit general-purpose instruction set. We evaluate our toolchain on several cryptographic libraries with RISC-V support and identify multiple exploitable leakages.

Title: Uncovering CWE-CVE-CPE Relations with Threat Knowledge Graphs. (arXiv:2305.00632v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00632
Code URL: null
Copy Paste: [[2305.00632] Uncovering CWE-CVE-CPE Relations with Threat Knowledge Graphs](http://arxiv.org/abs/2305.00632) #security
Summary:
Security assessment relies on public information about products, vulnerabilities, and weaknesses. So far, databases in these categories have rarely been analyzed in combination. Yet, doing so could help predict unreported vulnerabilities and identify common threat patterns. In this paper, we propose a methodology for producing and optimizing a knowledge graph that aggregates knowledge from common threat databases (CVE, CWE, and CPE). We apply the threat knowledge graph to predict associations between threat databases, specifically between products, vulnerabilities, and weaknesses. We evaluate the prediction performance both in closed world with associations from the knowledge graph, and in open world with associations revealed afterward. Using rank-based metrics (i.e., Mean Rank, Mean Reciprocal Rank, and Hits@N scores), we demonstrate the ability of the threat knowledge graph to uncover many associations that are currently unknown but will be revealed in the future, which remains useful over different time periods. We propose approaches to optimize the knowledge graph, and show that they indeed help in further uncovering associations.
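
The rank-based metrics named in this summary (Mean Rank, Mean Reciprocal Rank, and Hits@N) can be illustrated with a minimal Python sketch. It assumes each test association has already been assigned the 1-based rank of its true entity among all candidates; the function names and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_rank(ranks: Sequence[int]) -> float:
    """Average 1-based rank of the true entity; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank; lies in (0, 1], higher is better."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks: Sequence[int], n: int) -> float:
    """Fraction of test associations whose true entity ranks in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Hypothetical ranks for four held-out associations (illustration only).
example_ranks = [1, 2, 4, 10]

if __name__ == "__main__":
    print("Mean Rank:", mean_rank(example_ranks))
    print("MRR:", mean_reciprocal_rank(example_ranks))
    print("Hits@3:", hits_at_n(example_ranks, 3))
```

Mean Rank is sensitive to a few badly ranked associations, whereas Mean Reciprocal Rank and Hits@N emphasize how often the true entity appears near the top, which is presumably why such metrics are reported together.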

Title: Security-Enhancing Digital Twins: Characteristics, Indicators, and Future Perspectives. (arXiv:2305.00639v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00639
Code URL: null
Copy Paste: [[2305.00639] Security-Enhancing Digital Twins: Characteristics, Indicators, and Future Perspectives](http://arxiv.org/abs/2305.00639) #security
Summary:
The term "digital twin" (DT) has become a key theme of the cyber-physical systems (CPSs) area, while remaining vaguely defined as a virtual replica of an entity. This article identifies DT characteristics essential for enhancing CPS security and discusses indicators to evaluate them.

Title: SGX Switchless Calls Made Configless. (arXiv:2305.00763v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00763
Code URL: null
Copy Paste: [[2305.00763] SGX Switchless Calls Made Configless](http://arxiv.org/abs/2305.00763) #security
Summary:
Intel's software guard extensions (SGX) provide hardware enclaves to guarantee confidentiality and integrity for sensitive code and data. However, systems leveraging such security mechanisms must often pay high performance overheads. A major source of this overhead is SGX enclave transitions which induce expensive cross-enclave context switches. The Intel SGX SDK mitigates this with a switchless call mechanism for transitionless cross-enclave calls using worker threads. Intel's SGX switchless call implementation improves performance but provides limited flexibility: developers need to statically fix the system configuration at build time, which is error-prone and misconfigurations lead to performance degradations and waste of CPU resources. ZC-SWITCHLESS is a configless and efficient technique to drive the execution of SGX switchless calls. Its dynamic approach optimises the total switchless worker threads at runtime to minimise CPU waste. The experimental evaluation shows that ZC-SWITCHLESS obviates the performance penalty of misconfigured switchless systems while minimising CPU waste.

privacy

Title: Reliable Gradient-free and Likelihood-free Prompt Tuning. (arXiv:2305.00593v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00593
Code URL: https://github.com/maohaos2/SBI_LLM
Copy Paste: [[2305.00593] Reliable Gradient-free and Likelihood-free Prompt Tuning](http://arxiv.org/abs/2305.00593) #privacy
Summary:
Due to privacy or commercial constraints, large pre-trained language models (PLMs) are often offered as black-box APIs. Fine-tuning such models to downstream tasks is challenging because one can neither access the model's internal representations nor propagate gradients through it. This paper addresses these challenges by developing techniques for adapting PLMs with only API access. Building on recent work on soft prompt tuning, we develop methods to tune the soft prompts without requiring gradient computation. Further, we develop extensions that in addition to not requiring gradients also do not need to access any internal representation of the PLM beyond the input embeddings. Moreover, instead of learning a single prompt, our methods learn a distribution over prompts allowing us to quantify predictive uncertainty. Ours is the first work to consider uncertainty in prompts when only having API access to the PLM. Finally, through extensive experiments, we carefully vet the proposed methods and find them competitive with (and sometimes even improving on) gradient-based approaches with full access to the PLM.

Title: GTree: GPU-Friendly Privacy-preserving Decision Tree Training and Inference. (arXiv:2305.00645v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00645
Code URL: null
Copy Paste: [[2305.00645] GTree: GPU-Friendly Privacy-preserving Decision Tree Training and Inference](http://arxiv.org/abs/2305.00645) #privacy
Summary:
Decision tree (DT) is a widely used machine learning model due to its versatility, speed, and interpretability. However, for privacy-sensitive applications, outsourcing DT training and inference to cloud platforms raises concerns about data privacy. Researchers have developed privacy-preserving approaches for DT training and inference using cryptographic primitives, such as Secure Multi-Party Computation (MPC). While these approaches have shown progress, they still suffer from heavy computation and communication overheads. A few recent works employ Graphics Processing Units (GPUs) to improve the performance of MPC-protected deep learning. This raises a natural question: can MPC-protected DT training and inference be accelerated by GPU?

We present GTree, the first scheme that uses GPU to accelerate MPC-protected secure DT training and inference. GTree is built across 3 parties who securely and jointly perform each step of DT training and inference with GPU. Each MPC protocol in GTree is designed in a GPU-friendly version. The performance evaluation shows that GTree achieves ~11× and ~21× improvements in training on the SPECT and Adult datasets, respectively, compared to the prior most efficient CPU-based work. For inference, GTree shows superior efficiency when the DT has fewer than 10 levels: it is 126× faster than the prior most efficient work when inferring 10^4 instances with a tree of 7 levels. GTree also achieves a stronger security guarantee than prior solutions: it leaks only the tree depth and the size of the data samples, whereas prior solutions also leak the tree structure. With oblivious array access, the access pattern on GPU is also protected.
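
The summary does not say which MPC primitives GTree uses, so the following Python sketch shows only the generic idea behind a three-party setting, using plain additive secret sharing as an assumed stand-in. The modulus, function names, and party count default are illustrative choices, not GTree's protocol.

```python
import secrets
from typing import List

MODULUS = 2 ** 32  # assumed ring size, chosen only for illustration


def share(value: int, parties: int = 3) -> List[int]:
    """Split value into additive shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recombine all shares; any strict subset alone looks uniformly random."""
    return sum(shares) % MODULUS


def add_shared(a: List[int], b: List[int]) -> List[int]:
    """Each party adds its own two shares locally, with no communication."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


if __name__ == "__main__":
    x_shares = share(41)
    y_shares = share(1)
    print(reconstruct(add_shared(x_shares, y_shares)))
```

Addition needs no interaction between parties, while operations such as comparison or multiplication require extra protocol rounds, which is one reason MPC-protected training tends to carry the computation and communication overheads the summary mentions.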

Title: slytHErin: An Agile Framework for Encrypted Deep Neural Network Inference. (arXiv:2305.00690v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00690
Code URL: null
Copy Paste: [[2305.00690] slytHErin: An Agile Framework for Encrypted Deep Neural Network Inference](http://arxiv.org/abs/2305.00690) #privacy
Summary:
Homomorphic encryption (HE), which allows computations on encrypted data, is an enabling technology for confidential cloud computing. One notable example is privacy-preserving Prediction-as-a-Service (PaaS), where machine-learning predictions are computed on encrypted data. However, developing HE-based solutions for encrypted PaaS is a tedious task which requires a careful design that predominantly depends on the deployment scenario and on leveraging the characteristics of modern HE schemes. Prior works on privacy-preserving PaaS focus solely on protecting the confidentiality of the client data uploaded to a remote model provider, e.g., a cloud offering a prediction API, and assume (or take advantage of the fact) that the model is held in plaintext. Furthermore, their aim is to either minimize the latency of the service by processing one sample at a time, or to maximize the number of samples processed per second, while processing a fixed (large) number of samples. In this work, we present slytHErin, an agile framework that enables privacy-preserving PaaS beyond the application scenarios considered in prior works. Thanks to its hybrid design leveraging HE and its multiparty variant (MHE), slytHErin enables novel PaaS scenarios by encrypting the data, the model or both. Moreover, slytHErin features a flexible input data packing approach that allows processing a batch of an arbitrary number of samples, and several computation optimizations that are model-and-setting-agnostic. slytHErin is implemented in Go and it allows end-users to perform encrypted PaaS on custom deep learning models comprising fully-connected, convolutional, and pooling layers, in a few lines of code and without having to worry about the cumbersome implementation and optimization concerns inherent to HE.

Title: Optimizing Privacy, Utility and Efficiency in Constrained Multi-Objective Federated Learning. (arXiv:2305.00312v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00312
Code URL: null
Copy Paste: [[2305.00312] Optimizing Privacy, Utility and Efficiency in Constrained Multi-Objective Federated Learning](http://arxiv.org/abs/2305.00312) #privacy
Summary:
Conventionally, federated learning aims to optimize a single objective, typically the utility. However, for a federated learning system to be trustworthy, it needs to simultaneously satisfy multiple objectives, such as maximizing model performance, minimizing privacy leakage and training cost, and being robust to malicious attacks. Multi-Objective Optimization (MOO), which aims to optimize multiple conflicting objectives at the same time, is well suited to solving the optimization problem of Trustworthy Federated Learning (TFL). In this paper, we unify MOO and TFL by formulating the problem of constrained multi-objective federated learning (CMOFL). Under this formulation, existing MOO algorithms can be adapted to TFL straightforwardly. Different from existing CMOFL works focusing on utility, efficiency, fairness, and robustness, we consider optimizing privacy leakage along with utility loss and training cost, the three primary objectives of a TFL system. We develop two improved CMOFL algorithms based on NSGA-II and PSL, respectively, for effectively and efficiently finding Pareto optimal solutions, and we provide theoretical analysis on their convergence. We design specific measurements of privacy leakage, utility loss, and training cost for three privacy protection mechanisms: Randomization, BatchCrypt (an efficient version of homomorphic encryption), and Sparsification. Empirical experiments conducted under each of the three protection mechanisms demonstrate the effectiveness of our proposed algorithms.

protect

defense

Title: NNSplitter: An Active Defense Solution to DNN Model via Automated Weight Obfuscation. (arXiv:2305.00097v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00097
Code URL: null
Copy Paste: [[2305.00097] NNSplitter: An Active Defense Solution to DNN Model via Automated Weight Obfuscation](http://arxiv.org/abs/2305.00097) #defense
Summary:
As a type of valuable intellectual property (IP), deep neural network (DNN) models have been protected by techniques like watermarking. However, such passive model protection cannot fully prevent model abuse. In this work, we propose an active model IP protection scheme, namely NNSplitter, which actively protects the model by splitting it into two parts: the obfuscated model that performs poorly due to weight obfuscation, and the model secrets consisting of the indexes and original values of the obfuscated weights, which can only be accessed by authorized users. NNSplitter uses the trusted execution environment to secure the secrets and a reinforcement learning-based controller to reduce the number of obfuscated weights while maximizing accuracy drop. Our experiments show that by only modifying 313 out of over 28 million (i.e., 0.001%) weights, the accuracy of the obfuscated VGG-11 model on Fashion-MNIST can drop to 10%. We also demonstrate that NNSplitter is stealthy and resilient against potential attack surfaces, including norm clipping and fine-tuning attacks.
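
The split described in this summary (an obfuscated model plus secrets holding the indexes and original values of the changed weights) can be sketched in Python as follows. This is a toy illustration under stated assumptions: the weights are a flat list, the indexes are picked by hand rather than by the reinforcement learning controller, and the constant `noise` perturbation is hypothetical.

```python
from typing import Dict, List, Tuple


def split_model(
    weights: List[float], indexes: List[int], noise: float
) -> Tuple[List[float], Dict[int, float]]:
    """Return (obfuscated weights, secrets mapping index -> original value)."""
    obfuscated = list(weights)
    model_secrets: Dict[int, float] = {}
    for i in indexes:
        model_secrets[i] = weights[i]      # keep the original value secret
        obfuscated[i] = weights[i] + noise  # perturb the published weight
    return obfuscated, model_secrets


def restore_model(
    obfuscated: List[float], model_secrets: Dict[int, float]
) -> List[float]:
    """An authorized user holding the secrets recovers the original weights."""
    restored = list(obfuscated)
    for i, original in model_secrets.items():
        restored[i] = original
    return restored


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 2.0]
    public, hidden = split_model(original, indexes=[1, 3], noise=10.0)
    print(public)
    print(restore_model(public, hidden) == original)
```

Because only the changed positions and their original values are stored, the secret part stays small relative to the model, consistent with the summary's figure of 313 modified weights out of more than 28 million.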

attack

Title: FedGrad: Mitigating Backdoor Attacks in Federated Learning Through Local Ultimate Gradients Inspection. (arXiv:2305.00328v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00328
Code URL: null
Copy Paste: [[2305.00328] FedGrad: Mitigating Backdoor Attacks in Federated Learning Through Local Ultimate Gradients Inspection](http://arxiv.org/abs/2305.00328) #attack
Summary:
Federated learning (FL) enables multiple clients to train a model without compromising sensitive data. The decentralized nature of FL makes it susceptible to adversarial attacks, especially backdoor insertion during training. Recently, the edge-case backdoor attack, which exploits the tail of the data distribution, has been proposed as a powerful attack, raising questions about the shortfall in current defenses' robustness guarantees. Specifically, most existing defenses cannot eliminate edge-case backdoor attacks or suffer from a trade-off between backdoor-defending effectiveness and overall performance on the primary task. To tackle this challenge, we propose FedGrad, a novel defense for FL that is resistant to cutting-edge backdoor attacks, including the edge-case attack, and performs effectively under heterogeneous client data and a large number of compromised clients. FedGrad is designed as a two-layer filtering mechanism that thoroughly analyzes the ultimate layer's gradient to identify suspicious local updates and remove them from the aggregation process. We evaluate FedGrad under different attack scenarios and show that it significantly outperforms state-of-the-art defense mechanisms. Notably, FedGrad can detect the malicious participants almost 100% correctly, thus providing a significant reduction in the backdoor effect (e.g., backdoor accuracy is less than 8%) while not reducing the main accuracy on the primary task.

Title: Enhancing Adversarial Contrastive Learning via Adversarial Invariant Regularization. (arXiv:2305.00374v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00374
Code URL: null
Copy Paste: [[2305.00374] Enhancing Adversarial Contrastive Learning via Adversarial Invariant Regularization](http://arxiv.org/abs/2305.00374) #attack
Summary:
Adversarial contrastive learning (ACL), without requiring labels, incorporates adversarial data with standard contrastive learning (SCL) and outputs a robust representation which is generalizable and resistant to adversarial attacks and common corruptions. The style-independence property of representations has been validated to be beneficial in improving robustness transferability. Standard invariant regularization (SIR) has been proposed to make the representations learned via SCL independent of the style factors. However, how to equip robust representations learned via ACL with the style-independence property is still unclear. To this end, we leverage the technique of causal reasoning to propose an adversarial invariant regularization (AIR) that enforces robust representations learned via ACL to be style-independent. Then, we enhance ACL using invariant regularization (IR), which is a weighted sum of SIR and AIR. Theoretically, we show that AIR implicitly encourages the prediction of adversarial data and consistency between adversarial and natural data to be independent of data augmentations. We also theoretically demonstrate that the style-independence property of robust representation learned via ACL still holds in downstream tasks, providing generalization guarantees. Empirically, our comprehensive experimental results corroborate that IR can significantly improve the performance of ACL and its variants on various datasets.

Title: Assessing Vulnerabilities of Adversarial Learning Algorithm through Poisoning Attacks. (arXiv:2305.00399v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2305.00399
Code URL: null
Copy Paste: [[2305.00399] Assessing Vulnerabilities of Adversarial Learning Algorithm through Poisoning Attacks](http://arxiv.org/abs/2305.00399) #attack
Summary:
Adversarial training (AT) is a robust learning algorithm that can defend against adversarial attacks in the inference phase and mitigate the side effects of corrupted data in the training phase. As such, it has become an indispensable component of many artificial intelligence (AI) systems. However, in high-stakes AI applications, it is crucial to understand AT's vulnerabilities to ensure reliable deployment. In this paper, we investigate AT's susceptibility to poisoning attacks, a type of malicious attack that manipulates training data to compromise the performance of the trained model. Previous work has focused on poisoning attacks against standard training, but little research has been done on their effectiveness against AT. To fill this gap, we design and test effective poisoning attacks against AT. Specifically, we investigate and design clean-label poisoning attacks, allowing attackers to imperceptibly modify a small fraction of training data to control the algorithm's behavior on a specific target data point. Additionally, we propose the clean-label untargeted attack, enabling attackers to attach tiny stickers to training data to degrade the algorithm's performance on all test data, where the stickers could serve as a signal against unauthorized data collection. Our experiments demonstrate that AT can still be poisoned, highlighting the need for caution when using vanilla AT algorithms in security-related applications. The code is at https://github.com/zjfheart/Poison-adv-training.git.

robust

Title: Exploring the Zero-Shot Capabilities of the Segment Anything Model (SAM) in 2D Medical Imaging: A Comprehensive Evaluation and Practical Guideline. (arXiv:2305.00109v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00109
Code URL: null
Copy Paste: [[2305.00109] Exploring the Zero-Shot Capabilities of the Segment Anything Model (SAM) in 2D Medical Imaging: A Comprehensive Evaluation and Practical Guideline](http://arxiv.org/abs/2305.00109) #robust
Summary:
Segmentation in medical imaging plays a crucial role in diagnosing, monitoring, and treating various diseases and conditions. The current landscape of segmentation in the medical domain is dominated by numerous specialized deep learning models fine-tuned for each segmentation task and image modality. Recently, the Segment Anything Model (SAM), a new segmentation model, was introduced. SAM utilizes the ViT neural architecture and leverages a vast training dataset to segment almost any object. However, its generalizability to the medical domain remains unexplored. In this study, we assess the zero-shot capabilities of SAM in 2D medical imaging using eight different prompt strategies across six datasets from four imaging modalities: X-ray, ultrasound, dermatoscopy, and colonoscopy. Our results demonstrate that SAM's zero-shot performance is comparable and, in certain cases, superior to the current state-of-the-art. Based on our findings, we propose a practical guideline that requires minimal interaction and yields robust results in all evaluated contexts.

Title: Sensor Equivariance by LiDAR Projection Images. (arXiv:2305.00221v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00221
Code URL: null
Copy Paste: [[2305.00221] Sensor Equivariance by LiDAR Projection Images](http://arxiv.org/abs/2305.00221) #robust
Summary:
In this work, we propose an extension of conventional image data by an additional channel in which the associated projection properties are encoded. This addresses the issue of sensor-dependent object representation in projection-based sensors, such as LiDAR, which can lead to distorted physical and geometric properties due to variations in sensor resolution and field of view. To that end, we propose an architecture for processing this data in an instance segmentation framework. We focus specifically on LiDAR as a key sensor modality for machine vision tasks and highly automated driving (HAD). Through an experimental setup in a controlled synthetic environment, we identify a bias on sensor resolution and field of view and demonstrate that our proposed method can reduce said bias for the task of LiDAR instance segmentation. Furthermore, we define our method such that it can be applied to other projection-based sensors, such as cameras. To promote transparency, we make our code and dataset publicly available. This method shows the potential to improve performance and robustness in various machine vision tasks that utilize projection-based sensors.

Title: InfraDet3D: Multi-Modal 3D Object Detection based on Roadside Infrastructure Camera and LiDAR Sensors. (arXiv:2305.00314v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00314
Code URL: null
Copy Paste: [[2305.00314] InfraDet3D: Multi-Modal 3D Object Detection based on Roadside Infrastructure Camera and LiDAR Sensors](http://arxiv.org/abs/2305.00314) #robust
Summary:
Current multi-modal object detection approaches focus on the vehicle domain and are limited in the perception range and the processing capabilities. Roadside sensor units (RSUs) introduce a new domain for perception systems and leverage altitude to observe traffic. Cameras and LiDARs mounted on gantry bridges increase the perception range and produce a full digital twin of the traffic. In this work, we introduce InfraDet3D, a multi-modal 3D object detector for roadside infrastructure sensors. We fuse two LiDARs using early fusion and further incorporate detections from monocular cameras to increase the robustness and to detect small objects. Our monocular 3D detection module uses HD maps to ground object yaw hypotheses, improving the final perception results. The perception framework is deployed on a real-world intersection that is part of the A9 Test Stretch in Munich, Germany. We perform several ablation studies and experiments and show that fusing two LiDARs with two cameras leads to an improvement of +1.90 mAP compared to a camera-only solution. We evaluate our results on the A9 infrastructure dataset and achieve 68.48 mAP on the test set. The dataset and code will be available at https://a9-dataset.com to allow the research community to further improve the perception results and make autonomous driving safer.

Title: Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data. (arXiv:2305.00320v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00320
Code URL: https://github.com/art2611/mreid-ucd-ccd
Copy Paste: [[2305.00320] Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data](http://arxiv.org/abs/2305.00320) #robust
Summary:
Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between V and I modalities, especially under real-world conditions, where images are corrupted by, e.g., blur, noise, and weather. Indeed, state-of-the-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID, named Multimodal Middle Stream Fusion (MMSF), that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-the-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing the importance of each modality to be balanced dynamically. Recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate the multimodal V-I ReID models that are more likely to perform well in real-world operational conditions. In particular, our ML-MDA is an important strategy for a V-I person ReID system to sustain high accuracy and robustness when processing corrupted multimodal images. Also, our multimodal ReID model MMSF outperforms every method under CL and NCL camera scenarios.

Title: Modality-invariant Visual Odometry for Embodied Vision. (arXiv:2305.00348v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00348
Code URL: https://github.com/memmelma/vo-transformer
Copy Paste: [[2305.00348] Modality-invariant Visual Odometry for Embodied Vision](http://arxiv.org/abs/2305.00348) #robust
Summary:
Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show a solid performance without large data requirements, they are less flexible and robust w.r.t. noise and changes in the sensor suite compared to learning-based approaches. Recent deep VO models, however, limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples. When sensors fail, sensor suites change, or modalities are intentionally looped out due to available resources, e.g., power consumption, the models fail catastrophically. Furthermore, training these models from scratch is even more expensive without simulator access or suitable existing models that can be fine-tuned. While such scenarios get mostly ignored in simulation, they commonly hinder a model's reusability in real-world applications. We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents. Our model outperforms previous methods while training on only a fraction of the data. We hope this method opens the door to a broader range of real-world applications that can benefit from flexible and learned VO models.

Title: A Simulation-Augmented Benchmarking Framework for Automatic RSO Streak Detection in Single-Frame Space Images. (arXiv:2305.00412v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00412
Code URL: null
Copy Paste: [[2305.00412] A Simulation-Augmented Benchmarking Framework for Automatic RSO Streak Detection in Single-Frame Space Images](http://arxiv.org/abs/2305.00412) #robust
Summary:
Detecting Resident Space Objects (RSOs) and preventing collisions with other satellites is crucial. Recently, deep convolutional neural networks (DCNNs) have shown superior performance in object detection when large-scale datasets are available. However, collecting rich data of RSOs is difficult due to very few occurrences in the space images. Without sufficient data, it is challenging to comprehensively train DCNN detectors and make them effective for detecting RSOs in space images, let alone to estimate whether a detector is sufficiently robust. The lack of meaningful evaluation of different detectors could further affect the design and application of detection methods. To tackle this issue, we propose that the space images containing RSOs can be simulated to complement the shortage of raw data for better benchmarking. Accordingly, we introduce a novel simulation-augmented benchmarking framework for RSO detection (SAB-RSOD). In our framework, by making the best use of the hardware parameters of the sensor that captures real-world space images, we first develop a high-fidelity RSO simulator that can generate various realistic space images. Then, we use this simulator to generate images that contain diversified RSOs in space and annotate them automatically. Later, we mix the synthetic images with the real-world images, obtaining around 500 images for training with only the real-world images for evaluation. Under SAB-RSOD, we can train different popular object detectors like Yolo and Faster RCNN effectively, enabling us to evaluate their performance thoroughly. The evaluation results show that the amount of available data and image resolution are two key factors for robust RSO detection. Moreover, when using a lower resolution for higher efficiency, we demonstrate that a simple UNet-based detection method can already achieve high detection accuracy.

Title: Second-order Anisotropic Gaussian Directional Derivative Filters for Blob Detection. (arXiv:2305.00435v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00435
Code URL: null
Copy Paste: [[2305.00435] Second-order Anisotropic Gaussian Directional Derivative Filters for Blob Detection](http://arxiv.org/abs/2305.00435) #robust
Summary:
Interest point detection methods have received increasing attention and are widely used in computer vision tasks such as image retrieval and 3D reconstruction. In this work, second-order anisotropic Gaussian directional derivative filters with multiple scales are used to smooth the input image and a novel blob detection method is proposed. Extensive experiments demonstrate the superiority of our proposed method over state-of-the-art benchmarks in terms of detection performance and robustness to affine transformations.

Title: Multi-Task Structural Learning using Local Task Similarity induced Neuron Creation and Removal. (arXiv:2305.00441v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00441
Code URL: null
Copy Paste: [[2305.00441] Multi-Task Structural Learning using Local Task Similarity induced Neuron Creation and Removal](http://arxiv.org/abs/2305.00441) #robust
Summary:
Multi-task learning has the potential to improve generalization by maximizing positive transfer between tasks while reducing task interference. Fully achieving this potential is hindered by manually designed architectures that remain static throughout training. In contrast, learning in the brain occurs through structural changes that are in tandem with changes in synaptic strength. Thus, we propose Multi-Task Structural Learning (MTSL), which simultaneously learns the multi-task architecture and its parameters. MTSL begins with an identical single-task network for each task and alternates between a task-learning phase and a structural-learning phase. In the task-learning phase, each network specializes in the corresponding task. In each of the structural-learning phases, starting from the earliest layer, locally similar task layers first transfer their knowledge to a newly created group layer before being removed. MTSL then uses the group layer in place of the corresponding removed task layers and moves on to the next layers. Our empirical results show that MTSL achieves competitive generalization with various baselines and improves robustness to out-of-distribution data.

Title: Learning Self-Prior for Mesh Inpainting Using Self-Supervised Graph Convolutional Networks. (arXiv:2305.00635v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00635
Code URL: null
Copy Paste: [[2305.00635] Learning Self-Prior for Mesh Inpainting Using Self-Supervised Graph Convolutional Networks](http://arxiv.org/abs/2305.00635) #robust
Summary:
This study presents a self-prior-based mesh inpainting framework that requires only an incomplete mesh as input, without the need for any training datasets. Additionally, our method maintains the polygonal mesh format throughout the inpainting process without converting the shape format to an intermediate, such as a voxel grid, a point cloud, or an implicit function, which are typically considered easier for deep neural networks to process. To achieve this goal, we introduce two graph convolutional networks (GCNs): single-resolution GCN (SGCN) and multi-resolution GCN (MGCN), both trained in a self-supervised manner. Our approach refines a watertight mesh obtained from the initial hole filling to generate a completed output mesh. Specifically, we train the GCNs to deform an oversmoothed version of the input mesh into the expected completed shape. To supervise the GCNs for accurate vertex displacements, despite the unknown correct displacements at real holes, we utilize multiple sets of meshes with several connected regions marked as fake holes. The correct displacements are known for vertices in these fake holes, enabling network training with loss functions that assess the accuracy of displacement vectors estimated by the GCNs. We demonstrate that our method outperforms traditional dataset-independent approaches and exhibits greater robustness compared to other deep-learning-based methods for shapes that less frequently appear in shape datasets.

Title: Enhanced Multi-level Features for Very High Resolution Remote Sensing Scene Classification. (arXiv:2305.00679v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00679
Code URL: null
Copy Paste: [[2305.00679] Enhanced Multi-level Features for Very High Resolution Remote Sensing Scene Classification](http://arxiv.org/abs/2305.00679) #robust
Summary:
Very high-resolution (VHR) remote sensing (RS) scene classification is a challenging task due to the higher inter-class similarity and intra-class variability problems. Recently, the existing deep learning (DL)-based methods have shown great promise in VHR RS scene classification. However, they still provide an unstable classification performance. To address such a problem, we, in this letter, propose a novel DL-based approach. For this, we devise an enhanced VHR attention module (EAM), followed by the atrous spatial pyramid pooling (ASPP) and global average pooling (GAP). This procedure imparts the enhanced features from the corresponding level. Then, the multi-level feature fusion is performed. Experimental results on two widely-used VHR RS datasets show that the proposed approach yields a competitive and stable/robust classification performance with the least standard deviation of 0.001. Further, the highest overall accuracies on the AID and the NWPU datasets are 95.39% and 93.04%, respectively.

Title: Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding. (arXiv:2305.00633v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.00633
Code URL: https://github.com/yuxixie/selfeval-guided-decoding
Copy Paste: [[2305.00633] Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding](http://arxiv.org/abs/2305.00633) #robust
Summary:
We propose an effective prompting approach that integrates self-evaluation guidance through stochastic beam search. Our approach explores the reasoning search space using a well-calibrated automatic criterion. This enables an efficient search to produce higher-quality final predictions. With the self-evaluation guided stochastic beam search, we also balance the quality-diversity trade-off in the generation of reasoning chains. This allows our approach to adapt well with majority voting and surpass the corresponding Codex-backboned baselines by 6.34%, 9.56%, and 5.46% on the GSM8K, AQUA, and StrategyQA benchmarks, respectively, in few-shot accuracy. Analysis of our decompositional reasoning finds it pinpoints logic failures and leads to higher consistency and robustness.

Title: Verification against in-situ observations for Data-Driven Weather Prediction. (arXiv:2305.00048v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00048
Code URL: null
Copy Paste: [[2305.00048] Verification against in-situ observations for Data-Driven Weather Prediction](http://arxiv.org/abs/2305.00048) #robust
Summary:
Data-driven weather prediction models (DDWPs) have made rapid strides in recent years, demonstrating an ability to approximate Numerical Weather Prediction (NWP) models to a high degree of accuracy. The fast, accurate, and low-cost DDWP forecasts make their use in operational forecasting an attractive proposition; however, there remains work to be done in rigorously evaluating DDWPs in a true operational setting. Typically trained and evaluated using ERA5 reanalysis data, DDWPs have been tested only in a simulation, which cannot represent the real world with complete accuracy even if it is of a very high quality. The safe use of DDWPs in operational forecasting requires more thorough "real-world" verification, as well as a careful examination of how DDWPs are currently trained and evaluated. It is worth asking, for instance, how well do the reanalysis datasets, used for training, simulate the real world? With an eye towards climate justice and the uneven availability of weather data: is the simulation equally good for all regions of the world, and would DDWPs exacerbate biases present in the training data? Does a good performance in simulation correspond to good performance in operational settings? In addition to approximating the physics of NWP models, how can ML be uniquely deployed to provide more accurate weather forecasts? As a first step towards answering such questions, we present a robust dataset of in-situ observations derived from the NOAA MADIS program to serve as a benchmark to validate DDWPs in an operational setting. By providing a large corpus of quality-controlled, in-situ observations, this dataset provides a meaningful real-world task that all NWPs and DDWPs can be tested against. We hope that this data can be used not only to rigorously and fairly compare operational weather models but also to spur future research in new directions.

Title: Online Platt Scaling with Calibeating. (arXiv:2305.00070v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00070
Code URL: null
Copy Paste: [[2305.00070] Online Platt Scaling with Calibeating](http://arxiv.org/abs/2305.00070) #robust
Summary:
We present an online post-hoc calibration method, called Online Platt Scaling (OPS), which combines the Platt scaling technique with online logistic regression. We demonstrate that OPS smoothly adapts between i.i.d. and non-i.i.d. settings with distribution drift. Further, in scenarios where the best Platt scaling model is itself miscalibrated, we enhance OPS by incorporating a recently developed technique called calibeating to make it more robust. Theoretically, our resulting OPS+calibeating method is guaranteed to be calibrated for adversarial outcome sequences. Empirically, it is effective on a range of synthetic and real-world datasets, with and without distribution drifts, achieving superior performance without hyperparameter tuning. Finally, we extend all OPS ideas to the beta scaling method.
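
The core idea of combining Platt scaling with online updates can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which online logistic regression algorithm OPS uses, so plain online gradient descent on the log loss is swapped in here as an assumption, the calibeating step is omitted, and the names `OnlinePlattScaler` and `lr` are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p, eps=1e-6):
    # Clip to avoid infinities at p = 0 or p = 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


class OnlinePlattScaler:
    """Platt scaling q = sigmoid(a * logit(p) + b), with (a, b) updated per sample.

    Assumption: online gradient descent stands in for the paper's
    online logistic regression routine.
    """

    def __init__(self, lr=0.1):
        self.a = 1.0  # start at the identity map (q == p)
        self.b = 0.0
        self.lr = lr

    def predict(self, p):
        return sigmoid(self.a * logit(p) + self.b)

    def update(self, p, y):
        # Gradient of the log loss w.r.t. (a, b) is (q - y) * (logit(p), 1).
        x = logit(p)
        err = self.predict(p) - y
        self.a -= self.lr * err * x
        self.b -= self.lr * err


if __name__ == "__main__":
    scaler = OnlinePlattScaler()
    # An overconfident base model says 0.9, but the event happens half the time.
    for t in range(400):
        scaler.update(0.9, t % 2)
    print(scaler.predict(0.9))
```

In this toy stream the recalibrated forecast for a base score of 0.9 drifts toward the observed frequency, which is the adaptive behavior the abstract describes.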

Title: On the existence of solutions to adversarial training in multiclass classification. (arXiv:2305.00075v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00075
Code URL: null
Copy Paste: [[2305.00075] On the existence of solutions to adversarial training in multiclass classification](http://arxiv.org/abs/2305.00075) #robust
Summary:
We study three models of the problem of adversarial training in multiclass classification designed to construct robust classifiers against adversarial perturbations of data in the agnostic-classifier setting. We prove the existence of Borel measurable robust classifiers in each model and provide a unified perspective of the adversarial training problem, expanding the connections with optimal transport initiated by the authors in previous work and developing new connections between adversarial training in the multiclass setting and total variation regularization. As a corollary of our results, we prove the existence of Borel measurable solutions to the agnostic adversarial training problem in the binary classification setting, a result that improves results in the literature of adversarial training, where robust classifiers were only known to exist within the enlarged universal $\sigma$-algebra of the feature space.

Title: Temporal Subsampling Diminishes Small Spatial Scales in Recurrent Neural Network Emulators of Geophysical Turbulence. (arXiv:2305.00100v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00100
Code URL: null
Copy Paste: [[2305.00100] Temporal Subsampling Diminishes Small Spatial Scales in Recurrent Neural Network Emulators of Geophysical Turbulence](http://arxiv.org/abs/2305.00100) #robust
Summary:
The immense computational cost of traditional numerical weather and climate models has sparked the development of machine learning (ML) based emulators. Because ML methods benefit from long records of training data, it is common to use datasets that are temporally subsampled relative to the time steps required for the numerical integration of differential equations. Here, we investigate how this often overlooked processing step affects the quality of an emulator's predictions. We implement two ML architectures from a class of methods called reservoir computing: (1) a form of Nonlinear Vector Autoregression (NVAR), and (2) an Echo State Network (ESN). Despite their simplicity, it is well documented that these architectures excel at predicting low dimensional chaotic dynamics. We are therefore motivated to test these architectures in an idealized setting of predicting high dimensional geophysical turbulence as represented by Surface Quasi-Geostrophic dynamics. In all cases, subsampling the training data consistently leads to an increased bias at small spatial scales that resembles numerical diffusion. Interestingly, the NVAR architecture becomes unstable when the temporal resolution is increased, indicating that the polynomial based interactions are insufficient at capturing the detailed nonlinearities of the turbulent flow. The ESN architecture is found to be more robust, suggesting a benefit to the more expensive but more general structure. Spectral errors are reduced by including a penalty on the kinetic energy density spectrum during training, although the subsampling related errors persist. Future work is warranted to understand how the temporal resolution of training data affects other ML architectures.

Title: Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning. (arXiv:2305.00286v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00286
Code URL: null
Copy Paste: [[2305.00286] Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning](http://arxiv.org/abs/2305.00286) #robust
Summary:
Meta-reinforcement learning enables artificial agents to learn from related training tasks and adapt to new tasks efficiently with minimal interaction data. However, most existing research is still limited to narrow task distributions that are parametric and stationary, and does not consider out-of-distribution tasks during the evaluation, thus restricting its application. In this paper, we propose MoSS, a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning to address this challenge. We extend meta-RL to broad non-parametric task distributions which have never been explored before, and also achieve state-of-the-art results in non-stationary and out-of-distribution tasks. Specifically, MoSS consists of a task inference module and a policy module. We utilize the Gaussian mixture model for task representation to imitate the parametric and non-parametric task variations. Additionally, our online adaptation strategy enables the agent to react at the first sight of a task change, thus being applicable in non-stationary tasks. MoSS also exhibits strong generalization robustness in out-of-distribution tasks, which benefits from the reliable and robust task representation. The policy is built on top of an off-policy RL algorithm and the entire network is trained completely off-policy to ensure high sample efficiency. On MuJoCo and Meta-World benchmarks, MoSS outperforms prior works in terms of asymptotic performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization robustness on broad and diverse task distributions.

Title: Robustified Learning for Online Optimization with Memory Costs. (arXiv:2305.00677v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00677
Code URL: null
Copy Paste: [[2305.00677] Robustified Learning for Online Optimization with Memory Costs](http://arxiv.org/abs/2305.00677) #robust
Summary:
Online optimization with memory costs has many real-world applications, where sequential actions are made without knowing the future input. Nonetheless, the memory cost couples the actions over time, adding substantial challenges. Conventionally, this problem has been approached by various expert-designed online algorithms with the goal of achieving bounded worst-case competitive ratios, but the resulting average performance is often unsatisfactory. On the other hand, emerging machine learning (ML) based optimizers can improve the average performance, but suffer from the lack of worst-case performance robustness. In this paper, we propose a novel expert-robustified learning (ERL) approach, achieving both good average performance and robustness. More concretely, for robustness, ERL introduces a novel projection operator that robustifies ML actions by utilizing an expert online algorithm; for average performance, ERL trains the ML optimizer based on a recurrent architecture by explicitly considering downstream expert robustification. We prove that, for any $\lambda\geq1$, ERL can achieve $\lambda$-competitive against the expert algorithm and $\lambda\cdot C$-competitive against the optimal offline algorithm (where $C$ is the expert's competitive ratio). Additionally, we extend our analysis to a novel setting of multi-step memory costs. Finally, our analysis is supported by empirical experiments for an energy scheduling application.
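
The two competitive guarantees stated above chain together in one line. The sketch below assumes purely multiplicative competitive ratios (no additive constants) and uses $\mathrm{cost}(\cdot)$ as hypothetical notation for the total cost of an algorithm on an input sequence; the paper's formal statement may differ in detail.

```latex
% Assumed notation: cost(ERL), cost(Expert), cost(OPT) are total costs on the same input.
\mathrm{cost}(\mathrm{ERL})
  \;\le\; \lambda \cdot \mathrm{cost}(\mathrm{Expert})
  \;\le\; \lambda \cdot C \cdot \mathrm{cost}(\mathrm{OPT}),
  \qquad \lambda \ge 1 .
```

The first inequality is the $\lambda$-competitiveness against the expert, and the second uses the expert's own competitive ratio $C$ against the optimal offline algorithm.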

Title: Strengthening structural baselines for graph classification using Local Topological Profile. (arXiv:2305.00724v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00724
Code URL: https://github.com/j-adamczyk/ltp
Copy Paste: [[2305.00724] Strengthening structural baselines for graph classification using Local Topological Profile](http://arxiv.org/abs/2305.00724) #robust
Summary:
We present the analysis of the topological graph descriptor Local Degree Profile (LDP), which forms a widely used structural baseline for graph classification. Our study focuses on model evaluation in the context of the recently developed fair evaluation framework, which defines rigorous routines for model selection and evaluation for graph classification, ensuring reproducibility and comparability of the results. Based on the obtained insights, we propose a new baseline algorithm called Local Topological Profile (LTP), which extends LDP by using additional centrality measures and local vertex descriptors. The new approach yields results that outperform or come very close to the latest GNNs on all datasets used. Specifically, state-of-the-art results were obtained for 4 out of 9 benchmark datasets. We also consider computational aspects of LDP-based feature extraction and model construction to propose practical improvements affecting execution speed and scalability. This allows for handling modern, large datasets and extends the portfolio of benchmarks used in graph representation learning. As the outcome of our work, we obtained LTP as a simple-to-understand, fast, scalable, yet robust baseline, capable of outcompeting modern graph classification models such as Graph Isomorphism Network (GIN). We provide an open-source implementation at https://github.com/j-adamczyk/LTP.
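
For orientation, the per-node features of LDP as it is commonly described (a node's degree plus the minimum, maximum, mean, and standard deviation of its neighbors' degrees) can be sketched as below. This is an illustrative sketch, not the linked repository's code: the function name `local_degree_profile` and the adjacency-dict input format are assumptions, and the additional centrality measures that LTP adds are not listed in the abstract, so they are not shown.

```python
import statistics


def local_degree_profile(adj):
    """Return {node: (deg, min, max, mean, std)} over neighbour degrees.

    adj: dict mapping each node to an iterable of its neighbours
    (undirected graph, every node present as a key).
    """
    deg = {v: len(set(nbrs)) for v, nbrs in adj.items()}
    profile = {}
    for v, nbrs in adj.items():
        nbr_degs = [deg[u] for u in set(nbrs)]
        if nbr_degs:
            stats = (
                min(nbr_degs),
                max(nbr_degs),
                statistics.fmean(nbr_degs),
                statistics.pstdev(nbr_degs),  # population standard deviation
            )
        else:
            stats = (0, 0, 0.0, 0.0)  # isolated node: no neighbours to summarise
        profile[v] = (deg[v],) + stats
    return profile


if __name__ == "__main__":
    # Path graph a - b - c.
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    for node, feats in local_degree_profile(path).items():
        print(node, feats)
```

A graph-level descriptor is then typically built by aggregating these per-node tuples (for example with histograms) before passing them to a conventional classifier.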

biometric

steal

extraction

Title: An Efficient Plane Extraction Approach for Bundle Adjustment on LiDAR Point clouds. (arXiv:2305.00287v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00287
Code URL: null
Copy Paste: [[2305.00287] An Efficient Plane Extraction Approach for Bundle Adjustment on LiDAR Point clouds](http://arxiv.org/abs/2305.00287) #extraction
Summary:
Bundle adjustment (BA) on LiDAR point clouds has been extensively investigated in recent years due to its ability to optimize multiple poses together, resulting in high accuracy and global consistency for point clouds. However, the accuracy and speed of LiDAR bundle adjustment depend on the quality of plane extraction, which provides point association for LiDAR BA. In this study, we propose a novel and efficient voxel-based approach for plane extraction that is specially designed to provide point association for LiDAR bundle adjustment. To begin, we partition the space into multiple voxels of a fixed size and then split these root voxels based on whether the points are on the same plane, using an octree structure. We also design a novel plane determination method based on principal component analysis (PCA), which segments the points into four even quarters and compares their minimum eigenvalues with that of the initial point cloud. Finally, we adopt a plane merging method to prevent too many small planes from being in a single voxel, which can increase the optimization time required for BA. Our experimental results on HILTI demonstrate that our approach achieves the best precision and least time cost compared to other plane extraction methods.
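
The building block behind a PCA-based plane test is the smallest eigenvalue of the points' covariance matrix: it is near zero when the points lie on a plane. The sketch below shows only that building block, under stated assumptions. The abstract does not specify how the quarters are formed or how their eigenvalues are compared, so that step is left out; the function names and the `threshold` parameter are hypothetical, and the eigenvalue is computed with the standard closed-form (trigonometric) solution for symmetric 3x3 matrices.

```python
import math


def covariance3(points):
    """3x3 covariance matrix of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def min_eigenvalue_sym3(a):
    """Smallest eigenvalue of a symmetric 3x3 matrix (trigonometric method)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    if p2 == 0.0:
        return q  # matrix is a multiple of the identity
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)


def is_planar(points, threshold):
    """Hypothetical test: planar if the smallest eigenvalue is below threshold."""
    return min_eigenvalue_sym3(covariance3(points)) < threshold


if __name__ == "__main__":
    flat = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
    cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
    print(is_planar(flat, 1e-6), is_planar(cube, 1e-3))
```

A suitable threshold depends on sensor noise and voxel size, so the values in the demo are placeholders rather than recommendations.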

Title: Event Camera as Region Proposal Network. (arXiv:2305.00718v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00718
Code URL: null
Copy Paste: [[2305.00718] Event Camera as Region Proposal Network](http://arxiv.org/abs/2305.00718) #extraction
Summary:
The human eye consists of two types of photoreceptors, rods and cones. Rods are responsible for monochrome vision, and cones for color vision. The number of rods is much higher than the cones, which means that most human vision processing is done in monochrome. An event camera reports the change in pixel intensity and is analogous to rods. Event and color cameras in computer vision are like rods and cones in human vision. Humans can notice objects moving in the peripheral vision (far right and left), but we cannot classify them (think of someone passing by on your far left or far right: this can trigger your attention without your knowing who they are). Thus, rods act as a region proposal network (RPN) in human vision. Therefore, an event camera can act as a region proposal network in deep learning. Two-stage object detectors in deep learning, such as Mask R-CNN, consist of a backbone for feature extraction and an RPN. Currently, the RPN uses a brute-force method, trying out all possible bounding boxes to detect an object. This requires much computation time to generate region proposals, making two-stage detectors inconvenient for fast applications. This work replaces the RPN in the Mask R-CNN of detectron2 with an event camera for generating proposals for moving objects, thus saving time and being computationally less expensive. The proposed approach is faster than the two-stage detectors with comparable accuracy.

Title: Hierarchical Dialogue Understanding with Special Tokens and Turn-level Attention. (arXiv:2305.00262v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.00262
Code URL: https://github.com/shawx825/hidialog
Copy Paste: [[2305.00262] Hierarchical Dialogue Understanding with Special Tokens and Turn-level Attention](http://arxiv.org/abs/2305.00262) #extraction
Summary:
Compared with standard text, understanding dialogue is more challenging for machines because of the dynamic and unexpected semantic changes in each turn. To model such inconsistent semantics, we propose a simple but effective Hierarchical Dialogue Understanding model, HiDialog. Specifically, we first insert multiple special tokens into a dialogue and propose the turn-level attention to learn turn embeddings hierarchically. Then, a heterogeneous graph module is leveraged to polish the learned embeddings. We evaluate our model on various dialogue understanding tasks including dialogue relation extraction, dialogue emotion recognition, and dialogue act classification. Results show that our simple approach achieves state-of-the-art performance on all three tasks above. All our source code is publicly available at https://github.com/ShawX825/HiDialog.

Title: Accurate ignition detection of solid fuel particles using machine learning. (arXiv:2305.00004v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00004
Code URL: null
Copy Paste: [[2305.00004] Accurate ignition detection of solid fuel particles using machine learning](http://arxiv.org/abs/2305.00004) #extraction
Summary:
In the present work, we focus on the accurate determination of single-particle ignition using high-speed optical diagnostics combined with machine learning approaches. Ignition of individual particles in a laminar flow reactor is visualized by simultaneous 10 kHz OH-LIF and DBI measurements. Two coal particle sizes of 90-125 µm and 160-200 µm are investigated in conventional air and oxy-fuel conditions with increasing oxygen concentrations. Ignition delay times are first evaluated with threshold methods, revealing obvious deviations compared to the ground truth detected by the human eye. Then, residual networks (ResNet) and feature pyramidal networks (FPN) are trained on the ground truth and applied to predict the ignition time. Both networks are capable of detecting ignition with significantly higher accuracy and precision. Besides, influences of input data and depth of networks on the prediction performance of a trained model are examined. The current study shows that the hierarchical feature extraction of the convolutional networks clearly facilitates data evaluation for high-speed optical measurements and could be transferred to other solid fuel experiments with similar boundary conditions.

Title: Predictability of Machine Learning Algorithms and Related Feature Extraction Techniques. (arXiv:2305.00449v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00449
Code URL: null
Copy Paste: [[2305.00449] Predictability of Machine Learning Algorithms and Related Feature Extraction Techniques](http://arxiv.org/abs/2305.00449) #extraction
Summary:
This thesis designs a prediction system based on matrix factorization to predict the classification accuracy of a specific model on a particular dataset. In this thesis, we conduct comprehensive empirical research on more than fifty datasets that we collected from the OpenML website. We study the performance prediction of three fundamental machine learning algorithms, namely, random forest, XGBoost, and MultiLayer Perceptron (MLP). In particular, we obtain the following results: 1. Predictability of fine-tuned models using coarse-tuned variants. 2. Predictability of MLP using feature extraction techniques. 3. Prediction of model performance using implicit feedback.

membership infer

Title: Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. (arXiv:2305.00118v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.00118
Code URL: https://github.com/bamman-group/gpt4-books
Copy Paste: [[2305.00118] Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4](http://arxiv.org/abs/2305.00118) #membership infer
Summary:
In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
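
A name cloze query masks a character name in a passage and asks the model to restore it; high accuracy on passages from a book suggests the book was seen in training. The sketch below shows only the scoring logic under assumptions of its own: `make_cloze`, `name_cloze_accuracy`, the `[MASK]` marker, and the `predict` callable are hypothetical stand-ins, not the authors' code or any real model API.

```python
from typing import Callable, Iterable, Tuple

MASK = "[MASK]"  # assumed placeholder; the marker used in the paper may differ


def make_cloze(passage: str, name: str) -> str:
    """Replace the first occurrence of `name` in `passage` with the mask marker."""
    if name not in passage:
        raise ValueError("name does not occur in passage")
    return passage.replace(name, MASK, 1)


def name_cloze_accuracy(
    examples: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of (passage, name) pairs for which `predict` restores the exact name."""
    pairs = list(examples)
    if not pairs:
        raise ValueError("need at least one example")
    hits = sum(predict(make_cloze(passage, name)).strip() == name for passage, name in pairs)
    return hits / len(pairs)


# Toy stand-in for a language model: it always answers "Alice".
def always_alice(prompt: str) -> str:
    return "Alice"


examples = [
    ("Alice opened the door.", "Alice"),
    ("Then Bob left the room.", "Bob"),
]
print(name_cloze_accuracy(examples, always_alice))  # 0.5
```

In practice `predict` would wrap a call to the model under study, and accuracy would be aggregated per book so that books can be ranked by how well the model recalls them.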

federate

Title: FCA: Taming Long-tailed Federated Medical Image Classification by Classifier Anchoring. (arXiv:2305.00738v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00738
Code URL: https://github.com/jwicaksana/fca
Copy Paste: [[2305.00738] FCA: Taming Long-tailed Federated Medical Image Classification by Classifier Anchoring](http://arxiv.org/abs/2305.00738) #federate
Summary:
Limited training data and severe class imbalance impose significant challenges to developing clinically robust deep learning models. Federated learning (FL) addresses the former by enabling different medical clients to collaboratively train a deep model without sharing data. However, the class imbalance problem persists due to inter-client class distribution variations. To overcome this, we propose federated classifier anchoring (FCA) by adding a personalized classifier at each client to guide and debias the federated model through consistency learning. Additionally, FCA debiases the federated classifier and each client's personalized classifier based on their respective class distributions, thus mitigating divergence. With FCA, the federated feature extractor effectively learns discriminative features suitably globally for federation as well as locally for all participants. In clinical practice, the federated model is expected to be both generalized, performing well across clients, and specialized, benefiting each individual client from collaboration. According to this, we propose a novel evaluation metric to assess models' generalization and specialization performance globally on an aggregated public test set and locally at each client. Through comprehensive comparison and evaluation, FCA outperforms the state-of-the-art methods with large margins for federated long-tailed skin lesion classification and intracranial hemorrhage classification, making it a more feasible solution in clinical settings.

Title: Towards Unbiased Training in Federated Open-world Semi-supervised Learning. (arXiv:2305.00771v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00771
Code URL: null
Copy Paste: [[2305.00771] Towards Unbiased Training in Federated Open-world Semi-supervised Learning](http://arxiv.org/abs/2305.00771) #federate
Summary:
Federated Semi-supervised Learning (FedSSL) has emerged as a new paradigm for allowing distributed clients to collaboratively train a machine learning model over scarce labeled data and abundant unlabeled data. However, existing works for FedSSL rely on a closed-world assumption that all local training data and global testing data are from seen classes observed in the labeled dataset. It is crucial to go one step further: adapting FL models to an open-world setting, where unseen classes exist in the unlabeled data. In this paper, we propose a novel Federated open-world Semi-Supervised Learning (FedoSSL) framework, which can solve the key challenge in distributed and open-world settings, i.e., the biased training process for heterogeneously distributed unseen classes. Specifically, since the advent of a certain unseen class depends on a client basis, the locally unseen classes (exist in multiple clients) are likely to receive differentiated superior aggregation effects than the globally unseen classes (exist only in one client). We adopt an uncertainty-aware suppressed loss to alleviate the biased training between locally unseen and globally unseen classes. Besides, we enable a calibration module supplementary to the global aggregation to avoid potential conflicting knowledge transfer caused by inconsistent data distribution among different clients. The proposed FedoSSL can be easily adapted to state-of-the-art FL methods, which is also validated via extensive experiments on benchmarks and real-world datasets (CIFAR-10, CIFAR-100 and CINIC-10).

fair

Title: Learning to Re-rank with Constrained Meta-Optimal Transport. (arXiv:2305.00319v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00319
Code URL: null
Copy Paste: [[2305.00319] Learning to Re-rank with Constrained Meta-Optimal Transport](http://arxiv.org/abs/2305.00319) #fair
Summary:
Many re-ranking strategies in search systems rely on stochastic ranking policies, encoded as Doubly-Stochastic (DS) matrices, that satisfy desired ranking constraints in expectation, e.g., Fairness of Exposure (FOE). These strategies are generally two-stage pipelines: i) an offline re-ranking policy construction step and ii) an online sampling of rankings step. Building a re-ranking policy requires repeatedly solving a constrained optimization problem, one for each issued query. Thus, it is necessary to recompute the optimization procedure for any new/unseen query. Regarding sampling, the Birkhoff-von-Neumann decomposition (BvND) is the favored approach to draw rankings from any DS-based policy. However, the BvND is too costly to compute online. Hence, the BvND as a sampling solution is memory-consuming as it can grow as $\mathcal{O}(N\, n^2)$ for $N$ queries and $n$ documents.

This paper offers a novel, fast, lightweight way to predict fair stochastic re-ranking policies: Constrained Meta-Optimal Transport (CoMOT). This method fits a neural network shared across queries like a learning-to-rank system. We also introduce Gumbel-Matching Sampling (GumMS), an online sampling approach from DS-based policies. Our proposed pipeline, CoMOT + GumMS, only needs to store the parameters of a single model, and it generalizes to unseen queries. We empirically evaluated our pipeline on the TREC 2019 and 2020 datasets under FOE constraints. Our experiments show that CoMOT rapidly predicts fair re-ranking policies on held-out data, with a speed-up proportional to the average number of documents per query. It also displays fairness and ranking performance similar to the original optimization-based policy. Furthermore, we empirically validate the effectiveness of GumMS to approximate DS-based policies in expectation.
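
For readers unfamiliar with doubly-stochastic ranking policies, the sketch below builds one from a score matrix and computes each document's expected exposure. It uses Sinkhorn normalization as a stand-in technique; it is neither CoMOT nor GumMS, and it imposes no fairness constraint. The function names, the score values, and the logarithmic position weights are all assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(scores: Matrix, iterations: int = 200) -> Matrix:
    """Turn a square score matrix into an (approximately) doubly-stochastic one.

    Entry [d][k] of the result is read as the probability that document d
    is shown at rank position k.
    """
    n = len(scores)
    p = [[math.exp(s) for s in row] for row in scores]  # make all entries positive
    for _ in range(iterations):
        p = [[v / sum(row) for v in row] for row in p]  # rows sum to 1
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return p


def expected_exposure(policy: Matrix, position_weights: List[float]) -> List[float]:
    """Expected exposure of each document: sum over positions of P[d][k] * weight[k]."""
    return [sum(pk * w for pk, w in zip(row, position_weights)) for row in policy]


# Hypothetical relevance scores for 3 documents over 3 positions.
scores = [[2.0, 1.0, 0.0], [1.0, 1.5, 0.5], [0.0, 0.5, 1.0]]
policy = sinkhorn(scores)
weights = [1.0 / math.log2(k + 2) for k in range(3)]  # DCG-style position discount
exposure = expected_exposure(policy, weights)
print([round(e, 3) for e in exposure])
```

Because every column of the policy sums to 1, the total exposure across documents equals the sum of the position weights; a constraint such as FOE then governs how that fixed budget is split among documents.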

interpretability

Title: Discover and Cure: Concept-aware Mitigation of Spurious Correlation. (arXiv:2305.00650v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00650
Code URL: https://github.com/wuyxin/disc
Copy Paste: [[2305.00650] Discover and Cure: Concept-aware Mitigation of Spurious Correlation](http://arxiv.org/abs/2305.00650) #interpretability
Summary:
Deep neural networks often rely on spurious correlations to make predictions, which hinders generalization beyond training environments. For instance, models that associate cats with bed backgrounds can fail to predict the existence of cats in other environments without beds. Mitigating spurious correlations is crucial in building trustworthy models. However, the existing works lack transparency to offer insights into the mitigation process. In this work, we propose an interpretable framework, Discover and Cure (DISC), to tackle the issue. With human-interpretable concepts, DISC iteratively 1) discovers unstable concepts across different environments as spurious attributes, then 2) intervenes on the training data using the discovered concepts to reduce spurious correlation. Across systematic experiments, DISC provides superior generalization ability and interpretability than the existing approaches. Specifically, it outperforms the state-of-the-art methods on an object recognition task and a skin-lesion classification task by 7.5% and 9.6%, respectively. Additionally, we offer theoretical analysis and guarantees to understand the benefits of models trained by DISC. Code and data are available at https://github.com/Wuyxin/DISC.

Title: TPMIL: Trainable Prototype Enhanced Multiple Instance Learning for Whole Slide Image Classification. (arXiv:2305.00696v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00696
Code URL: https://github.com/LitaoYang-Jet/TPMIL
Copy Paste: [[2305.00696] TPMIL: Trainable Prototype Enhanced Multiple Instance Learning for Whole Slide Image Classification](http://arxiv.org/abs/2305.00696) #interpretability
Summary:
Digital pathology based on whole slide images (WSIs) plays a key role in cancer diagnosis and clinical practice. Due to the high resolution of the WSI and the unavailability of patch-level annotations, WSI classification is usually formulated as a weakly supervised problem, which relies on multiple instance learning (MIL) based on patches of a WSI. In this paper, we aim to learn an optimal patch-level feature space by integrating prototype learning with MIL. To this end, we develop a Trainable Prototype enhanced deep MIL (TPMIL) framework for weakly supervised WSI classification. In contrast to the conventional methods which rely on a certain number of selected patches for feature space refinement, we softly cluster all the instances by allocating them to their corresponding prototypes. Additionally, our method is able to reveal the correlations between different tumor subtypes through distances between corresponding trained prototypes. More importantly, TPMIL also provides more accurate interpretability based on the distance of the instances from the trained prototypes, which serves as an alternative to the conventional attention score-based interpretability. We test our method on two WSI datasets and it achieves a new SOTA. GitHub repository: https://github.com/LitaoYang-Jet/TPMIL

Title: How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. (arXiv:2305.00586v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2305.00586
Code URL: null
Copy Paste: [[2305.00586] How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model](http://arxiv.org/abs/2305.00586) #interpretability
Summary:
Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we show that our circuit generalizes to other tasks, playing a role in other greater-than scenarios.
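
One simple way to score the greater-than behaviour described above is to sum the probability a model assigns to two-digit completions larger than the start year. The sketch below does that for a given distribution; `mass_above_start` is a hypothetical helper, the uniform distribution is a placeholder rather than GPT-2 output, and the paper's own evaluation metric may differ.

```python
from typing import Dict


def mass_above_start(probs: Dict[str, float], start_yy: int) -> float:
    """Total probability on two-digit completions strictly greater than `start_yy`.

    `probs` maps candidate next tokens (e.g. "41") to probabilities;
    tokens that are not exactly two digits are ignored.
    """
    return sum(
        p
        for token, p in probs.items()
        if len(token) == 2 and token.isdigit() and int(token) > start_yy
    )


# Placeholder distribution: every completion "00".."99" equally likely.
uniform = {f"{yy:02d}": 0.01 for yy in range(100)}

# For "The war lasted from the year 1732 to the year 17", the start is 32.
print(round(mass_above_start(uniform, 32), 2))  # 0.67
```

A model with no notion of ordering would score about 0.67 here, since 67 of the 100 completions exceed 32; a model that has learned the task should score noticeably higher.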

explainability

Title: Causalainer: Causal Explainer for Automatic Video Summarization. (arXiv:2305.00455v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00455
Code URL: null
Copy Paste: [[2305.00455] Causalainer: Causal Explainer for Automatic Video Summarization](http://arxiv.org/abs/2305.00455) #explainability
Summary:
The goal of video summarization is to automatically shorten videos so that they convey the overall story without losing relevant information. In many application scenarios, improper video summarization can have a large impact. For example in forensics, the quality of the generated video summary will affect an investigator's judgment while in journalism it might yield undesired bias. Because of this, modeling explainability is a key concern. One of the best ways to address the explainability challenge is to uncover the causal relations that steer the process and lead to the result. Current machine learning-based video summarization algorithms learn optimal parameters but do not uncover causal relationships. Hence, they suffer from a relative lack of explainability. In this work, a Causal Explainer, dubbed Causalainer, is proposed to address this issue. Multiple meaningful random variables and their joint distributions are introduced to characterize the behaviors of key components in the problem of video summarization. In addition, helper distributions are introduced to enhance the effectiveness of model training. In visual-textual input scenarios, the extra input can decrease the model performance. A causal semantics extractor is designed to tackle this issue by effectively distilling the mutual information from the visual and textual inputs. Experimental results on commonly used benchmarks demonstrate that the proposed method achieves state-of-the-art performance while being more explainable.

watermark

diffusion

Title: Unsupervised Discovery of 3D Hierarchical Structure with Generative Diffusion Features. (arXiv:2305.00067v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00067
Code URL: null
Copy Paste: [[2305.00067] Unsupervised Discovery of 3D Hierarchical Structure with Generative Diffusion Features](http://arxiv.org/abs/2305.00067) #diffusion
Summary:
Inspired by recent findings that generative diffusion models learn semantically meaningful representations, we use them to discover the intrinsic hierarchical structure in biomedical 3D images using unsupervised segmentation. We show that features of diffusion models from different stages of a U-Net-based ladder-like architecture capture different hierarchy levels in 3D biomedical images. We design three losses to train a predictive unsupervised segmentation network that encourages the decomposition of 3D volumes into meaningful nested subvolumes that represent a hierarchy. First, we pretrain 3D diffusion models and use the consistency of their features across subvolumes. Second, we use the visual consistency between subvolumes. Third, we use the invariance to photometric augmentations as a regularizer. Our models achieve better performance than prior unsupervised structure discovery approaches on challenging biologically-inspired synthetic datasets and on a real-world brain tumor MRI dataset.

Title: Class-Balancing Diffusion Models. (arXiv:2305.00562v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00562
Code URL: null
Copy Paste: [[2305.00562] Class-Balancing Diffusion Models](http://arxiv.org/abs/2305.00562) #diffusion
Summary:
Diffusion-based models have shown the merits of generating high-quality visual data while preserving better diversity in recent studies. However, such observation is only justified with curated data distribution, where the data samples are nicely pre-processed to be uniformly distributed in terms of their labels. In practice, a long-tailed data distribution appears more common and how diffusion models perform on such class-imbalanced data remains unknown. In this work, we first investigate this problem and observe significant degradation in both diversity and fidelity when the diffusion model is trained on datasets with class-imbalanced distributions. Especially in tail classes, the generations largely lose diversity and we observe severe mode-collapse issues. To tackle this problem, we start from the hypothesis that the data distribution is not class-balanced, and propose Class-Balancing Diffusion Models (CBDM) that are trained with a distribution adjustment regularizer as a solution. Experiments show that images generated by CBDM exhibit higher diversity and quality in both quantitative and qualitative ways. Our method benchmarked the generation results on the CIFAR100/CIFAR100LT datasets and shows outstanding performance on the downstream recognition task.

Title: Diffusion Models for Time Series Applications: A Survey. (arXiv:2305.00624v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00624
Code URL: null
Copy Paste: [[2305.00624] Diffusion Models for Time Series Applications: A Survey](http://arxiv.org/abs/2305.00624) #diffusion
Summary:
Diffusion models, a family of generative models based on deep learning, have become increasingly prominent in cutting-edge machine learning research. With a distinguished performance in generating samples that resemble the observed data, diffusion models are widely used in image, video, and text synthesis nowadays. In recent years, the concept of diffusion has been extended to time series applications, and many powerful models have been developed. Considering the lack of a methodical summary and discussion of these models, we provide this survey as an elementary resource for new researchers in this area and also an inspiration to motivate future research. For better understanding, we include an introduction to the basics of diffusion models. Beyond this, we primarily focus on diffusion-based methods for time series forecasting, imputation, and generation, and present them in three separate sections. We also compare different methods for the same application and highlight their connections where applicable. Lastly, we summarize the common limitations of diffusion-based methods and highlight potential future research directions.

noise learning

data-free

transformer

Title: MMViT: Multiscale Multiview Vision Transformers. (arXiv:2305.00104v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00104
Code URL: null
Copy Paste: [[2305.00104] MMViT: Multiscale Multiview Vision Transformers](http://arxiv.org/abs/2305.00104) #transformer
Summary:
We present Multiscale Multiview Vision Transformers (MMViT), which introduces multiscale feature maps and multiview encodings to transformer models. Our model encodes different views of the input signal and builds several channel-resolution feature stages to process the multiple views of the input at different resolutions in parallel. At each scale stage, we use a cross-attention block to fuse information across different views. This enables the MMViT model to acquire complex high-dimensional representations of the input at different resolutions. The proposed model can serve as a backbone model in multiple domains. We demonstrate the effectiveness of MMViT on audio and image classification tasks, achieving state-of-the-art results.

Title: Searching from Area to Point: A Hierarchical Framework for Semantic-Geometric Combined Feature Matching. (arXiv:2305.00194v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00194
Code URL: null
Copy Paste: [[2305.00194] Searching from Area to Point: A Hierarchical Framework for Semantic-Geometric Combined Feature Matching](http://arxiv.org/abs/2305.00194) #transformer
Summary:
Feature matching is a crucial technique in computer vision. Essentially, it can be considered as a searching problem to establish correspondences between images. The key challenge in this task lies in the lack of a well-defined search space, leading to inaccurate point matching of current methods. In pursuit of a reasonable matching search space, this paper introduces a hierarchical feature matching framework: Area to Point Matching (A2PM), to first find semantic area matches between images, and then perform point matching on area matches, thus setting the search space as the area matches with salient features to achieve high matching precision. This proper search space of A2PM framework also alleviates the accuracy limitation in state-of-the-art Transformer-based matching methods. To realize this framework, we further propose Semantic and Geometry Area Matching (SGAM) method, which utilizes semantic prior and geometry consistency to establish accurate area matches between images. By integrating SGAM with off-the-shelf Transformer-based matchers, our feature matching methods, adopting the A2PM framework, achieve encouraging precision improvements in massive point matching and pose estimation experiments for present arts.

Title: Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT. (arXiv:2305.00201v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00201
Code URL: null
Copy Paste: [[2305.00201] Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT](http://arxiv.org/abs/2305.00201) #transformer
Summary:
Prompts have been proven to play a crucial role in large language models, and in recent years, vision models have also been using prompts to improve scalability for multiple downstream tasks. In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification, which we call Instruction-ViT. The key idea is to implement multi-modal prompts (text or image prompt) related to category information to guide the fine-tuning of the model. Based on the experiments of several image captioning tasks, the performance and domain adaptability were improved. Our work provided an innovative strategy to fuse multi-modal prompts with better performance and faster adaptability for visual classification models.

Title: MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer. (arXiv:2305.00355v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00355
Code URL: https://github.com/YoucanBaby/MH-DETR
Copy Paste: [[2305.00355] MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer](http://arxiv.org/abs/2305.00355) #transformer
Summary:
With the increasing demand for video understanding, video moment and highlight detection (MHD) has emerged as a critical research topic. MHD aims to localize all moments and predict clip-wise saliency scores simultaneously. Despite progress made by existing DETR-based methods, we observe that these methods coarsely fuse features from different modalities, which weakens the temporal intra-modal context and results in insufficient cross-modal interaction. To address this issue, we propose MH-DETR (Moment and Highlight Detection Transformer) tailored for MHD. Specifically, we introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context. Moreover, to obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder, seamlessly integrating visual and textual features. Comprehensive experiments on QVHighlights, Charades-STA, Activity-Net, and TVSum datasets show that MH-DETR outperforms existing state-of-the-art methods, demonstrating its effectiveness and superiority. Our code is available at https://github.com/YoucanBaby/MH-DETR.

Title: TransCAR: Transformer-based Camera-And-Radar Fusion for 3D Object Detection. (arXiv:2305.00397v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00397
Code URL: null
Copy Paste: [[2305.00397] TransCAR: Transformer-based Camera-And-Radar Fusion for 3D Object Detection](http://arxiv.org/abs/2305.00397) #transformer
Summary:
Despite radar's popularity in the automotive industry, for fusion-based 3D object detection, most existing works focus on LiDAR and camera fusion. In this paper, we propose TransCAR, a Transformer-based Camera-And-Radar fusion solution for 3D object detection. Our TransCAR consists of two modules. The first module learns 2D features from surround-view camera images and then uses a sparse set of 3D object queries to index into these 2D features. The vision-updated queries then interact with each other via a transformer self-attention layer. The second module learns radar features from multiple radar scans and then applies a transformer decoder to learn the interactions between radar features and vision-updated queries. The cross-attention layer within the transformer decoder can adaptively learn the soft-association between the radar features and vision-updated queries instead of hard-association based on sensor calibration only. Finally, our model estimates a bounding box per query using a set-to-set Hungarian loss, which enables the method to avoid non-maximum suppression. TransCAR improves the velocity estimation using the radar scans without temporal information. The superior experimental results of our TransCAR on the challenging nuScenes dataset illustrate that our TransCAR outperforms state-of-the-art Camera-Radar fusion-based 3D object detection approaches.
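
The set-to-set Hungarian loss mentioned above rests on a one-to-one matching between predicted boxes and ground-truth boxes that minimizes total cost; because each ground-truth object is claimed by exactly one query, duplicate suppression is unnecessary. The sketch below finds that matching by brute force over permutations, which is only practical for tiny sets; the Hungarian algorithm solves the same problem in polynomial time. The cost values and the name `best_assignment` are made up for illustration, and a square cost matrix is assumed.

```python
from itertools import permutations
from typing import Sequence, Tuple


def best_assignment(cost: Sequence[Sequence[float]]) -> Tuple[Tuple[int, ...], float]:
    """Return (assignment, total_cost) for a square cost matrix.

    cost[q][g] is the cost of matching query q to ground-truth object g, and
    assignment[q] is the ground-truth index given to query q.
    """
    n = len(cost)

    def total(perm: Tuple[int, ...]) -> float:
        return sum(cost[q][perm[q]] for q in range(n))

    best = min(permutations(range(n)), key=total)  # O(n!) search, toy sizes only
    return best, total(best)


# Hypothetical matching costs for 3 queries and 3 ground-truth boxes.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
assignment, total_cost = best_assignment(cost)
print(assignment, round(total_cost, 2))  # (1, 0, 2) 0.6
```

In detectors of this family there are usually more queries than objects, so the ground-truth set is padded with "no object" entries before matching; that padding is omitted here.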

Title: Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection. (arXiv:2305.00514v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00514
Code URL: https://github.com/dragonlee258079/DMT
Copy Paste: [[2305.00514] Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection](http://arxiv.org/abs/2305.00514) #transformer
Summary:
Most previous co-salient object detection works mainly focus on extracting co-salient cues via mining the consistency relations across images while ignoring explicit exploration of background regions. In this paper, we propose a Discriminative co-saliency and background Mining Transformer framework (DMT) based on several economical multi-grained correlation modules to explicitly mine both co-saliency and background information and effectively model their discrimination. Specifically, we first propose a region-to-region correlation module for introducing inter-image relations to pixel-wise segmentation features while maintaining computational efficiency. Then, we use two types of pre-defined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token correlation and co-saliency token-to-token correlation modules. We also design a token-guided feature refinement module to enhance the discriminability of the segmentation features under the guidance of the learned tokens. We perform iterative mutual promotion for the segmentation feature extraction and token construction. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method. The source code is available at: https://github.com/dragonlee258079/DMT.

Title: Multimodal Graph Transformer for Multimodal Question Answering. (arXiv:2305.00581v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00581
Code URL: null
Copy Paste: [[2305.00581] Multimodal Graph Transformer for Multimodal Question Answering](http://arxiv.org/abs/2305.00581) #transformer
Summary:
Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, to the vanilla self-attention as effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.

Title: Consolidator: Mergeable Adapter with Grouped Connections for Visual Adaptation. (arXiv:2305.00603v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00603
Code URL: https://github.com/beyondhtx/Consolidator
Copy Paste: [[2305.00603] Consolidator: Mergeable Adapter with Grouped Connections for Visual Adaptation](http://arxiv.org/abs/2305.00603) #transformer
Summary:
Recently, transformers have shown strong ability as visual feature extractors, surpassing traditional convolution-based models in various scenarios. However, the success of vision transformers largely owes to their capacity to accommodate numerous parameters. As a result, new challenges for adapting large models to downstream tasks arise. On the one hand, classic fine-tuning tunes all parameters in a huge model for every task and thus easily falls into overfitting, leading to inferior performance. On the other hand, on resource-limited devices, fine-tuning stores a full copy of parameters and thus is usually impracticable for the shortage of storage space. However, few works have focused on how to efficiently and effectively transfer knowledge in a vision transformer. Existing methods did not dive into the properties of visual features, leading to inferior performance. Moreover, some of them bring heavy inference cost though benefiting storage. To tackle these problems, we propose consolidator to modify the pre-trained model with the addition of a small set of tunable parameters to temporarily store the task-specific knowledge while freezing the backbone model. Motivated by the success of group-wise convolution, we adopt grouped connections across the features extracted by fully connected layers to construct tunable parts in a consolidator. To further enhance the model's capacity to transfer knowledge under a constrained storage budget and keep inference efficient, we consolidate the parameters in two stages: 1. between adaptation and storage, and 2. between loading and inference. On a series of downstream visual tasks, our consolidator can reach up to 7.56 better accuracy than full fine-tuning with merely 0.35% parameters, and outperform state-of-the-art parameter-efficient tuning methods by a clear margin. Code is available at https://github.com/beyondhtx/Consolidator.

Title: End to End Lane detection with One-to-Several Transformer. (arXiv:2305.00675v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00675
Code URL: https://github.com/zkyseu/O2SFormer
Copy Paste: [[2305.00675] End to End Lane detection with One-to-Several Transformer](http://arxiv.org/abs/2305.00675) #transformer
Summary:
Although lane detection methods have shown impressive performance in real-world scenarios, most methods require post-processing that is not robust enough. Therefore, end-to-end detectors like DEtection TRansformer (DETR) have been introduced in lane detection. However, one-to-one label assignment in DETR can degrade the training efficiency due to label semantic conflicts. Besides, the positional query in DETR is unable to provide an explicit positional prior, making it difficult to optimize. In this paper, we present the One-to-Several Transformer (O2SFormer). We first propose the one-to-several label assignment, which combines one-to-one and one-to-many label assignments to improve the training efficiency while keeping end-to-end detection. To overcome the difficulty in optimizing one-to-one assignment, we further propose the layer-wise soft label, which adjusts the positive weight of positive lane anchors across different decoder layers. Finally, we design the dynamic anchor-based positional query to explore positional prior by incorporating lane anchors into the positional query. Experimental results show that O2SFormer significantly speeds up the convergence of DETR and outperforms Transformer-based and CNN-based detectors on the CULane dataset. Code will be available at https://github.com/zkyseu/O2SFormer.

Title: What Do Self-Supervised Vision Transformers Learn?. (arXiv:2305.00729v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00729
Code URL: https://github.com/naver-ai/cl-vs-mim
Copy Paste: [[2305.00729] What Do Self-Supervised Vision Transformers Learn?](http://arxiv.org/abs/2305.00729) #transformer
Summary:
We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance of downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, but MIM utilizes high-frequencies. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Upon these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.

Title: RViDeformer: Efficient Raw Video Denoising Transformer with a Larger Benchmark Dataset. (arXiv:2305.00767v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00767
Code URL: null
Copy Paste: [[2305.00767] RViDeformer: Efficient Raw Video Denoising Transformer with a Larger Benchmark Dataset](http://arxiv.org/abs/2305.00767) #transformer
Summary:
In recent years, raw video denoising has garnered increased attention due to the consistency with the imaging process and well-studied noise modeling in the raw domain. However, two problems still hinder the denoising performance. Firstly, there is no large dataset with realistic motions for supervised raw video denoising, as capturing noisy and clean frames for real dynamic scenes is difficult. To address this, we propose recapturing existing high-resolution videos displayed on a 4K screen with high-low ISO settings to construct noisy-clean paired frames. In this way, we construct a video denoising dataset (named ReCRVD) with 120 groups of noisy-clean videos, with ISO values ranging from 1600 to 25600. Secondly, while non-local temporal-spatial attention is beneficial for denoising, it often leads to heavy computation costs. We propose an efficient raw video denoising transformer network (RViDeformer) that explores both short and long-distance correlations. Specifically, we propose multi-branch spatial and temporal attention modules, which explore the patch correlations from the local window, local low-resolution window, global downsampled window, and neighbor-involved window, and then fuse them together. We employ reparameterization to reduce computation costs. Our network is trained in both supervised and unsupervised manners, achieving the best performance compared with state-of-the-art methods. Additionally, the model trained with our proposed dataset (ReCRVD) outperforms the model trained with the previous benchmark dataset (CRVD) when evaluated on real-world outdoor noisy videos. Our code and dataset will be released after the acceptance of this work.

Title: Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL. (arXiv:2305.00567v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00567
Code URL: https://github.com/baitingzbt/peda
Copy Paste: [[2305.00567] Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL](http://arxiv.org/abs/2305.00567) #transformer
Summary:
The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.
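
The hypervolume metric named in the abstract can be illustrated for the two-objective case. This is a minimal sketch, assuming both objectives are maximized and that the evaluator supplies a reference point `ref`; the function name and the sample points are illustrative and do not come from the PEDA code or the D4MORL benchmark.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: -p[0])
    hv = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the new horizontal strip this point contributes.
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

front = [(1, 2), (2, 1)]
print(hypervolume_2d(front, ref=(0, 0)))             # 3.0
print(hypervolume_2d(front + [(1, 1)], ref=(0, 0)))  # 3.0, since (1, 1) is dominated
```

A larger hypervolume means the set of returns covers more of the objective space, which is why it is commonly used to compare approximations of a Pareto front.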

Title: Dynamic Transfer Learning across Graphs. (arXiv:2305.00664v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00664
Code URL: null
Copy Paste: [[2305.00664] Dynamic Transfer Learning across Graphs](http://arxiv.org/abs/2305.00664) #transformer
Summary:
Transferring knowledge across graphs plays a pivotal role in many high-stake domains, ranging from transportation networks to e-commerce networks, from neuroscience to finance. To date, the vast majority of existing works assume both source and target domains are sampled from a universal and stationary distribution. However, many real-world systems are intrinsically dynamic, where the underlying domains are evolving over time. To bridge the gap, we propose to shift the problem to the dynamic setting and ask: given the label-rich source graphs and the label-scarce target graphs observed in previous T timestamps, how can we effectively characterize the evolving domain discrepancy and optimize the generalization performance of the target domain at the incoming T+1 timestamp? To answer the question, for the first time, we propose a generalization bound under the setting of dynamic transfer learning across graphs, which implies the generalization performance is dominated by domain evolution and domain discrepancy between source and target domains. Inspired by the theoretical results, we propose a novel generic framework DyTrans to improve knowledge transferability across dynamic graphs. In particular, we start with a transformer-based temporal encoding module to model temporal information of the evolving domains; then, we further design a dynamic domain unification module to efficiently learn domain-invariant representations across the source and target domains. Finally, extensive experiments on various real-world datasets demonstrate the effectiveness of DyTrans in transferring knowledge from dynamic source domains to dynamic target domains.

generative

Title: Learning Locally Editable Virtual Humans. (arXiv:2305.00121v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00121
Code URL: null
Copy Paste: [[2305.00121] Learning Locally Editable Virtual Humans](http://arxiv.org/abs/2305.00121) #generative
Summary:
In this paper, we propose a novel hybrid representation and end-to-end trainable network architecture to model fully editable and customizable neural avatars. At the core of our work lies a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes. To this end, we construct a trainable feature codebook to store local geometry and texture features on the vertices of a deformable body model, thus exploiting its consistent topology under articulation. This representation is then employed in a generative auto-decoder architecture that admits fitting to unseen scans and sampling of realistic avatars with varied appearances and geometries. Furthermore, our representation allows local editing by swapping local features between 3D assets. To verify our method for avatar creation and editing, we contribute a new high-quality dataset, dubbed CustomHumans, for training and evaluation. Our experiments quantitatively and qualitatively show that our method generates diverse detailed avatars and achieves better model fitting performance compared to state-of-the-art methods. Our code and dataset are available at https://custom-humans.github.io/.

Title: LD-GAN: Low-Dimensional Generative Adversarial Network for Spectral Image Generation with Variance Regularization. (arXiv:2305.00132v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00132
Code URL: https://github.com/hdspgroup/LD-GAN
Copy Paste: [[2305.00132] LD-GAN: Low-Dimensional Generative Adversarial Network for Spectral Image Generation with Variance Regularization](http://arxiv.org/abs/2305.00132) #generative
Summary:
Deep learning methods are state-of-the-art for spectral image (SI) computational tasks. However, these methods are constrained in their performance since available datasets are limited due to the highly expensive and long acquisition time. Usually, data augmentation techniques are employed to mitigate the lack of data. Surpassing classical augmentation methods, such as geometric transformations, GANs enable diverse augmentation by learning and sampling from the data distribution. Nevertheless, GAN-based SI generation is challenging since the high dimensionality of this kind of data hinders the convergence of GAN training, yielding suboptimal generation. To surmount this limitation, we propose low-dimensional GAN (LD-GAN), where we train the GAN employing a low-dimensional representation of the dataset with the latent space of a pretrained autoencoder network. Thus, we generate new low-dimensional samples which are then mapped to the SI dimension with the pretrained decoder network. Besides, we propose a statistical regularization to control the low-dimensional representation variance for the autoencoder training and to achieve high diversity of samples generated with the GAN. We validate our method LD-GAN as a data augmentation strategy for compressive spectral imaging, SI super-resolution, and RGB-to-spectral tasks, with improvements varying from 0.5 to 1 [dB] in each task respectively. We perform comparisons against training without data augmentation, traditional DA, and the same GAN adjusted and trained to generate the full-sized SIs. The code of this paper can be found in https://github.com/romanjacome99/LD_GAN.git

Title: Identity-driven Three-Player Generative Adversarial Network for Synthetic-based Face Recognition. (arXiv:2305.00358v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00358
Code URL: null
Copy Paste: [[2305.00358] Identity-driven Three-Player Generative Adversarial Network for Synthetic-based Face Recognition](http://arxiv.org/abs/2305.00358) #generative
Summary:
Many of the commonly used datasets for face recognition development are collected from the internet without proper user consent. Due to the increasing focus on privacy in the social and legal frameworks, the use and distribution of these datasets are being restricted and strongly questioned. These databases, which have a realistically high variability of data per identity, have enabled the success of face recognition models. To build on this success and to align with privacy concerns, synthetic databases, consisting purely of synthetic persons, are increasingly being created and used in the development of face recognition solutions. In this work, we present a three-player generative adversarial network (GAN) framework, namely IDnet, that enables the integration of identity information into the generation process. The third player in our IDnet aims at forcing the generator to learn to generate identity-separable face images. We empirically proved that our IDnet synthetic images are of higher identity discrimination in comparison to the conventional two-player GAN, while maintaining a realistic intra-identity variation. We further studied the identity link between the authentic identities used to train the generator and the generated synthetic identities, showing very low similarities between these identities. We demonstrated the applicability of our IDnet data in training face recognition models by evaluating these models on a wide set of face recognition benchmarks. In comparison to the state-of-the-art works in synthetic-based face recognition, our solution achieved comparable results to a recent rendering-based approach and outperformed all existing GAN-based approaches. The training code and the synthetic face image dataset are publicly available ( https://github.com/fdbtrs/Synthetic-Face-Recognition ).

Title: SLSG: Industrial Image Anomaly Detection by Learning Better Feature Embeddings and One-Class Classification. (arXiv:2305.00398v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00398
Code URL: null
Copy Paste: [[2305.00398] SLSG: Industrial Image Anomaly Detection by Learning Better Feature Embeddings and One-Class Classification](http://arxiv.org/abs/2305.00398) #generative
Summary:
Industrial image anomaly detection under the setting of one-class classification has significant practical value. However, most existing models struggle to extract separable feature representations when performing feature embedding and struggle to build compact descriptions of normal features when performing one-class classification. One direct consequence of this is that most models perform poorly in detecting logical anomalies which violate contextual relationships. Focusing on more effective and comprehensive anomaly detection, we propose a network based on self-supervised learning and self-attentive graph convolution (SLSG) for anomaly detection. SLSG uses a generative pre-training network to assist the encoder in learning the embedding of normal patterns and the reasoning of position relationships. Subsequently, SLSG introduces the pseudo-prior knowledge of anomaly through simulated abnormal samples. By comparing the simulated anomalies, SLSG can better summarize the normal features and narrow down the hypersphere used for one-class classification. In addition, with the construction of a more general graph structure, SLSG comprehensively models the dense and sparse relationships among elements in the image, which further strengthens the detection of logical anomalies. Extensive experiments on benchmark datasets show that SLSG achieves superior anomaly detection performance, demonstrating the effectiveness of our method.

Title: StyleLipSync: Style-based Personalized Lip-sync Video Generation. (arXiv:2305.00521v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00521
Code URL: null
Copy Paste: [[2305.00521] StyleLipSync: Style-based Personalized Lip-sync Video Generation](http://arxiv.org/abs/2305.00521) #generative
Summary:
In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing video from arbitrary audio. To generate a video of arbitrary identities, we leverage expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design a video consistency with a linear transformation. In contrast to the previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve the naturalness over frames by utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing the person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and enhance characteristics of an unseen face using a few seconds of target video through the proposed adaptation method. Please refer to our project page.

Title: StyleGenes: Discrete and Efficient Latent Distributions for GANs. (arXiv:2305.00599v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00599
Code URL: null
Copy Paste: [[2305.00599] StyleGenes: Discrete and Efficient Latent Distributions for GANs](http://arxiv.org/abs/2305.00599) #generative
Summary:
We propose a discrete latent distribution for Generative Adversarial Networks (GANs). Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents. However, a direct parametrization of such a distribution leads to an intractable linear increase in memory in order to ensure sufficient sample diversity. We address this key issue by taking inspiration from the encoding of information in biological organisms. Instead of learning a separate latent vector for each sample, we split the latent space into a set of genes. For each gene, we train a small bank of gene variants. Thus, by independently sampling a variant for each gene and combining them into the final latent vector, our approach can represent a vast number of unique latent samples from a compact set of learnable parameters. Interestingly, our gene-inspired latent encoding allows for new and intuitive approaches to latent-space exploration, enabling conditional sampling from our unconditionally trained model. Moreover, our approach preserves state-of-the-art photo-realism while achieving better disentanglement than the widely-used StyleMapping network.
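
The gene-and-variant sampling described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the bank here is a fixed random array rather than learned parameters, and the names `gene_bank` and `sample_latents` and all sizes are hypothetical.

```python
import numpy as np

def sample_latents(gene_bank, batch_size, rng):
    """Pick one variant per gene independently and concatenate them."""
    num_genes, num_variants, gene_dim = gene_bank.shape
    # idx[b, g] is the variant chosen for gene g in sample b.
    idx = rng.integers(0, num_variants, size=(batch_size, num_genes))
    # chosen[b, g] = gene_bank[g, idx[b, g]]
    chosen = gene_bank[np.arange(num_genes), idx]
    return chosen.reshape(batch_size, num_genes * gene_dim), idx

rng = np.random.default_rng(0)
num_genes, num_variants, gene_dim = 8, 16, 4       # illustrative sizes
gene_bank = rng.normal(size=(num_genes, num_variants, gene_dim))

z, idx = sample_latents(gene_bank, batch_size=5, rng=rng)
num_parameters = gene_bank.size                    # 8 * 16 * 4 = 512 stored values
num_distinct_latents = num_variants ** num_genes   # 16 ** 8 = 4294967296 combinations
```

With these sizes, 512 stored values index more than four billion distinct latent vectors, which is the compactness argument the abstract makes against learning one latent per sample.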

Title: Boosting Weakly-Supervised Temporal Action Localization with Text Information. (arXiv:2305.00607v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00607
Code URL: https://github.com/lgzlilili/boosting-wtal
Copy Paste: [[2305.00607] Boosting Weakly-Supervised Temporal Action Localization with Text Information](http://arxiv.org/abs/2305.00607) #generative
Summary:
Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods generally suffer from over-complete or incomplete localization. In this paper, we aim to leverage text information to boost WTAL from two aspects, i.e., (a) the discriminative objective to enlarge the inter-class difference, thus reducing over-complete localization; (b) the generative objective to enhance the intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments. Without the temporal annotation of actions, TSM compares the text query with the entire videos across the dataset to mine the best matching segments while ignoring irrelevant ones. Because different categories of videos share sub-actions, applying TSM alone is so strict that it neglects semantically related segments, which results in incomplete localization. We further introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments from videos to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin. The code is available at https://github.com/lgzlIlIlI/Boosting-WTAL.

Title: ShipHullGAN: A generic parametric modeller for ship hull design using deep convolutional generative model. (arXiv:2305.00210v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00210
Code URL: null
Copy Paste: [[2305.00210] ShipHullGAN: A generic parametric modeller for ship hull design using deep convolutional generative model](http://arxiv.org/abs/2305.00210) #generative
Summary:
In this work, we introduce ShipHullGAN, a generic parametric modeller built using deep convolutional generative adversarial networks (GANs) for the versatile representation and generation of ship hulls. At a high level, the new model intends to address the current conservatism in the parametric ship design paradigm, where parametric modellers can only handle a particular ship type. We trained ShipHullGAN on a large dataset of 52,591 physically validated designs from a wide range of existing ship types, including container ships, tankers, bulk carriers, tugboats, and crew supply vessels. We developed a new shape extraction and representation strategy to convert all training designs into a common geometric representation of the same resolution, as typically GANs can only accept vectors of fixed dimension as input. A space-filling layer is placed right after the generator component to ensure that the trained generator can cover all design classes. During training, designs are provided in the form of a shape-signature tensor (SST) which harnesses the compact geometric representation using geometric moments that further enable the inexpensive incorporation of physics-informed elements in ship design. We have shown through extensive comparative studies and optimisation cases that ShipHullGAN can generate designs with augmented features resulting in versatile design spaces that produce traditional and novel designs with geometrically valid and practically feasible shapes.

large language model

Title: An Iterative Algorithm for Rescaled Hyperbolic Functions Regression. (arXiv:2305.00660v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2305.00660
Code URL: null
Copy Paste: [[2305.00660] An Iterative Algorithm for Rescaled Hyperbolic Functions Regression](http://arxiv.org/abs/2305.00660) #large language model
Summary:
Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come.

Exponential function based attention unit is a fundamental element in LLMs. Several previous works have studied the convergence of exponential regression and softmax regression.

The exponential regression [Li, Song, Zhou 2023] and softmax regression [Deng, Li, Song 2023] can be formulated as follows. Given matrix $A \in \mathbb{R}^{n \times d}$ and vector $b \in \mathbb{R}^n$, the goal of exponential regression is to solve \begin{align*} \min_{x} \| \exp(Ax) - b \|_2 \end{align*} and the goal of softmax regression is to solve \begin{align*} \min_{x} \| \langle \exp(Ax) , {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2 . \end{align*}

In this work, we define a slightly different formulation than softmax regression. \begin{align*} \min_{x \in \mathbb{R}^d } \| u(x) - \langle u(x) , {\bf 1}_n \rangle \cdot b \|_2 \end{align*} where $u(x) \in \{ \exp(Ax), \cosh(Ax) , \sinh(Ax) \}$. We provide an input sparsity time algorithm for this problem. Our algorithm framework is very general and can be applied to functions like $\cosh()$ and $\sinh()$ as well. Our technique is also general enough to be applied to in-context learning for rescaled softmax regression.
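
The two objectives above can be compared numerically. The following is a minimal NumPy sketch that only evaluates the losses as written in the abstract; it is not the paper's input-sparsity-time algorithm, and the function names and random test data are illustrative assumptions.

```python
import numpy as np

def softmax_regression_loss(A, x, b):
    """|| <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2"""
    u = np.exp(A @ x)
    return np.linalg.norm(u / u.sum() - b)

def rescaled_loss(A, x, b, kind="exp"):
    """|| u(x) - <u(x), 1_n> * b ||_2 with u in {exp, cosh, sinh}."""
    u_fn = {"exp": np.exp, "cosh": np.cosh, "sinh": np.sinh}[kind]
    u = u_fn(A @ x)
    return np.linalg.norm(u - u.sum() * b)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
x = rng.normal(size=3)
b = rng.random(5)

# For u = exp, the rescaled loss is the softmax loss multiplied by the
# normalizer <exp(Ax), 1_n>, which is always positive.
normalizer = np.exp(A @ x).sum()
lhs = rescaled_loss(A, x, b, "exp")
rhs = normalizer * softmax_regression_loss(A, x, b)
```

For the exponential case the two losses therefore vanish at the same points; the rescaled form avoids dividing by the normalizer, and for $\sinh$ the quantity $\langle u(x), {\bf 1}_n \rangle$ is no longer guaranteed to be positive.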

segmentation

Title: SAM on Medical Images: A Comprehensive Study on Three Prompt Modes. (arXiv:2305.00035v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00035
Code URL: null
Copy Paste: [[2305.00035] SAM on Medical Images: A Comprehensive Study on Three Prompt Modes](http://arxiv.org/abs/2305.00035) #segmentation
Summary:
The Segment Anything Model (SAM) made an eye-catching debut recently and inspired many researchers to explore its potential and limitations in terms of zero-shot generalization capability. As the first promptable foundation model for segmentation tasks, it was trained on a large dataset with an unprecedented number of images and annotations. This large-scale dataset and its promptable nature endow the model with strong zero-shot generalization. Although SAM has shown competitive performance on several datasets, we still want to investigate its zero-shot generalization on medical images. As we know, the acquisition of medical image annotation usually requires a lot of effort from professional practitioners. Therefore, if there exists a foundation model that can give high-quality mask prediction simply based on a few point prompts, this model will undoubtedly become a game changer for medical image analysis. To evaluate whether SAM has the potential to become the foundation model for medical image segmentation tasks, we collected more than 12 public medical image datasets that cover various organs and modalities. We also explore what kind of prompt can lead to the best zero-shot performance with different modalities. Furthermore, we find that perturbing the box size significantly changes the prediction accuracy. Finally, extensive experiments show that the predicted mask quality varies considerably among datasets, and that providing proper prompts, such as bounding boxes, to SAM significantly increases its performance.

Title: DSEC-MOS: Segment Any Moving Object with Moving Ego Vehicle. (arXiv:2305.00126v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00126
Code URL: null
Copy Paste: [[2305.00126] DSEC-MOS: Segment Any Moving Object with Moving Ego Vehicle](http://arxiv.org/abs/2305.00126) #segmentation
Summary:
Moving Object Segmentation (MOS), a crucial task in computer vision, has numerous applications such as surveillance, autonomous driving, and video analytics. Existing datasets for moving object segmentation mainly focus on RGB or Lidar videos, but lack additional event information that can enhance the understanding of dynamic scenes. To address this limitation, we propose a novel dataset, called DSEC-MOS. Our dataset includes frames captured by RGB cameras embedded on moving vehicles and incorporates event data, which provide high temporal resolution and low-latency information about changes in the scenes. To generate accurate segmentation mask annotations for moving objects, we apply the recently emerged large model SAM - Segment Anything Model - with moving object bounding boxes from DSEC-MOD serving as prompts and calibrated RGB frames, then further revise the results. Our DSEC-MOS dataset contains in total 16 sequences (13314 images). To the best of our knowledge, DSEC-MOS is also the first moving object segmentation dataset that includes event camera data in autonomous driving. Project Page: https://github.com/ZZY-Zhou/DSEC-MOS.

Title: Regularizing Self-training for Unsupervised Domain Adaptation via Structural Constraints. (arXiv:2305.00131v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00131
Code URL: null
Copy Paste: [[2305.00131] Regularizing Self-training for Unsupervised Domain Adaptation via Structural Constraints](http://arxiv.org/abs/2305.00131) #segmentation
Summary:
Self-training based on pseudo-labels has emerged as a dominant approach for addressing conditional distribution shifts in unsupervised domain adaptation (UDA) for semantic segmentation problems. A notable drawback, however, is that this family of approaches is susceptible to erroneous pseudo labels that arise from confirmation biases in the source domain and that manifest as nuisance factors in the target domain. A possible source for this mismatch is the reliance on only photometric cues provided by RGB image inputs, which may ultimately lead to sub-optimal adaptation. To mitigate the effect of mismatched pseudo-labels, we propose to incorporate structural cues from auxiliary modalities, such as depth, to regularise conventional self-training objectives. Specifically, we introduce a contrastive pixel-level objectness constraint that pulls the pixel representations within a region of an object instance closer, while pushing those from different object categories apart. To obtain object regions consistent with the true underlying object, we extract information from both depth maps and RGB-images in the form of multimodal clustering. Crucially, the objectness constraint is agnostic to the ground-truth semantic labels and, hence, appropriate for unsupervised domain adaptation. In this work, we show that our regularizer significantly improves top performing self-training methods (by up to $2$ points) in various UDA benchmarks for semantic segmentation. We include all code in the supplementary.

Title: A Critical Analysis of the Limitation of Deep Learning based 3D Dental Mesh Segmentation Methods in Segmenting Partial Scans. (arXiv:2305.00244v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00244
Code URL: null
Copy Paste: [[2305.00244] A Critical Analysis of the Limitation of Deep Learning based 3D Dental Mesh Segmentation Methods in Segmenting Partial Scans](http://arxiv.org/abs/2305.00244) #segmentation
Summary:
Tooth segmentation from intraoral scans is a crucial part of digital dentistry. Many deep learning based tooth segmentation algorithms have been developed for this task. In most cases, high accuracy has been achieved, although most of the available tooth segmentation techniques make an implicit restrictive assumption of a full jaw model and report accuracy based on full jaw models. Medically, however, in certain cases, a full jaw tooth scan is not required or may not be available. Given this practical issue, it is important to understand the robustness of currently available, widely used deep learning based tooth segmentation techniques. For this purpose, we applied available segmentation techniques to partial intraoral scans and discovered that the available deep learning techniques under-perform drastically. The analysis and comparison presented in this work would help us understand the severity of the problem and allow us to develop robust tooth segmentation techniques without the strong assumption of a full jaw model.

Title: Segment Anything Model (SAM) Meets Glass: Mirror and Transparent Objects Cannot Be Easily Detected. (arXiv:2305.00278v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00278
Code URL: null
Copy Paste: [[2305.00278] Segment Anything Model (SAM) Meets Glass: Mirror and Transparent Objects Cannot Be Easily Detected](http://arxiv.org/abs/2305.00278) #segmentation
Summary:
Meta AI Research has recently released SAM (Segment Anything Model), which is trained on a large segmentation dataset of over 1 billion masks. As a foundation model in the field of computer vision, SAM has gained attention for its impressive performance in generic object segmentation. Despite its strong capability in a wide range of zero-shot transfer tasks, it remains unknown whether SAM can detect things in challenging setups like transparent objects. In this work, we perform an empirical evaluation of two glass-related challenging scenarios: mirror and transparent objects. We found that SAM often fails to detect the glass in both scenarios, which raises concern for deploying SAM in safety-critical situations that involve various forms of glass.

Title: Optimized Machine Learning for CHD Detection using 3D CNN-based Segmentation, Transfer Learning and Adagrad Optimization. (arXiv:2305.00411v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00411
Code URL: null
Copy Paste: [[2305.00411] Optimized Machine Learning for CHD Detection using 3D CNN-based Segmentation, Transfer Learning and Adagrad Optimization](http://arxiv.org/abs/2305.00411) #segmentation
Summary:
Globally, Coronary Heart Disease (CHD) is one of the main causes of death. Early detection of CHD can improve patient outcomes and reduce mortality rates. We propose a novel framework for predicting the presence of CHD using a combination of machine learning and image processing techniques. The framework comprises several phases: data analysis, feature selection using ReliefF, 3D CNN-based segmentation, feature extraction by means of transfer learning, feature fusion, classification, and Adagrad optimization. The first step of the proposed framework involves analyzing the data to identify patterns and correlations that may be indicative of CHD. Next, ReliefF feature selection is applied to select the most relevant features from the sample images. The 3D CNN-based segmentation technique is then used to segment the optic disc and macula, which are important regions for CHD diagnosis. Transfer learning is used to extract features from the segmented regions of interest. The extracted features are then fused using a feature fusion technique, and a classifier is trained to predict the presence of CHD. Finally, Adagrad optimization is used to optimize the performance of the classifier. Our framework is evaluated on a dataset of sample images collected from patients with and without CHD. The results show that the proposed framework achieves high accuracy in predicting the presence of CHD, with a reasonable degree of accuracy compared to previously employed classifiers such as SVM.

Title: Synthetic Data-based Detection of Zebras in Drone Imagery. (arXiv:2305.00432v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00432
Code URL: null
Copy Paste: [[2305.00432] Synthetic Data-based Detection of Zebras in Drone Imagery](http://arxiv.org/abs/2305.00432) #segmentation
Summary:
Datasets that allow the training of common object or human detectors are widely available. These come in the form of labelled real-world images and require either a significant amount of human effort, with a high probability of errors such as missing labels, or very constrained scenarios, e.g. VICON systems. In contrast, data for uncommon scenarios such as aerial views, animals such as wild zebras, or difficult-to-obtain information such as human shapes are hardly available. To overcome this, synthetic data generation with realistic rendering technologies has recently gained traction and advanced tasks like target tracking and human pose estimation. However, subjects such as wild animals are still usually not well represented in such datasets. In this work, we first show that a pre-trained YOLO detector cannot identify zebras in real images recorded from aerial viewpoints. To solve this, we present an approach for training an animal detector using only synthetic data. We start by generating a novel synthetic zebra dataset using GRADE, a state-of-the-art framework for data generation. The dataset includes RGB, depth, skeletal joint locations, pose, shape and instance segmentations for each subject. We use this to train a YOLO detector from scratch. Through extensive evaluations of our model with real-world data from i) limited datasets available on the internet and ii) a new one collected and manually labelled by us, we show that we can detect zebras by using only synthetic data during training. The code, results, trained models, and both the generated and training data are provided as open-source at https://keeper.mpdl.mpg.de/d/12abb3bb6b12491480d5/.

Title: PRSeg: A Lightweight Patch Rotate MLP Decoder for Semantic Segmentation. (arXiv:2305.00671v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00671
Code URL: null
Copy Paste: [[2305.00671] PRSeg: A Lightweight Patch Rotate MLP Decoder for Semantic Segmentation](http://arxiv.org/abs/2305.00671) #segmentation
Summary:
The lightweight MLP-based decoder has become increasingly promising for semantic segmentation. However, the channel-wise MLP cannot expand the receptive field and therefore lacks the context modeling capacity that is critical to semantic segmentation. In this paper, we propose a parameter-free patch rotate operation to reorganize the pixels spatially. It first divides the feature map into multiple groups and then rotates the patches within each group. Based on the proposed patch rotate operation, we design a novel segmentation network, named PRSeg, which includes an off-the-shelf backbone and a lightweight Patch Rotate MLP decoder containing multiple Dynamic Patch Rotate Blocks (DPR-Blocks). In each DPR-Block, a fully connected layer is applied after a Patch Rotate Module (PRM) to exchange spatial information between pixels. Specifically, in PRM, the feature map is first split into a reserved part and a rotated part along the channel dimension according to the probability predicted by the Dynamic Channel Selection Module (DCSM), and our proposed patch rotate operation is only performed on the rotated part. Extensive experiments on the ADE20K, Cityscapes and COCO-Stuff 10K datasets demonstrate the effectiveness of our approach. We expect that PRSeg can promote the development of MLP-based decoders in semantic segmentation.
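
One possible reading of the patch rotate operation can be sketched in NumPy. The abstract does not specify the group layout or rotation rule, so everything below is an assumption for illustration: the function name `patch_rotate`, groups made of a 2x2 grid of square patches, and a 90-degree rotation of patch positions inside each group (the pixels inside a patch keep their orientation).

```python
import numpy as np

def patch_rotate(feature_map, patch=2, k=1):
    """Rotate patch positions inside each 2x2 group of patches (illustrative sketch).

    feature_map: (C, H, W) array; H and W must be multiples of 2 * patch.
    patch:       side length of a square patch (hypothetical parameter).
    k:           number of 90-degree counterclockwise rotations of the patch grid.
    """
    channels, height, width = feature_map.shape
    group = 2 * patch
    if height % group or width % group:
        raise ValueError("height and width must be multiples of 2 * patch")
    gh, gw = height // group, width // group

    # Axes: (C, group row, patch row in group, pixel row,
    #           group col, patch col in group, pixel col).
    grid = feature_map.reshape(channels, gh, 2, patch, gw, 2, patch)
    # Bring the 2x2 patch grid of each group next to each other: (C, gh, gw, 2, 2, p, p).
    grid = grid.transpose(0, 1, 4, 2, 5, 3, 6)
    # Rotate the positions of the four patches; no learnable parameters are involved.
    grid = np.rot90(grid, k=k, axes=(3, 4))
    # Undo the axis reordering and restore the (C, H, W) layout.
    grid = grid.transpose(0, 1, 3, 5, 2, 4, 6)
    return grid.reshape(channels, height, width)
```

Under these assumptions the operation is a pure permutation of pixels, so a fully connected layer applied afterwards at a fixed position sees values that originally came from a different spatial location.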

Title: Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation. (arXiv:2305.00673v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00673
Code URL: https://github.com/deepmed-lab-ecnu/bcp
Copy Paste: [[2305.00673] Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation](http://arxiv.org/abs/2305.00673) #segmentation
Summary:
In semi-supervised medical image segmentation, there exist empirical mismatch problems between the labeled and unlabeled data distributions. The knowledge learned from the labeled data may be largely discarded if labeled and unlabeled data are treated separately or in an inconsistent manner. We propose a straightforward method for alleviating the problem: copy-pasting labeled and unlabeled data bidirectionally, in a simple Mean Teacher architecture. The method encourages unlabeled data to learn comprehensive common semantics from the labeled data in both inward and outward directions. More importantly, the consistent learning procedure for labeled and unlabeled data can largely reduce the empirical distribution gap. In detail, we copy-paste a random crop from a labeled image (foreground) onto an unlabeled image (background) and an unlabeled image (foreground) onto a labeled image (background), respectively. The two mixed images are fed into a Student network and supervised by the mixed supervisory signals of pseudo-labels and ground-truth. We reveal that the simple mechanism of copy-pasting bidirectionally between labeled and unlabeled data is good enough, and the experiments show solid gains (e.g., over 21% Dice improvement on the ACDC dataset with 5% labeled data) compared with other state-of-the-art methods on various semi-supervised medical image segmentation datasets. Code is available at https://github.com/DeepMed-Lab-ECNU/BCP.
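
The bidirectional mixing step described above can be sketched with a binary mask in NumPy. This is a simplified illustration, not the code from the linked repository: the helper names `random_box_mask` and `bidirectional_copy_paste`, the single rectangular crop, and the `ratio` parameter are assumptions, and the pseudo-label is taken as given (in the paper it would come from the teacher network).

```python
import numpy as np

def random_box_mask(height, width, ratio, rng):
    """Boolean mask that is True inside one randomly placed rectangle."""
    box_h, box_w = int(height * ratio), int(width * ratio)
    top = rng.integers(0, height - box_h + 1)
    left = rng.integers(0, width - box_w + 1)
    mask = np.zeros((height, width), dtype=bool)
    mask[top:top + box_h, left:left + box_w] = True
    return mask

def bidirectional_copy_paste(labeled_img, labeled_gt, unlabeled_img, pseudo_label, mask):
    """Mix a labeled and an unlabeled sample in both directions.

    Returns two (image, target) pairs:
    1. labeled crop (foreground) pasted onto the unlabeled image (background);
    2. unlabeled crop (foreground) pasted onto the labeled image (background).
    Targets are mixed with the same mask, combining ground truth and pseudo-labels.
    """
    inward = (np.where(mask, labeled_img, unlabeled_img),
              np.where(mask, labeled_gt, pseudo_label))
    outward = (np.where(mask, unlabeled_img, labeled_img),
               np.where(mask, pseudo_label, labeled_gt))
    return inward, outward
```

Both mixed images would then be passed to the student network, with each pixel supervised by whichever source (ground truth or pseudo-label) the mask assigns to it.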

Title: Rethinking Boundary Detection in Deep Learning Models for Medical Image Segmentation. (arXiv:2305.00678v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2305.00678
Code URL: null
Copy Paste: [[2305.00678] Rethinking Boundary Detection in Deep Learning Models for Medical Image Segmentation](http://arxiv.org/abs/2305.00678) #segmentation
Summary:
Medical image segmentation is a fundamental task in the medical image analysis community. In this paper, a novel network architecture, referred to as Convolution, Transformer, and Operator (CTO), is proposed. CTO employs a combination of Convolutional Neural Networks (CNNs), Vision Transformer (ViT), and an explicit boundary detection operator to achieve high recognition accuracy while maintaining an optimal balance between accuracy and efficiency. The proposed CTO follows the standard encoder-decoder segmentation paradigm, where the encoder network incorporates a popular CNN backbone for capturing local semantic information, and a lightweight ViT assistant for integrating long-range dependencies. To enhance the learning capacity on boundaries, a boundary-guided decoder network is proposed that uses a boundary mask obtained from a dedicated boundary detection operator as explicit supervision to guide the decoding learning process. The performance of the proposed method is evaluated on six challenging medical image segmentation datasets, demonstrating that CTO achieves state-of-the-art accuracy with a competitive model complexity.