secure

Title: Blockchain-based Federated Learning with Secure Aggregation in Trusted Execution Environment for Internet-of-Things. (arXiv:2304.12889v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12889
Code URL: null
Copy Paste: [[2304.12889] Blockchain-based Federated Learning with Secure Aggregation in Trusted Execution Environment for Internet-of-Things](http://arxiv.org/abs/2304.12889) #secure
Summary:
This paper proposes a blockchain-based Federated Learning (FL) framework with Intel Software Guard Extension (SGX)-based Trusted Execution Environment (TEE) to securely aggregate local models in Industrial Internet-of-Things (IIoTs). In FL, local models can be tampered with by attackers. Hence, a global model generated from the tampered local models can be erroneous. Therefore, the proposed framework leverages a blockchain network for secure model aggregation. Each blockchain node hosts an SGX-enabled processor that securely performs the FL-based aggregation tasks to generate a global model. Blockchain nodes can verify the authenticity of the aggregated model, run a blockchain consensus mechanism to ensure the integrity of the model, and add it to the distributed ledger for tamper-proof storage. Each cluster can obtain the aggregated model from the blockchain and verify its integrity before using it. We conducted several experiments with different CNN models and datasets to evaluate the performance of the proposed framework.

security

Title: SPDH-Sign: towards Efficient, Post-quantum Group-based Signatures. (arXiv:2304.12900v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12900
Code URL: null
Copy Paste: [[2304.12900] SPDH-Sign: towards Efficient, Post-quantum Group-based Signatures](http://arxiv.org/abs/2304.12900) #security
Summary:
In this paper, we present a new diverse class of post-quantum group-based Digital Signature Schemes (DSS). The approach is significantly different from previous examples of group-based digital signatures and adopts the framework of group action-based cryptography: we show that each finite group defines a group action relative to the semidirect product of the group by its automorphism group, and give security bounds on the resulting signature scheme in terms of the group-theoretic computational problem known as the Semidirect Discrete Logarithm Problem (SDLP). Crucially, we make progress towards being able to efficiently compute the novel group action, and give an example of a parameterised family of groups for which the group action can be computed for any parameters, thereby negating the need for expensive offline computation or inclusion of redundancy required in other schemes of this type.

Title: Rubik's Optical Neural Networks: Multi-task Learning with Physics-aware Rotation Architecture. (arXiv:2304.12985v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12985
Code URL: null
Copy Paste: [[2304.12985] Rubik's Optical Neural Networks: Multi-task Learning with Physics-aware Rotation Architecture](http://arxiv.org/abs/2304.12985) #security
Summary:
Recently, there are increasing efforts on advancing optical neural networks (ONNs), which bring significant advantages for machine learning (ML) in terms of power efficiency, parallelism, and computational speed. With the considerable benefits in computation speed and energy efficiency, there are significant interests in leveraging ONNs into medical sensing, security screening, drug detection, and autonomous driving. However, due to the challenge of implementing reconfigurability, deploying multi-task learning (MTL) algorithms on ONNs requires re-building and duplicating the physical diffractive systems, which significantly degrades the energy and cost efficiency in practical application scenarios. This work presents a novel ONNs architecture, namely, \textit{RubikONNs}, which utilizes the physical properties of optical systems to encode multiple feed-forward functions by physically rotating the hardware similarly to rotating a \textit{Rubik's Cube}. To optimize MTL performance on RubikONNs, two domain-specific physics-aware training algorithms \textit{RotAgg} and \textit{RotSeq} are proposed. Our experimental results demonstrate more than 4$\times$ improvements in energy and cost efficiency with marginal accuracy degradation compared to the state-of-the-art approaches.

privacy

Title: Differential Privacy via Distributionally Robust Optimization. (arXiv:2304.12681v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12681
Code URL: https://github.com/selvi-aras/dp-via-dro
Copy Paste: [[2304.12681] Differential Privacy via Distributionally Robust Optimization](http://arxiv.org/abs/2304.12681) #privacy
Summary:
In recent years, differential privacy has emerged as the de facto standard for sharing statistics of datasets while limiting the disclosure of private information about the involved individuals. This is achieved by randomly perturbing the statistics to be published, which in turn leads to a privacy-accuracy trade-off: larger perturbations provide stronger privacy guarantees, but they result in less accurate statistics that offer lower utility to the recipients. Of particular interest are therefore optimal mechanisms that provide the highest accuracy for a pre-selected level of privacy. To date, work in this area has focused on specifying families of perturbations a priori and subsequently proving their asymptotic and/or best-in-class optimality. In this paper, we develop a class of mechanisms that enjoy non-asymptotic and unconditional optimality guarantees. To this end, we formulate the mechanism design problem as an infinite-dimensional distributionally robust optimization problem. We show that the problem affords a strong dual, and we exploit this duality to develop converging hierarchies of finite-dimensional upper and lower bounding problems. Our upper (primal) bounds correspond to implementable perturbations whose suboptimality can be bounded by our lower (dual) bounds. Both bounding problems can be solved within seconds via cutting plane techniques that exploit the inherent problem structure. Our numerical experiments demonstrate that our perturbations can outperform the previously best results from the literature on artificial as well as standard benchmark problems.

Title: (Local) Differential Privacy has NO Disparate Impact on Fairness. (arXiv:2304.12845v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12845
Code URL: https://github.com/hharcolezi/ldp-fairness-impact
Copy Paste: [[2304.12845] (Local) Differential Privacy has NO Disparate Impact on Fairness](http://arxiv.org/abs/2304.12845) #privacy
Summary:
In recent years, Local Differential Privacy (LDP), a robust privacy-preserving methodology, has gained widespread adoption in real-world applications. With LDP, users can perturb their data on their devices before sending it out for analysis. However, as the collection of multiple sensitive information becomes more prevalent across various industries, collecting a single sensitive attribute under LDP may not be sufficient. Correlated attributes in the data may still lead to inferences about the sensitive attribute. This paper empirically studies the impact of collecting multiple sensitive attributes under LDP on fairness. We propose a novel privacy budget allocation scheme that considers the varying domain size of sensitive attributes. This generally led to a better privacy-utility-fairness trade-off in our experiments than the state-of-art solution. Our results show that LDP leads to slightly improved fairness in learning problems without significantly affecting the performance of the models. We conduct extensive experiments evaluating three benchmark datasets using several group fairness metrics and seven state-of-the-art LDP protocols. Overall, this study challenges the common belief that differential privacy necessarily leads to worsened fairness in machine learning.

protect

defense

Title: Autonomous Intelligent Cyber-defense Agent: Introduction and Overview. (arXiv:2304.12408v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12408
Code URL: null
Copy Paste: [[2304.12408] Autonomous Intelligent Cyber-defense Agent: Introduction and Overview](http://arxiv.org/abs/2304.12408) #defense
Summary:
This chapter introduces the concept of Autonomous Intelligent Cyber-defense Agents (AICAs), and briefly explains the importance of this field and the motivation for its emergence. AICA is a software agent that resides on a system, and is responsible for defending the system from cyber compromises and enabling the response and recovery of the system, usually autonomously. The autonomy of the agent is a necessity because of the growing scarcity of human cyber-experts who could defend systems, either remotely or onsite, and because sophisticated malware could degrade or spoof the communications of a system that uses a remote monitoring center. An AICA Reference Architecture has been proposed and defines five main functions: (1) sensing and world state identification, (2) planning and action selection, (3) collaboration and negotiation, (4) action execution and (5) learning and knowledge improvement. The chapter reviews the details of AICA's environment, functions and operations. As AICA is intended to make changes within its environment, there is a risk that an agent's action could harm a friendly computer. This risk must be balanced against the losses that could occur if the agent does not act. The chapter discusses means by which this risk can be managed and how AICA's design features could help build trust among its users.

attack

Title: Beyond the Prior Forgery Knowledge: Mining Critical Clues for General Face Forgery Detection. (arXiv:2304.12489v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12489
Code URL: null
Copy Paste: [[2304.12489] Beyond the Prior Forgery Knowledge: Mining Critical Clues for General Face Forgery Detection](http://arxiv.org/abs/2304.12489) #attack
Summary:
Face forgery detection is essential in combating malicious digital face attacks. Previous methods mainly rely on prior expert knowledge to capture specific forgery clues, such as noise patterns, blending boundaries, and frequency artifacts. However, these methods tend to get trapped in local optima, resulting in limited robustness and generalization capability. To address these issues, we propose a novel Critical Forgery Mining (CFM) framework, which can be flexibly assembled with various backbones to boost their generalization and robustness performance. Specifically, we first build a fine-grained triplet and suppress specific forgery traces through prior knowledge-agnostic data augmentation. Subsequently, we propose a fine-grained relation learning prototype to mine critical information in forgeries through instance and local similarity-aware losses. Moreover, we design a novel progressive learning controller to guide the model to focus on principal feature components, enabling it to learn critical forgery features in a coarse-to-fine manner. The proposed method achieves state-of-the-art forgery detection performance under various challenging evaluation settings.

Title: Flickr-PAD: New Face High-Resolution Presentation Attack Detection Database. (arXiv:2304.13015v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13015
Code URL: https://github.com/jedota/flickr-pad
Copy Paste: [[2304.13015] Flickr-PAD: New Face High-Resolution Presentation Attack Detection Database](http://arxiv.org/abs/2304.13015) #attack
Summary:
Nowadays, Presentation Attack Detection is a very active research area. Several databases are constituted in the state-of-the-art using images extracted from videos. One of the main problems identified is that many databases present a low-quality, small image size and do not represent an operational scenario in a real remote biometric system. Currently, these images are captured from smartphones with high-quality and bigger resolutions. In order to increase the diversity of image quality, this work presents a new PAD database based on open-access Flickr images called: "Flickr-PAD". Our new hand-made database shows high-quality printed and screen scenarios. This will help researchers to compare new approaches to existing algorithms on a wider database. This database will be available for other researchers. A leave-one-out protocol was used to train and evaluate three PAD models based on MobileNet-V3 (small and large) and EfficientNet-B0. The best result was reached with MobileNet-V3 large with BPCER10 of 7.08% and BPCER20 of 11.15%.

Title: Face Feature Visualisation of Single Morphing Attack Detection. (arXiv:2304.13021v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13021
Code URL: null
Copy Paste: [[2304.13021] Face Feature Visualisation of Single Morphing Attack Detection](http://arxiv.org/abs/2304.13021) #attack
Summary:
This paper proposes an explainable visualisation of different face feature extraction algorithms that enable the detection of bona fide and morphing images for single morphing attack detection. The feature extraction is based on raw image, shape, texture, frequency and compression. This visualisation may help to develop a Graphical User Interface for border policies and specifically for border guard personnel that have to investigate details of suspect images. A Random forest classifier was trained in a leave-one-out protocol on three landmarks-based face morphing methods and a StyleGAN-based morphing method for which morphed images are available in the FRLL database. For morphing attack detection, the Discrete Cosine-Transformation-based method obtained the best results for synthetic images and BSIF for landmark-based image features.

Title: Improving Robustness Against Adversarial Attacks with Deeply Quantized Neural Networks. (arXiv:2304.12829v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12829
Code URL: null
Copy Paste: [[2304.12829] Improving Robustness Against Adversarial Attacks with Deeply Quantized Neural Networks](http://arxiv.org/abs/2304.12829) #attack
Summary:
Reducing the memory footprint of Machine Learning (ML) models, particularly Deep Neural Networks (DNNs), is essential to enable their deployment into resource-constrained tiny devices. However, a disadvantage of DNN models is their vulnerability to adversarial attacks, as they can be fooled by adding slight perturbations to the inputs. Therefore, the challenge is how to create accurate, robust, and tiny DNN models deployable on resource-constrained embedded devices. This paper reports the results of devising a tiny DNN model, robust to adversarial black and white box attacks, trained with an automatic quantizationaware training framework, i.e. QKeras, with deep quantization loss accounted in the learning loop, thereby making the designed DNNs more accurate for deployment on tiny devices. We investigated how QKeras and an adversarial robustness technique, Jacobian Regularization (JR), can provide a co-optimization strategy by exploiting the DNN topology and the per layer JR approach to produce robust yet tiny deeply quantized DNN models. As a result, a new DNN model implementing this cooptimization strategy was conceived, developed and tested on three datasets containing both images and audio inputs, as well as compared its performance with existing benchmarks against various white-box and black-box attacks. Experimental results demonstrated that on average our proposed DNN model resulted in 8.3% and 79.5% higher accuracy than MLCommons/Tiny benchmarks in the presence of white-box and black-box attacks on the CIFAR-10 image dataset and a subset of the Google Speech Commands audio dataset respectively. It was also 6.5% more accurate for black-box attacks on the SVHN image dataset.

Title: Evaluation of Parameter-based Attacks against Embedded Neural Networks with Laser Injection. (arXiv:2304.12876v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12876
Code URL: null
Copy Paste: [[2304.12876] Evaluation of Parameter-based Attacks against Embedded Neural Networks with Laser Injection](http://arxiv.org/abs/2304.12876) #attack
Summary:
Upcoming certification actions related to the security of machine learning (ML) based systems raise major evaluation challenges that are amplified by the large-scale deployment of models in many hardware platforms. Until recently, most of research works focused on API-based attacks that consider a ML model as a pure algorithmic abstraction. However, new implementation-based threats have been revealed, emphasizing the urgency to propose both practical and simulation-based methods to properly evaluate the robustness of models. A major concern is parameter-based attacks (such as the Bit-Flip Attack, BFA) that highlight the lack of robustness of typical deep neural network models when confronted by accurate and optimal alterations of their internal parameters stored in memory. Setting in a security testing purpose, this work practically reports, for the first time, a successful variant of the BFA on a 32-bit Cortex-M microcontroller using laser fault injection. It is a standard fault injection means for security evaluation, that enables to inject spatially and temporally accurate faults. To avoid unrealistic brute-force strategies, we show how simulations help selecting the most sensitive set of bits from the parameters taking into account the laser fault model.

robust

Title: Evaluating Adversarial Robustness on Document Image Classification. (arXiv:2304.12486v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12486
Code URL: null
Copy Paste: [[2304.12486] Evaluating Adversarial Robustness on Document Image Classification](http://arxiv.org/abs/2304.12486) #robust
Summary:
Adversarial attacks and defenses have gained increasing interest on computer vision systems in recent years, but as of today, most investigations are limited to images. However, many artificial intelligence models actually handle documentary data, which is very different from real world images. Hence, in this work, we try to apply the adversarial attack philosophy on documentary and natural data and to protect models against such attacks. We focus our work on untargeted gradient-based, transfer-based and score-based attacks and evaluate the impact of adversarial training, JPEG input compression and grey-scale input transformation on the robustness of ResNet50 and EfficientNetB0 model architectures. To the best of our knowledge, no such work has been conducted by the community in order to study the impact of these attacks on the document image classification task.

Title: ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds. (arXiv:2304.12589v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12589
Code URL: null
Copy Paste: [[2304.12589] ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds](http://arxiv.org/abs/2304.12589) #robust
Summary:
In this paper, we propose a novel self-supervised motion estimator for LiDAR-based autonomous driving via BEV representation. Different from usually adopted self-supervised strategies for data-level structure consistency, we predict scene motion via feature-level consistency between pillars in consecutive frames, which can eliminate the effect caused by noise points and view-changing point clouds in dynamic scenes. Specifically, we propose \textit{Soft Discriminative Loss} that provides the network with more pseudo-supervised signals to learn discriminative and robust features in a contrastive learning manner. We also propose \textit{Gated Multi-frame Fusion} block that learns valid compensation between point cloud frames automatically to enhance feature extraction. Finally, \textit{pillar association} is proposed to predict pillar correspondence probabilities based on feature distance, and whereby further predicts scene motion. Extensive experiments show the effectiveness and superiority of our \textbf{ContrastMotion} on both scene flow and motion prediction tasks. The code is available soon.

Title: Shape-Net: Room Layout Estimation from Panoramic Images Robust to Occlusion using Knowledge Distillation with 3D Shapes as Additional Inputs. (arXiv:2304.12624v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12624
Code URL: null
Copy Paste: [[2304.12624] Shape-Net: Room Layout Estimation from Panoramic Images Robust to Occlusion using Knowledge Distillation with 3D Shapes as Additional Inputs](http://arxiv.org/abs/2304.12624) #robust
Summary:
Estimating the layout of a room from a single-shot panoramic image is important in virtual/augmented reality and furniture layout simulation. This involves identifying three-dimensional (3D) geometry, such as the location of corners and boundaries, and performing 3D reconstruction. However, occlusion is a common issue that can negatively impact room layout estimation, and this has not been thoroughly studied to date. It is possible to obtain 3D shape information of rooms as drawings of buildings and coordinates of corners from image datasets, thus we propose providing both 2D panoramic and 3D information to a model to effectively deal with occlusion. However, simply feeding 3D information to a model is not sufficient to utilize the shape information for an occluded area. Therefore, we improve the model by introducing 3D Intersection over Union (IoU) loss to effectively use 3D information. In some cases, drawings are not available or the construction deviates from a drawing. Considering such practical cases, we propose a method for distilling knowledge from a model trained with both images and 3D information to a model that takes only images as input. The proposed model, which is called Shape-Net, achieves state-of-the-art (SOTA) performance on benchmark datasets. We also confirmed its effectiveness in dealing with occlusion through significantly improved accuracy on images with occlusion compared with existing models.

Title: Docmarking: Real-Time Screen-Cam Robust Document Image Watermarking. (arXiv:2304.12682v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12682
Code URL: null
Copy Paste: [[2304.12682] Docmarking: Real-Time Screen-Cam Robust Document Image Watermarking](http://arxiv.org/abs/2304.12682) #robust
Summary:
This paper focuses on investigation of confidential documents leaks in the form of screen photographs. Proposed approach does not try to prevent leak in the first place but rather aims to determine source of the leak. Method works by applying on the screen a unique identifying watermark as semi-transparent image that is almost imperceptible for human eyes. Watermark image is static and stays on the screen all the time thus watermark present on every captured photograph of the screen. The key components of the approach are three neural networks. The first network generates an image with embedded message in a way that this image is almost invisible when displayed on the screen. The other two neural networks are used to retrieve embedded message with high accuracy. Developed method was comprehensively tested on different screen and cameras. Test results showed high efficiency of the proposed approach.

Title: Learning Robust Deep Equilibrium Models. (arXiv:2304.12707v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12707
Code URL: null
Copy Paste: [[2304.12707] Learning Robust Deep Equilibrium Models](http://arxiv.org/abs/2304.12707) #robust
Summary:
Deep equilibrium (DEQ) models have emerged as a promising class of implicit layer models in deep learning, which abandon traditional depth by solving for the fixed points of a single nonlinear layer. Despite their success, the stability of the fixed points for these models remains poorly understood. Recently, Lyapunov theory has been applied to Neural ODEs, another type of implicit layer model, to confer adversarial robustness. By considering DEQ models as nonlinear dynamic systems, we propose a robust DEQ model named LyaDEQ with guaranteed provable stability via Lyapunov theory. The crux of our method is ensuring the fixed points of the DEQ models are Lyapunov stable, which enables the LyaDEQ models to resist the minor initial perturbations. To avoid poor adversarial defense due to Lyapunov-stable fixed points being located near each other, we add an orthogonal fully connected layer after the Lyapunov stability module to separate different fixed points. We evaluate LyaDEQ models on several widely used datasets under well-known adversarial attacks, and experimental results demonstrate significant improvement in robustness. Furthermore, we show that the LyaDEQ model can be combined with other defense methods, such as adversarial training, to achieve even better adversarial robustness.

Title: Depth-Relative Self Attention for Monocular Depth Estimation. (arXiv:2304.12849v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12849
Code URL: null
Copy Paste: [[2304.12849] Depth-Relative Self Attention for Monocular Depth Estimation](http://arxiv.org/abs/2304.12849) #robust
Summary:
Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome the limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are overly exploited, the network can be biased on RGB information without considering the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, the features of similar depth can become more likely to each other and thus less prone to misused visual hints. We show that the proposed model achieves competitive results in monocular depth estimation benchmarks and is less biased to RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model for unseen depths.

Title: DQS3D: Densely-matched Quantization-aware Semi-supervised 3D Detection. (arXiv:2304.13031v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13031
Code URL: https://github.com/air-discover/dqs3d
Copy Paste: [[2304.13031] DQS3D: Densely-matched Quantization-aware Semi-supervised 3D Detection](http://arxiv.org/abs/2304.13031) #robust
Summary:
In this paper, we study the problem of semi-supervised 3D object detection, which is of great importance considering the high annotation cost for cluttered 3D indoor scenes. We resort to the robust and principled framework of selfteaching, which has triggered notable progress for semisupervised learning recently. While this paradigm is natural for image-level or pixel-level prediction, adapting it to the detection problem is challenged by the issue of proposal matching. Prior methods are based upon two-stage pipelines, matching heuristically selected proposals generated in the first stage and resulting in spatially sparse training signals. In contrast, we propose the first semisupervised 3D detection algorithm that works in the singlestage manner and allows spatially dense training signals. A fundamental issue of this new design is the quantization error caused by point-to-voxel discretization, which inevitably leads to misalignment between two transformed views in the voxel domain. To this end, we derive and implement closed-form rules that compensate this misalignment onthe-fly. Our results are significant, e.g., promoting ScanNet mAP@0.5 from 35.2% to 48.5% using 20% annotation. Codes and data will be publicly available.

Title: USTEP: Structuration des logs en flux gr{\^a}ce {`a} un arbre de recherche {\'e}volutif. (arXiv:2304.12331v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12331
Code URL: null
Copy Paste: [[2304.12331] USTEP: Structuration des logs en flux gr{\^a}ce {\a} un arbre de recherche {\'e}volutif](http://arxiv.org/abs/2304.12331) #robust`
Summary:
Logs record valuable system information at runtime. They are widely used by data-driven approaches for development and monitoring purposes. Parsing log messages to structure their format is a classic preliminary step for log-mining tasks. As they appear upstream, parsing operations can become a processing time bottleneck for downstream applications. The quality of parsing also has a direct influence on their efficiency. Here, we propose USTEP, an online log parsing method based on an evolving tree structure. Evaluation results on a wide panel of datasets coming from different real-world systems demonstrate USTEP superiority in terms of both effectiveness and robustness when compared to other online methods.

Title: Understanding and Predicting Human Label Variation in Natural Language Inference through Explanation. (arXiv:2304.12443v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12443
Code URL: null
Copy Paste: [[2304.12443] Understanding and Predicting Human Label Variation in Natural Language Inference through Explanation](http://arxiv.org/abs/2304.12443) #robust
Summary:
Human label variation (Plank 2022), or annotation disagreement, exists in many natural language processing (NLP) tasks. To be robust and trusted, NLP models need to identify such variation and be able to explain it. To this end, we created the first ecologically valid explanation dataset with diverse reasoning, LiveNLI. LiveNLI contains annotators' highlights and free-text explanations for the label(s) of their choice for 122 English Natural Language Inference items, each with at least 10 annotations. We used its explanations for chain-of-thought prompting, and found there is still room for improvement in GPT-3's ability to predict label distribution with in-context learning.

Title: Test-Time Adaptation with Perturbation Consistency Learning. (arXiv:2304.12764v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12764
Code URL: null
Copy Paste: [[2304.12764] Test-Time Adaptation with Perturbation Consistency Learning](http://arxiv.org/abs/2304.12764) #robust
Summary:
Currently, pre-trained language models (PLMs) do not cope well with the distribution shift problem, resulting in models trained on the training set failing in real test scenarios. To address this problem, the test-time adaptation (TTA) shows great potential, which updates model parameters to suit the test data at the testing time. Existing TTA methods rely on well-designed auxiliary tasks or self-training strategies based on pseudo-label. However, these methods do not achieve good trade-offs regarding performance gains and computational costs. To obtain some insights into such a dilemma, we take two representative TTA methods, i.e., Tent and OIL, for exploration and find that stable prediction is the key to achieving a good balance. Accordingly, in this paper, we propose perturbation consistency learning (PCL), a simple test-time adaptation method to promote the model to make stable predictions for samples with distribution shifts. Extensive experiments on adversarial robustness and cross-lingual transferring demonstrate that our method can achieve higher or comparable performance with less inference time over strong PLM backbones and previous state-of-the-art TTA methods.

Title: AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. (arXiv:2304.12995v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12995
Code URL: https://github.com/aigc-audio/audiogpt
Copy Paste: [[2304.12995] AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head](http://arxiv.org/abs/2304.12995) #robust
Summary:
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at \url{https://github.com/AIGC-Audio/AudioGPT}.

Title: CIMLA: Interpretable AI for inference of differential causal networks. (arXiv:2304.12523v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12523
Code URL: null
Copy Paste: [[2304.12523] CIMLA: Interpretable AI for inference of differential causal networks](http://arxiv.org/abs/2304.12523) #robust
Summary:
The discovery of causal relationships from high-dimensional data is a major open problem in bioinformatics. Machine learning and feature attribution models have shown great promise in this context but lack causal interpretation. Here, we show that a popular feature attribution model estimates a causal quantity reflecting the influence of one variable on another, under certain assumptions. We leverage this insight to implement a new tool, CIMLA, for discovering condition-dependent changes in causal relationships. We then use CIMLA to identify differences in gene regulatory networks between biological conditions, a problem that has received great attention in recent years. Using extensive benchmarking on simulated data sets, we show that CIMLA is more robust to confounding variables and is more accurate than leading methods. Finally, we employ CIMLA to analyze a previously published single-cell RNA-seq data set collected from subjects with and without Alzheimer's disease (AD), discovering several potential regulators of AD.

Title: Combining Adversaries with Anti-adversaries in Training. (arXiv:2304.12550v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12550
Code URL: null
Copy Paste: [[2304.12550] Combining Adversaries with Anti-adversaries in Training](http://arxiv.org/abs/2304.12550) #robust
Summary:
Adversarial training is an effective learning technique to improve the robustness of deep neural networks. In this study, the influence of adversarial training on deep learning models in terms of fairness, robustness, and generalization is theoretically investigated under more general perturbation scope that different samples can have different perturbation directions (the adversarial and anti-adversarial directions) and varied perturbation bounds. Our theoretical explorations suggest that the combination of adversaries and anti-adversaries (samples with anti-adversarial perturbations) in training can be more effective in achieving better fairness between classes and a better tradeoff between robustness and generalization in some typical learning scenarios (e.g., noisy label learning and imbalance learning) compared with standard adversarial training. On the basis of our theoretical findings, a more general learning objective that combines adversaries and anti-adversaries with varied bounds on each training sample is presented. Meta learning is utilized to optimize the combination weights. Experiments on benchmark datasets under different learning scenarios verify our theoretical findings and the effectiveness of the proposed methodology.

Title: A Multi-Task Approach to Robust Deep Reinforcement Learning for Resource Allocation. (arXiv:2304.12660v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12660
Code URL: null
Copy Paste: [[2304.12660] A Multi-Task Approach to Robust Deep Reinforcement Learning for Resource Allocation](http://arxiv.org/abs/2304.12660) #robust
Summary:
With increasing complexity of modern communication systems, machine learning algorithms have become a focal point of research. However, performance demands have tightened in parallel to complexity. For some of the key applications targeted by future wireless, such as the medical field, strict and reliable performance guarantees are essential, but vanilla machine learning methods have been shown to struggle with these types of requirements. Therefore, the question is raised whether these methods can be extended to better deal with the demands imposed by such applications. In this paper, we look at a combinatorial resource allocation challenge with rare, significant events which must be handled properly. We propose to treat this as a multi-task learning problem, select two methods from this domain, Elastic Weight Consolidation and Gradient Episodic Memory, and integrate them into a vanilla actor-critic scheduler. We compare their performance in dealing with Black Swan Events with the state-of-the-art of augmenting the training data distribution and report that the multi-task approach proves highly effective.

Title: Decoupling Quantile Representations from Loss Functions. (arXiv:2304.12766v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12766
Code URL: null
Copy Paste: [[2304.12766] Decoupling Quantile Representations from Loss Functions](http://arxiv.org/abs/2304.12766) #robust
Summary:
The simultaneous quantile regression (SQR) technique has been used to estimate uncertainties for deep learning models, but its application is limited by the requirement that the solution at the median quantile ({\tau} = 0.5) must minimize the mean absolute error (MAE). In this article, we address this limitation by demonstrating a duality between quantiles and estimated probabilities in the case of simultaneous binary quantile regression (SBQR). This allows us to decouple the construction of quantile representations from the loss function, enabling us to assign an arbitrary classifier f(x) at the median quantile and generate the full spectrum of SBQR quantile representations at different {\tau} values. We validate our approach through two applications: (i) detecting out-of-distribution samples, where we show that quantile representations outperform standard probability outputs, and (ii) calibrating models, where we demonstrate the robustness of quantile representations to distortions. We conclude with a discussion of several hypotheses arising from these findings.

Title: Generating robust counterfactual explanations. (arXiv:2304.12943v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12943
Code URL: null
Copy Paste: [[2304.12943] Generating robust counterfactual explanations](http://arxiv.org/abs/2304.12943) #robust
Summary:
Counterfactual explanations have become a mainstay of the XAI field. This particularly intuitive statement allows the user to understand what small but necessary changes would have to be made to a given situation in order to change a model prediction. The quality of a counterfactual depends on several criteria: realism, actionability, validity, robustness, etc. In this paper, we are interested in the notion of robustness of a counterfactual. More precisely, we focus on robustness to counterfactual input changes. This form of robustness is particularly challenging as it involves a trade-off between the robustness of the counterfactual and the proximity with the example to explain. We propose a new framework, CROCO, that generates robust counterfactuals while managing effectively this trade-off, and guarantees the user a minimal robustness. An empirical evaluation on tabular datasets confirms the relevance and effectiveness of our approach.

Title: Certifying Ensembles: A General Certification Theory with S-Lipschitzness. (arXiv:2304.13019v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13019
Code URL: null
Copy Paste: [[2304.13019] Certifying Ensembles: A General Certification Theory with S-Lipschitzness](http://arxiv.org/abs/2304.13019) #robust
Summary:
Improving and guaranteeing the robustness of deep learning models has been a topic of intense research. Ensembling, which combines several classifiers to provide a better model, has shown to be beneficial for generalisation, uncertainty estimation, calibration, and mitigating the effects of concept drift. However, the impact of ensembling on certified robustness is less well understood. In this work, we generalise Lipschitz continuity by introducing S-Lipschitz classifiers, which we use to analyse the theoretical robustness of ensembles. Our results are precise conditions when ensembles of robust classifiers are more robust than any constituent classifier, as well as conditions when they are less robust.

biometric

steal

extraction

Title: TextMesh: Generation of Realistic 3D Meshes From Text Prompts. (arXiv:2304.12439v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12439
Code URL: null
Copy Paste: [[2304.12439] TextMesh: Generation of Realistic 3D Meshes From Text Prompts](http://arxiv.org/abs/2304.12439) #extraction
Summary:
The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises if this can be also achieved in the generation of 3D content from such text prompts. To this end, a new line of methods recently emerged trying to harness diffusion models, trained on 2D images, for supervision of 3D model generation using view dependent prompts. While achieving impressive results, these methods, however, have two major drawbacks. First, rather than commonly used 3D meshes, they instead generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish looking effect. Therefore, in this work we propose a novel method for generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh.

Title: DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents. (arXiv:2304.12484v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12484
Code URL: null
Copy Paste: [[2304.12484] DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents](http://arxiv.org/abs/2304.12484) #extraction
Summary:
Information Extraction from visually rich documents is a challenging task that has gained a lot of attention in recent years due to its importance in several document-control based applications and its widespread commercial value. The majority of the research work conducted on this topic to date follow a two-step pipeline. First, they read the text using an off-the-shelf Optical Character Recognition (OCR) engine, then, they extract the fields of interest from the obtained text. The main drawback of these approaches is their dependence on an external OCR system, which can negatively impact both performance and computational speed. Recent OCR-free methods were proposed to address the previous issues. Inspired by their promising results, we propose in this paper an OCR-free end-to-end information extraction model named DocParser. It differs from prior end-to-end approaches by its ability to better extract discriminative character features. DocParser achieves state-of-the-art results on various datasets, while still being faster than previous works.

Title: SwinFSR: Stereo Image Super-Resolution using SwinIR and Frequency Domain Knowledge. (arXiv:2304.12556v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12556
Code URL: null
Copy Paste: [[2304.12556] SwinFSR: Stereo Image Super-Resolution using SwinIR and Frequency Domain Knowledge](http://arxiv.org/abs/2304.12556) #extraction
Summary:
Stereo Image Super-Resolution (stereoSR) has attracted significant attention in recent years due to the extensive deployment of dual cameras in mobile phones, autonomous vehicles and robots. In this work, we propose a new StereoSR method, named SwinFSR, based on an extension of SwinIR, originally designed for single image restoration, and the frequency domain knowledge obtained by the Fast Fourier Convolution (FFC). Specifically, to effectively gather global information, we modify the Residual Swin Transformer blocks (RSTBs) in SwinIR by explicitly incorporating the frequency domain knowledge using the FFC and employing the resulting residual Swin Fourier Transformer blocks (RSFTBs) for feature extraction. Besides, for the efficient and accurate fusion of stereo views, we propose a new cross-attention module referred to as RCAM, which achieves highly competitive performance while requiring less computational cost than the state-of-the-art cross-attention modules. Extensive experimental results and ablation studies demonstrate the effectiveness and efficiency of our proposed SwinFSR.

Title: The Potential of Visual ChatGPT For Remote Sensing. (arXiv:2304.13009v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13009
Code URL: null
Copy Paste: [[2304.13009] The Potential of Visual ChatGPT For Remote Sensing](http://arxiv.org/abs/2304.13009) #extraction
Summary:
Recent advancements in Natural Language Processing (NLP), particularly in Large Language Models (LLMs), associated with deep learning-based computer vision techniques, have shown substantial potential for automating a variety of tasks. One notable model is Visual ChatGPT, which combines ChatGPT's LLM capabilities with visual computation to enable effective image analysis. The model's ability to process images based on textual inputs can revolutionize diverse fields. However, its application in the remote sensing domain remains unexplored. This is the first paper to examine the potential of Visual ChatGPT, a cutting-edge LLM founded on the GPT architecture, to tackle the aspects of image processing related to the remote sensing domain. Among its current capabilities, Visual ChatGPT can generate textual descriptions of images, perform canny edge and straight line detection, and conduct image segmentation. These offer valuable insights into image content and facilitate the interpretation and extraction of information. By exploring the applicability of these techniques within publicly available datasets of satellite images, we demonstrate the current model's limitations in dealing with remote sensing images, highlighting its challenges and future prospects. Although still in early development, we believe that the combination of LLMs and visual models holds a significant potential to transform remote sensing image processing, creating accessible and practical application opportunities in the field.

membership infer

federate

Title: Chameleon: Adapting to Peer Images for Planting Durable Backdoors in Federated Learning. (arXiv:2304.12961v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12961
Code URL: null
Copy Paste: [[2304.12961] Chameleon: Adapting to Peer Images for Planting Durable Backdoors in Federated Learning](http://arxiv.org/abs/2304.12961) #federate
Summary:
In a federated learning (FL) system, distributed clients upload their local models to a central server to aggregate into a global model. Malicious clients may plant backdoors into the global model through uploading poisoned local models, causing images with specific patterns to be misclassified into some target labels. Backdoors planted by current attacks are not durable, and vanish quickly once the attackers stop model poisoning. In this paper, we investigate the connection between the durability of FL backdoors and the relationships between benign images and poisoned images (i.e., the images whose labels are flipped to the target label during local training). Specifically, benign images with the original and the target labels of the poisoned images are found to have key effects on backdoor durability. Consequently, we propose a novel attack, Chameleon, which utilizes contrastive learning to further amplify such effects towards a more durable backdoor. Extensive experiments demonstrate that Chameleon significantly extends the backdoor lifespan over baselines by $1.2\times \sim 4\times$, for a wide range of image datasets, backdoor types, and model architectures.

Title: Mobilizing Personalized Federated Learning via Random Walk Stochastic ADMM. (arXiv:2304.12534v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12534
Code URL: null
Copy Paste: [[2304.12534] Mobilizing Personalized Federated Learning via Random Walk Stochastic ADMM](http://arxiv.org/abs/2304.12534) #federate
Summary:
In this research, we investigate the barriers associated with implementing Federated Learning (FL) in real-world scenarios, where a consistent connection between the central server and all clients cannot be maintained, and data distribution is heterogeneous. To address these challenges, we focus on mobilizing the federated setting, where the server moves between groups of adjacent clients to learn local models. Specifically, we propose a new algorithm, Random Walk Stochastic Alternating Direction Method of Multipliers (RWSADMM), capable of adapting to dynamic and ad-hoc network conditions as long as a sufficient number of connected clients are available for model training. In RWSADMM, the server walks randomly toward a group of clients. It formulates local proximity among adjacent clients based on hard inequality constraints instead of consensus updates to address data heterogeneity. Our proposed method is convergent, reduces communication costs, and enhances scalability by reducing the number of clients the central server needs to communicate with.

Title: User-Centric Federated Learning: Trading off Wireless Resources for Personalization. (arXiv:2304.12930v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12930
Code URL: null
Copy Paste: [[2304.12930] User-Centric Federated Learning: Trading off Wireless Resources for Personalization](http://arxiv.org/abs/2304.12930) #federate
Summary:
Statistical heterogeneity across clients in a Federated Learning (FL) system increases the algorithm convergence time and reduces the generalization performance, resulting in a large communication overhead in return for a poor model. To tackle the above problems without violating the privacy constraints that FL imposes, personalized FL methods have to couple statistically similar clients without directly accessing their data in order to guarantee a privacy-preserving transfer. In this work, we design user-centric aggregation rules at the parameter server (PS) that are based on readily available gradient information and are capable of producing personalized models for each FL client. The proposed aggregation rules are inspired by an upper bound of the weighted aggregate empirical risk minimizer. Secondly, we derive a communication-efficient variant based on user clustering which greatly enhances its applicability to communication-constrained systems. Our algorithm outperforms popular personalized FL baselines in terms of average accuracy, worst node performance, and training communication overhead.

fair

Title: Fairness and Bias in Truth Discovery Algorithms: An Experimental Analysis. (arXiv:2304.12573v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12573
Code URL: null
Copy Paste: [[2304.12573] Fairness and Bias in Truth Discovery Algorithms: An Experimental Analysis](http://arxiv.org/abs/2304.12573) #fair
Summary:
Machine learning (ML) based approaches are increasingly being used in a number of applications with societal impact. Training ML models often require vast amounts of labeled data, and crowdsourcing is a dominant paradigm for obtaining labels from multiple workers. Crowd workers may sometimes provide unreliable labels, and to address this, truth discovery (TD) algorithms such as majority voting are applied to determine the consensus labels from conflicting worker responses. However, it is important to note that these consensus labels may still be biased based on sensitive attributes such as gender, race, or political affiliation. Even when sensitive attributes are not involved, the labels can be biased due to different perspectives of subjective aspects such as toxicity. In this paper, we conduct a systematic study of the bias and fairness of TD algorithms. Our findings using two existing crowd-labeled datasets, reveal that a non-trivial proportion of workers provide biased results, and using simple approaches for TD is sub-optimal. Our study also demonstrates that popular TD algorithms are not a panacea. Additionally, we quantify the impact of these unfair workers on downstream ML tasks and show that conventional methods for achieving fairness and correcting label biases are ineffective in this setting. We end the paper with a plea for the design of novel bias-aware truth discovery algorithms that can ameliorate these issues.

interpretability

Title: Class Attention Transfer Based Knowledge Distillation. (arXiv:2304.12777v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12777
Code URL: https://github.com/gzyaftermath/cat-kd
Copy Paste: [[2304.12777] Class Attention Transfer Based Knowledge Distillation](http://arxiv.org/abs/2304.12777) #interpretability
Summary:
Previous knowledge distillation methods have shown their impressive performance on model compression tasks, however, it is hard to explain how the knowledge they transferred helps to improve the performance of the student network. In this work, we focus on proposing a knowledge distillation method that has both high interpretability and competitive performance. We first revisit the structure of mainstream CNN models and reveal that possessing the capacity of identifying class discriminative regions of input is critical for CNN to perform classification. Furthermore, we demonstrate that this capacity can be obtained and enhanced by transferring class activation maps. Based on our findings, we propose class attention transfer based knowledge distillation (CAT-KD). Different from previous KD methods, we explore and present several properties of the knowledge transferred by our method, which not only improve the interpretability of CAT-KD but also contribute to a better understanding of CNN. While having high interpretability, CAT-KD achieves state-of-the-art performance on multiple benchmarks. Code is available at: https://github.com/GzyAftermath/CAT-KD.

Title: What does BERT learn about prosody?. (arXiv:2304.12706v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12706
Code URL: null
Copy Paste: [[2304.12706] What does BERT learn about prosody?](http://arxiv.org/abs/2304.12706) #interpretability
Summary:
Language models have become nearly ubiquitous in natural language processing applications achieving state-of-the-art results in many tasks including prosody. As the model design does not define predetermined linguistic targets during training but rather aims at learning generalized representations of the language, analyzing and interpreting the representations that models implicitly capture is important in bridging the gap between interpretability and model performance. Several studies have explored the linguistic information that models capture providing some insights on their representational capacity. However, the current studies have not explored whether prosody is part of the structural information of the language that models learn. In this work, we perform a series of experiments on BERT probing the representations captured at different layers. Our results show that information about prosodic prominence spans across many layers but is mostly focused in middle layers suggesting that BERT relies mostly on syntactic and semantic information.

Title: Discovering Graph Generation Algorithms. (arXiv:2304.12895v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12895
Code URL: null
Copy Paste: [[2304.12895] Discovering Graph Generation Algorithms](http://arxiv.org/abs/2304.12895) #interpretability
Summary:
We provide a novel approach to construct generative models for graphs. Instead of using the traditional probabilistic models or deep generative models, we propose to instead find an algorithm that generates the data. We achieve this using evolutionary search and a powerful fitness function, implemented by a randomly initialized graph neural network. This brings certain advantages over current deep generative models, for instance, a higher potential for out-of-training-distribution generalization and direct interpretability, as the final graph generative process is expressed as a Python function. We show that this approach can be competitive with deep generative models and under some circumstances can even find the true graph generative process, and as such perfectly generalize.

explainability

watermark

diffusion

Title: RenderDiffusion: Text Generation as Image Generation. (arXiv:2304.12519v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12519
Code URL: null
Copy Paste: [[2304.12519] RenderDiffusion: Text Generation as Image Generation](http://arxiv.org/abs/2304.12519) #diffusion
Summary:
Diffusion models have become a new generative paradigm for text generation. Considering the discrete categorical nature of text, in this paper, we propose \textsc{RenderDiffusion}, a novel diffusion approach for text generation via text-guided image generation. Our key idea is to render the target text as a \emph{glyph image} containing visual language content. In this way, conditional text generation can be cast as a glyph image generation task, and it is then natural to apply continuous diffusion models to discrete texts. Specially, we utilize a cascaded architecture (\ie a base and a super-resolution diffusion model) to generate high-fidelity glyph images, conditioned on the input text. Furthermore, we design a text grounding module to transform and refine the visual language content from generated glyph images into the final texts. In experiments over four conditional text generation tasks and two classes of metrics (\ie quality and diversity), \textsc{RenderDiffusion} can achieve comparable or even better results than several baselines, including pretrained language models. Our model also makes significant improvements compared to the recent diffusion model.

Title: Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models. (arXiv:2304.12526v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12526
Code URL: null
Copy Paste: [[2304.12526] Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models](http://arxiv.org/abs/2304.12526) #diffusion
Summary:
Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, $e.g.$, as few as 5,000 images to train from scratch. We achieve state-of-the-art FID scores 1.77 on CelebA-64$\times$64 and 1.93 on AFHQv2-Wild-64$\times$64. We will share our code and pre-trained models soon.

Title: Exploring Compositional Visual Generation with Latent Classifier Guidance. (arXiv:2304.12536v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12536
Code URL: null
Copy Paste: [[2304.12536] Exploring Compositional Visual Generation with Latent Classifier Guidance](http://arxiv.org/abs/2304.12536) #diffusion
Summary:
Diffusion probabilistic models have achieved enormous success in the field of image generation and manipulation. In this paper, we explore a novel paradigm of using the diffusion model and classifier guidance in the latent semantic space for compositional visual tasks. linear fashion. Specifically, we train latent diffusion models and auxiliary latent classifiers to facilitate non-linear navigation of latent representation generation for any pre-trained generative model with a semantic latent space. We demonstrate that such conditional generation achieved by latent classifier guidance provably maximizes a lower bound of the conditional log probability during training. To maintain the original semantics during manipulation, we introduce a new guidance term, which we show is crucial for achieving compositionality. With additional assumptions, we show that the non-linear manipulation reduces to a simple latent arithmetic approach. We show that this paradigm based on latent classifier guidance is agnostic to pre-trained generative models, and present competitive results for both image generation and sequential manipulation of real and synthetic images. Our findings suggest that latent classifier guidance is a promising approach that merits further exploration, even in the presence of other strong competing methods.

Title: CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis. (arXiv:2304.12654v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12654
Code URL: null
Copy Paste: [[2304.12654] CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis](http://arxiv.org/abs/2304.12654) #diffusion
Summary:
With growing attention to tabular data these days, the attempt to apply a synthetic table to various tasks has been expanded toward various scenarios. Owing to the recent advances in generative modeling, fake data generated by tabular data synthesis models become sophisticated and realistic. However, there still exists a difficulty in modeling discrete variables (columns) of tabular data. In this work, we propose to process continuous and discrete variables separately (but being conditioned on each other) by two diffusion models. The two diffusion models are co-evolved during training by reading conditions from each other. In order to further bind the diffusion models, moreover, we introduce a contrastive learning method with a negative sampling method. In our experiments with 11 real-world tabular datasets and 8 baseline methods, we prove the efficacy of the proposed method, called CoDi.

Title: Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning. (arXiv:2304.12824v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12824
Code URL: null
Copy Paste: [[2304.12824] Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning](http://arxiv.org/abs/2304.12824) #diffusion
Summary:
Guided sampling is a vital approach for applying diffusion models in real-world tasks that embeds human-defined guidance during the sampling procedure. This paper considers a general setting where the guidance is defined by an (unnormalized) energy function. The main challenge for this setting is that the intermediate guidance during the diffusion sampling procedure, which is jointly defined by the sampling distribution and the energy function, is unknown and is hard to estimate. To address this challenge, we propose an exact formulation of the intermediate guidance as well as a novel training objective named contrastive energy prediction (CEP) to learn the exact guidance. Our method is guaranteed to converge to the exact guidance under unlimited model capacity and data samples, while previous methods can not. We demonstrate the effectiveness of our method by applying it to offline reinforcement learning (RL). Extensive experiments on D4RL benchmarks demonstrate that our method outperforms existing state-of-the-art algorithms. We also provide some examples of applying CEP for image synthesis to demonstrate the scalability of CEP on high-dimensional data.

noise learning

data-free

Title: Model Conversion via Differentially Private Data-Free Distillation. (arXiv:2304.12528v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12528
Code URL: null
Copy Paste: [[2304.12528] Model Conversion via Differentially Private Data-Free Distillation](http://arxiv.org/abs/2304.12528) #data-free
Summary:
While massive valuable deep models trained on large-scale data have been released to facilitate the artificial intelligence community, they may encounter attacks in deployment which leads to privacy leakage of training data. In this work, we propose a learning approach termed differentially private data-free distillation (DPDFD) for model conversion that can convert a pretrained model (teacher) into its privacy-preserving counterpart (student) via an intermediate generator without access to training data. The learning collaborates three parties in a unified way. First, massive synthetic data are generated with the generator. Then, they are fed into the teacher and student to compute differentially private gradients by normalizing the gradients and adding noise before performing descent. Finally, the student is updated with these differentially private gradients and the generator is updated by taking the student as a fixed discriminator in an alternate manner. In addition to a privacy-preserving student, the generator can generate synthetic data in a differentially private way for other downstream tasks. We theoretically prove that our approach can guarantee differential privacy and well convergence. Extensive experiments clearly demonstrate that our approach significantly outperform other differentially private generative approaches.

transformer

Title: Pointersect: Neural Rendering with Cloud-Ray Intersection. (arXiv:2304.12390v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12390
Code URL: null
Copy Paste: [[2304.12390] Pointersect: Neural Rendering with Cloud-Ray Intersection](http://arxiv.org/abs/2304.12390) #transformer
Summary:
We propose a novel method that renders point clouds as if they are surfaces. The proposed method is differentiable and requires no scene-specific optimization. This unique capability enables, out-of-the-box, surface normal estimation, rendering room-scale point clouds, inverse rendering, and ray tracing with global illumination. Unlike existing work that focuses on converting point clouds to other representations--e.g., surfaces or implicit functions--our key idea is to directly infer the intersection of a light ray with the underlying surface represented by the given point cloud. Specifically, we train a set transformer that, given a small number of local neighbor points along a light ray, provides the intersection point, the surface normal, and the material blending weights, which are used to render the outcome of this light ray. Localizing the problem into small neighborhoods enables us to train a model with only 48 meshes and apply it to unseen point clouds. Our model achieves higher estimation accuracy than state-of-the-art surface reconstruction and point-cloud rendering methods on three test sets. When applied to room-scale point clouds, without any scene-specific optimization, the model achieves competitive quality with the state-of-the-art novel-view rendering methods. Moreover, we demonstrate ability to render and manipulate Lidar-scanned point clouds such as lighting control and object insertion.

Title: Rank Flow Embedding for Unsupervised and Semi-Supervised Manifold Learning. (arXiv:2304.12448v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12448
Code URL: null
Copy Paste: [[2304.12448] Rank Flow Embedding for Unsupervised and Semi-Supervised Manifold Learning](http://arxiv.org/abs/2304.12448) #transformer
Summary:
Impressive advances in acquisition and sharing technologies have made the growth of multimedia collections and their applications almost unlimited. However, the opposite is true for the availability of labeled data, which is needed for supervised training, since such data is often expensive and time-consuming to obtain. While there is a pressing need for the development of effective retrieval and classification methods, the difficulties faced by supervised approaches highlight the relevance of methods capable of operating with few or no labeled data. In this work, we propose a novel manifold learning algorithm named Rank Flow Embedding (RFE) for unsupervised and semi-supervised scenarios. The proposed method is based on ideas recently exploited by manifold learning approaches, which include hypergraphs, Cartesian products, and connected components. The algorithm computes context-sensitive embeddings, which are refined following a rank-based processing flow, while complementary contextual information is incorporated. The generated embeddings can be exploited for more effective unsupervised retrieval or semi-supervised classification based on Graph Convolutional Networks. Experimental results were conducted on 10 different collections. Various features were considered, including the ones obtained with recent Convolutional Neural Networks (CNN) and Vision Transformer (ViT) models. High effective results demonstrate the effectiveness of the proposed method on different tasks: unsupervised image retrieval, semi-supervised classification, and person Re-ID. The results demonstrate that RFE is competitive or superior to the state-of-the-art in diverse evaluated scenarios.

Title: Recurrent Transformer Encoders for Vision-based Estimation of Fatigue and Engagement in Cognitive Training Sessions. (arXiv:2304.12470v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12470
Code URL: null
Copy Paste: [[2304.12470] Recurrent Transformer Encoders for Vision-based Estimation of Fatigue and Engagement in Cognitive Training Sessions](http://arxiv.org/abs/2304.12470) #transformer
Summary:
The effectiveness of computerized cognitive training in slowing cognitive decline and brain aging in dementia is often limited by the engagement of participants in the training. Monitoring older users' real-time engagement in domains of attention, motivation, and affect is crucial to understanding the overall effectiveness of such training. In this paper, we propose to predict engagement, quantified via an established mental fatigue measure assessing users' perceived attention, motivation, and affect throughout computerized cognitive training sessions, in older adults with mild cognitive impairment (MCI), by monitoring their real-time video-recorded facial gestures in training sessions. To achieve the goal, we used computer vision, analyzing video frames every 5 seconds to optimize the balance between information retention and data size, and developed a novel Recurrent Video Transformer (RVT). Our RVT model, which combines a clip-wise transformer encoder module and a session-wise Recurrent Neural Network (RNN) classifier, achieved the highest balanced accuracy, F1 score, and precision compared to other state-of-the-art models for both detecting mental fatigue/disengagement cases (binary classification) and rating the level of mental fatigue (multi-class classification). By leveraging dynamic temporal information, the RVT model demonstrates the potential to accurately predict engagement among computerized cognitive training users, which lays the foundation for future work to modulate the level of engagement in computerized cognitive training interventions. The code will be released.

Title: Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning. (arXiv:2304.12520v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12520
Code URL: null
Copy Paste: [[2304.12520] Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning](http://arxiv.org/abs/2304.12520) #transformer
Summary:
Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs' potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry nature. Common data augmentation techniques fall short in this context due to the limited features contained in the few-shot tuning data. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs themselves have already learned highly representative features from large-scale pretraining data, which are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViT in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) to detect over-confident patches of foundation ViTs for potentially alleviating their over-fitting on the few-shot tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse easy-to-confuse features from the pretrained FViTs with the over-confident patches detected by the above AOD in order to enhance the feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness: 0.04% ~ 32.91% higher accuracy over the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves a 2.22% higher accuracy with 50% less training data over SOTA data augmentation methods.

Title: Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders. (arXiv:2304.12535v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12535
Code URL: null
Copy Paste: [[2304.12535] Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders](http://arxiv.org/abs/2304.12535) #transformer
Summary:
We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features. To study which type of deep features is appropriate for MIM as a learning target, we propose a simple MIM framework with serials of well-trained self-supervised models to convert an Image to a feature Vector as the learning target of MIM, where the feature extractor is also known as a teacher model. Surprisingly, we empirically find that an MIM model benefits more from image features generated by some lighter models (e.g., ResNet-50, 26M) than from those by a cumbersome teacher like Transformer-based models (e.g., ViT-Large, 307M). To analyze this remarkable phenomenon, we devise a novel attribute, token diversity, to evaluate the characteristics of generated features from different models. Token diversity measures the feature dissimilarity among different tokens. Through extensive experiments and visualizations, we hypothesize that beyond the acknowledgment that a large model can improve MIM, a high token-diversity of a teacher model is also crucial. Based on the above discussion, Img2Vec adopts a teacher model with high token-diversity to generate image features. Img2Vec pre-trained on ImageNet unlabeled data with ViT-B yields 85.1\% top-1 accuracy on fine-tuning. Moreover, we scale up Img2Vec on larger models, ViT-L and ViT-H, and get $86.7\%$ and $87.5\%$ accuracy respectively. It also achieves state-of-the-art results on other downstream tasks, e.g., 51.8\% mAP on COCO and 50.7\% mIoU on ADE20K. Img2Vec is a simple yet effective framework tailored to deep feature MIM learning, accomplishing superb comprehensive performance on representative vision tasks.

Title: Detection of Pavement Cracks by Deep Learning Models of Transformer and UNet. (arXiv:2304.12596v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12596
Code URL: null
Copy Paste: [[2304.12596] Detection of Pavement Cracks by Deep Learning Models of Transformer and UNet](http://arxiv.org/abs/2304.12596) #transformer
Summary:
Fracture is one of the main failure modes of engineering structures such as buildings and roads. Effective detection of surface cracks is significant for damage evaluation and structure maintenance. In recent years, the emergence and development of deep learning techniques have shown great potential to facilitate surface crack detection. Currently, most reported tasks were performed by a convolutional neural network (CNN), while the limitation of CNN may be improved by the transformer architecture introduced recently. In this study, we investigated nine promising models to evaluate their performance in pavement surface crack detection by model accuracy, computational complexity, and model stability. We created 711 images of 224 by 224 pixels with crack labels, selected an optimal loss function, compared the evaluation metrics of the validation dataset and test dataset, analyzed the data details, and checked the segmentation outcomes of each model. We find that transformer-based models generally are easier to converge during the training process and have higher accuracy, but usually exhibit more memory consumption and low processing efficiency. Among nine models, SwinUNet outperforms the other two transformers and shows the highest accuracy among nine models. The results should shed light on surface crack detection by various deep-learning models and provide a guideline for future applications in this field.

Title: Local Implicit Ray Function for Generalizable Radiance Field Representation. (arXiv:2304.12746v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12746
Code URL: null
Copy Paste: [[2304.12746] Local Implicit Ray Function for Generalizable Radiance Field Representation](http://arxiv.org/abs/2304.12746) #transformer
Summary:
We propose LIRF (Local Implicit Ray Function), a generalizable neural rendering approach for novel view rendering. Current generalizable neural radiance fields (NeRF) methods sample a scene with a single ray per pixel and may therefore render blurred or aliased views when the input views and rendered views capture scene content with different resolutions. To solve this problem, we propose LIRF to aggregate the information from conical frustums to construct a ray. Given 3D positions within conical frustums, LIRF takes 3D coordinates and the features of conical frustums as inputs and predicts a local volumetric radiance field. Since the coordinates are continuous, LIRF renders high-quality novel views at a continuously-valued scale via volume rendering. Besides, we predict the visible weights for each input view via transformer-based feature matching to improve the performance in occluded areas. Experimental results on real-world scenes validate that our method outperforms state-of-the-art methods on novel view rendering of unseen scenes at arbitrary scales.

Title: CompletionFormer: Depth Completion with Convolutions and Vision Transformers. (arXiv:2304.13030v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13030
Code URL: https://github.com/youmi-zym/completionformer
Copy Paste: [[2304.13030] CompletionFormer: Depth Completion with Convolutions and Vision Transformers](http://arxiv.org/abs/2304.13030) #transformer
Summary:
Given sparse depths and the corresponding RGB images, depth completion aims at spatially propagating the sparse measurements throughout the whole image to get a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model the long-range relationship between pixels. While recent fully Transformer-based architecture has reported encouraging results with the global receptive field, the performance and efficiency gaps to the well-developed CNN models still exist because of its deteriorative local feature details. This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods. Code is available at \url{https://github.com/youmi-zym/CompletionFormer}.

Title: Extreme Classification for Answer Type Prediction in Question Answering. (arXiv:2304.12395v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12395
Code URL: null
Copy Paste: [[2304.12395] Extreme Classification for Answer Type Prediction in Question Answering](http://arxiv.org/abs/2304.12395) #transformer
Summary:
Semantic answer type prediction (SMART) is known to be a useful step towards effective question answering (QA) systems. The SMART task involves predicting the top-$k$ knowledge graph (KG) types for a given natural language question. This is challenging due to the large number of types in KGs. In this paper, we propose use of extreme multi-label classification using Transformer models (XBERT) by clustering KG types using structural and semantic features based on question text. We specifically improve the clustering stage of the XBERT pipeline using textual and structural features derived from KGs. We show that these features can improve end-to-end performance for the SMART task, and yield state-of-the-art results.

Title: KINLP at SemEval-2023 Task 12: Kinyarwanda Tweet Sentiment Analysis. (arXiv:2304.12569v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12569
Code URL: null
Copy Paste: [[2304.12569] KINLP at SemEval-2023 Task 12: Kinyarwanda Tweet Sentiment Analysis](http://arxiv.org/abs/2304.12569) #transformer
Summary:
This paper describes the system entered by the author to the SemEval-2023 Task 12: Sentiment analysis for African languages. The system focuses on the Kinyarwanda language and uses a language-specific model. Kinyarwanda morphology is modeled in a two tier transformer architecture and the transformer model is pre-trained on a large text corpus using multi-task masked morphology prediction. The model is deployed on an experimental platform that allows users to experiment with the pre-trained language model fine-tuning without the need to write machine learning code. Our final submission to the shared task achieves second ranking out of 34 teams in the competition, achieving 72.50% weighted F1 score. Our analysis of the evaluation results highlights challenges in achieving high accuracy on the task and identifies areas for improvement.

Title: State Spaces Aren't Enough: Machine Translation Needs Attention. (arXiv:2304.12776v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12776
Code URL: null
Copy Paste: [[2304.12776] State Spaces Aren't Enough: Machine Translation Needs Attention](http://arxiv.org/abs/2304.12776) #transformer
Summary:
Structured State Spaces for Sequences (S4) is a recently proposed sequence model with successful applications in various tasks, e.g. vision, language modeling, and audio. Thanks to its mathematical formulation, it compresses its input to a single hidden state, and is able to capture long range dependencies while avoiding the need for an attention mechanism. In this work, we apply S4 to Machine Translation (MT), and evaluate several encoder-decoder variants on WMT'14 and WMT'16. In contrast with the success in language modeling, we find that S4 lags behind the Transformer by approximately 4 BLEU points, and that it counter-intuitively struggles with long sentences. Finally, we show that this gap is caused by S4's inability to summarize the full source sentence in a single hidden state, and show that we can close the gap by introducing an attention mechanism.

Title: NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset. (arXiv:2304.12847v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12847
Code URL: null
Copy Paste: [[2304.12847] NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset](http://arxiv.org/abs/2304.12847) #transformer
Summary:
In this paper, we propose a methodology for task 10 of SemEval23, focusing on detecting and classifying online sexism in social media posts. The task is tackling a serious issue, as detecting harmful content on social media platforms is crucial for mitigating the harm of these posts on users. Our solution for this task is based on an ensemble of fine-tuned transformer-based models (BERTweet, RoBERTa, and DeBERTa). To alleviate problems related to class imbalance, and to improve the generalization capability of our model, we also experiment with data augmentation and semi-supervised learning. In particular, for data augmentation, we use back-translation, either on all classes, or on the underrepresented classes only. We analyze the impact of these strategies on the overall performance of the pipeline through extensive experiments. while for semi-supervised learning, we found that with a substantial amount of unlabelled, in-domain data available, semi-supervised learning can enhance the performance of certain models. Our proposed method (for which the source code is available on Github attains an F1-score of 0.8613 for sub-taskA, which ranked us 10th in the competition

Title: Nondeterministic Stacks in Neural Networks. (arXiv:2304.12955v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12955
Code URL: null
Copy Paste: [[2304.12955] Nondeterministic Stacks in Neural Networks](http://arxiv.org/abs/2304.12955) #transformer
Summary:
Human language is full of compositional syntactic structures, and although neural networks have contributed to groundbreaking improvements in computer systems that process language, widely-used neural network architectures still exhibit limitations in their ability to process syntax. To address this issue, prior work has proposed adding stack data structures to neural networks, drawing inspiration from theoretical connections between syntax and stacks. However, these methods employ deterministic stacks that are designed to track one parse at a time, whereas syntactic ambiguity, which requires a nondeterministic stack to parse, is extremely common in language. In this dissertation, we remedy this discrepancy by proposing a method of incorporating nondeterministic stacks into neural networks. We develop a differentiable data structure that efficiently simulates a nondeterministic pushdown automaton, representing an exponential number of computations with a dynamic programming algorithm. We incorporate this module into two predominant architectures: recurrent neural networks (RNNs) and transformers. We show that this raises their formal recognition power to arbitrary context-free languages, and also aids training, even on deterministic context-free languages. Empirically, neural networks with nondeterministic stacks learn context-free languages much more effectively than prior stack-augmented models, including a language with theoretically maximal parsing difficulty. We also show that an RNN augmented with a nondeterminsitic stack is capable of surprisingly powerful behavior, such as learning cross-serial dependencies, a well-known non-context-free pattern. We demonstrate improvements on natural language modeling and provide analysis on a syntactic generalization benchmark. This work represents an important step toward building systems that learn to use syntax in more human-like fashion.

Title: Escaping the sentence-level paradigm in machine translation. (arXiv:2304.12959v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12959
Code URL: null
Copy Paste: [[2304.12959] Escaping the sentence-level paradigm in machine translation](http://arxiv.org/abs/2304.12959) #transformer
Summary:
It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation -- both research and production -- largely remains stuck in a decades-old sentence-level translation paradigm. It is also an increasingly glaring problem in light of competitive pressure from large language models, which are natively document-based. Much work in document-context machine translation exists, but for various reasons has been unable to catch hold. This paper suggests a path out of this rut by addressing three impediments at once: what architectures should we use? where do we get document-level information for training them? and how do we know whether they are any good? In contrast to work on specialized architectures, we show that the standard Transformer architecture is sufficient, provided it has enough capacity. Next, we address the training data issue by taking document samples from back-translated data only, where the data is not only more readily available, but is also of higher quality compared to parallel document data, which may contain machine translation output. Finally, we propose generative variants of existing contrastive metrics that are better able to discriminate among document systems. Results in four large-data language pairs (DE$\rightarrow$EN, EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) establish the success of these three pieces together in improving document-level performance.

Title: DuETT: Dual Event Time Transformer for Electronic Health Records. (arXiv:2304.13017v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.13017
Code URL: https://github.com/layer6ai-labs/duett
Copy Paste: [[2304.13017] DuETT: Dual Event Time Transformer for Electronic Health Records](http://arxiv.org/abs/2304.13017) #transformer
Summary:
Electronic health records (EHRs) recorded in hospital settings typically contain a wide range of numeric time series data that is characterized by high sparsity and irregular observations. Effective modelling for such data must exploit its time series nature, the semantic relationship between different types of observations, and information in the sparsity structure of the data. Self-supervised Transformers have shown outstanding performance in a variety of structured tasks in NLP and computer vision. But multivariate time series data contains structured relationships over two dimensions: time and recorded event type, and straightforward applications of Transformers to time series data do not leverage this distinct structure. The quadratic scaling of self-attention layers can also significantly limit the input sequence length without appropriate input engineering. We introduce the DuETT architecture, an extension of Transformers designed to attend over both time and event type dimensions, yielding robust representations from EHR data. DuETT uses an aggregated input where sparse time series are transformed into a regular sequence with fixed length; this lowers the computational complexity relative to previous EHR Transformer models and, more importantly, enables the use of larger and deeper neural networks. When trained with self-supervised prediction tasks, that provide rich and informative signals for model pre-training, our model outperforms state-of-the-art deep learning models on multiple downstream tasks from the MIMIC-IV and PhysioNet-2012 EHR datasets.

generative

Title: Unsupervised Style-based Explicit 3D Face Reconstruction from Single Image. (arXiv:2304.12455v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12455
Code URL: null
Copy Paste: [[2304.12455] Unsupervised Style-based Explicit 3D Face Reconstruction from Single Image](http://arxiv.org/abs/2304.12455) #generative
Summary:
Inferring 3D object structures from a single image is an ill-posed task due to depth ambiguity and occlusion. Typical resolutions in the literature include leveraging 2D or 3D ground truth for supervised learning, as well as imposing hand-crafted symmetry priors or using an implicit representation to hallucinate novel viewpoints for unsupervised methods. In this work, we propose a general adversarial learning framework for solving Unsupervised 2D to Explicit 3D Style Transfer (UE3DST). Specifically, we merge two architectures: the unsupervised explicit 3D reconstruction network of Wu et al.\ and the Generative Adversarial Network (GAN) named StarGAN-v2. We experiment across three facial datasets (Basel Face Model, 3DFAW and CelebA-HQ) and show that our solution is able to outperform well established solutions such as DepthNet in 3D reconstruction and Pix2NeRF in conditional style transfer, while we also justify the individual contributions of our model components via ablation. In contrast to the aforementioned baselines, our scheme produces features for explicit 3D rendering, which can be manipulated and utilized in downstream tasks.

Title: A Study on Improving Realism of Synthetic Data for Machine Learning. (arXiv:2304.12463v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12463
Code URL: null
Copy Paste: [[2304.12463] A Study on Improving Realism of Synthetic Data for Machine Learning](http://arxiv.org/abs/2304.12463) #generative
Summary:
Synthetic-to-real data translation using generative adversarial learning has achieved significant success to improve synthetic data. Yet, there are limited studies focusing on deep evaluation and comparison of adversarial training on general-purpose synthetic data for machine learning. This work aims to train and evaluate a synthetic-to-real generative model that transforms the synthetic renderings into more realistic styles on general-purpose datasets conditioned with unlabeled real-world data. Extensive performance evaluation and comparison have been conducted through qualitative and quantitative metrics, and a defined downstream perception task.

Title: Towards Realistic Generative 3D Face Models. (arXiv:2304.12483v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12483
Code URL: null
Copy Paste: [[2304.12483] Towards Realistic Generative 3D Face Models](http://arxiv.org/abs/2304.12483) #generative
Summary:
In recent years, there has been significant progress in 2D generative face models fueled by applications such as animation, synthetic data generation, and digital avatars. However, due to the absence of 3D information, these 2D models often struggle to accurately disentangle facial attributes like pose, expression, and illumination, limiting their editing capabilities. To address this limitation, this paper proposes a 3D controllable generative face model to produce high-quality albedo and precise 3D shape leveraging existing 2D generative models. By combining 2D face generative models with semantic face manipulation, this method enables editing of detailed 3D rendered faces. The proposed framework utilizes an alternating descent optimization approach over shape and albedo. Differentiable rendering is used to train high-quality shapes and albedo without 3D supervision. Moreover, this approach outperforms the state-of-the-art (SOTA) methods in the well-known NoW benchmark for shape reconstruction. It also outperforms the SOTA reconstruction models in recovering rendered faces' identities across novel poses by an average of 10%. Additionally, the paper demonstrates direct control of expressions in 3D faces by exploiting latent space leading to text-based editing of 3D faces.

Title: Latent Traversals in Generative Models as Potential Flows. (arXiv:2304.12944v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12944
Code URL: https://github.com/kingjamessong/pdetraversal
Copy Paste: [[2304.12944] Latent Traversals in Generative Models as Potential Flows](http://arxiv.org/abs/2304.12944) #generative
Summary:
Despite the significant recent progress in deep generative models, the underlying structure of their latent spaces is still poorly understood, thereby making the task of performing semantically meaningful latent traversals an open research challenge. Most prior work has aimed to solve this challenge by modeling latent structures linearly, and finding corresponding linear directions which result in `disentangled' generations. In this work, we instead propose to model latent structures with a learned dynamic potential landscape, thereby performing latent traversals as the flow of samples down the landscape's gradient. Inspired by physics, optimal transport, and neuroscience, these potential landscapes are learned as physically realistic partial differential equations, thereby allowing them to flexibly vary over both space and time. To achieve disentanglement, multiple potentials are learned simultaneously, and are constrained by a classifier to be distinct and semantically self-consistent. Experimentally, we demonstrate that our method achieves both more qualitatively and quantitatively disentangled trajectories than state-of-the-art baselines. Further, we demonstrate that our method can be integrated as a regularization term during training, thereby acting as an inductive bias towards the learning of structured representations, ultimately improving model likelihood on similarly structured data.

Title: Causal Semantic Communication for Digital Twins: A Generalizable Imitation Learning Approach. (arXiv:2304.12502v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12502
Code URL: null
Copy Paste: [[2304.12502] Causal Semantic Communication for Digital Twins: A Generalizable Imitation Learning Approach](http://arxiv.org/abs/2304.12502) #generative
Summary:
A digital twin (DT) leverages a virtual representation of the physical world, along with communication (e.g., 6G), computing (e.g., edge computing), and artificial intelligence (AI) technologies to enable many connected intelligence services. In order to handle the large amounts of network data based on digital twins (DTs), wireless systems can exploit the paradigm of semantic communication (SC) for facilitating informed decision-making under strict communication constraints by utilizing AI techniques such as causal reasoning. In this paper, a novel framework called causal semantic communication (CSC) is proposed for DT-based wireless systems. The CSC system is posed as an imitation learning (IL) problem, where the transmitter, with access to optimal network control policies using a DT, teaches the receiver using SC over a bandwidth limited wireless channel how to improve its knowledge to perform optimal control actions. The causal structure in the source data is extracted using novel approaches from the framework of deep end-to-end causal inference, thereby enabling the creation of a semantic representation that is causally invariant, which in turn helps generalize the learned knowledge of the system to unseen scenarios. The CSC decoder at the receiver is designed to extract and estimate semantic information while ensuring high semantic reliability. The receiver control policies, semantic decoder, and causal inference are formulated as a bi-level optimization problem within a variational inference framework. This problem is solved using a novel concept called network state models, inspired from world models in generative AI, that faithfully represents the environment dynamics leading to data generation. Simulation results demonstrate that the proposed CSC system outperforms state-of-the-art SC systems by achieving better semantic reliability and reduced semantic representation.

Title: Controlling Posterior Collapse by an Inverse Lipschitz Constraint on the Decoder Network. (arXiv:2304.12770v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12770
Code URL: null
Copy Paste: [[2304.12770] Controlling Posterior Collapse by an Inverse Lipschitz Constraint on the Decoder Network](http://arxiv.org/abs/2304.12770) #generative
Summary:
Variational autoencoders (VAEs) are one of the deep generative models that have experienced enormous success over the past decades. However, in practice, they suffer from a problem called posterior collapse, which occurs when the encoder coincides, or collapses, with the prior taking no information from the latent structure of the input data into consideration. In this work, we introduce an inverse Lipschitz neural network into the decoder and, based on this architecture, provide a new method that can control in a simple and clear manner the degree of posterior collapse for a wide range of VAE models equipped with a concrete theoretical guarantee. We also illustrate the effectiveness of our method through several numerical experiments.

Title: The Score-Difference Flow for Implicit Generative Modeling. (arXiv:2304.12906v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12906
Code URL: null
Copy Paste: [[2304.12906] The Score-Difference Flow for Implicit Generative Modeling](http://arxiv.org/abs/2304.12906) #generative
Summary:
Implicit generative modeling (IGM) aims to produce samples of synthetic data matching the characteristics of a target data distribution. Recent work (e.g. score-matching networks, diffusion models) has approached the IGM problem from the perspective of pushing synthetic source data toward the target distribution via dynamical perturbations or flows in the ambient space. We introduce the score difference (SD) between arbitrary target and source distributions as a flow that optimally reduces the Kullback-Leibler divergence between them while also solving the Schr\"odinger bridge problem. We apply the SD flow to convenient proxy distributions, which are aligned if and only if the original distributions are aligned. We demonstrate the formal equivalence of this formulation to denoising diffusion models under certain conditions. However, unlike diffusion models, SD flow places no restrictions on the prior distribution. We also show that the training of generative adversarial networks includes a hidden data-optimization sub-problem, which induces the SD flow under certain choices of loss function when the discriminator is optimal. As a result, the SD flow provides a theoretical link between model classes that, taken together, address all three challenges of the "generative modeling trilemma": high sample quality, mode coverage, and fast sampling.

Title: Towards Theoretical Understanding of Inverse Reinforcement Learning. (arXiv:2304.12966v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12966
Code URL: null
Copy Paste: [[2304.12966] Towards Theoretical Understanding of Inverse Reinforcement Learning](http://arxiv.org/abs/2304.12966) #generative
Summary:
Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent. A well-known limitation of IRL is the ambiguity in the choice of the reward function, due to the existence of multiple rewards that explain the observed behavior. This limitation has been recently circumvented by formulating IRL as the problem of estimating the feasible reward set, i.e., the region of the rewards compatible with the expert's behavior. In this paper, we make a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model. We start by formally introducing the problem of estimating the feasible reward set, the corresponding PAC requirement, and discussing the properties of particular classes of rewards. Then, we provide the first minimax lower bound on the sample complexity for the problem of estimating the feasible reward set of order ${\Omega}\Bigl( \frac{H^3SA}{\epsilon^2} \bigl( \log \bigl(\frac{1}{\delta}\bigl) + S \bigl)\Bigl)$, being $S$ and $A$ the number of states and actions respectively, $H$ the horizon, $\epsilon$ the desired accuracy, and $\delta$ the confidence. We analyze the sample complexity of a uniform sampling strategy (US-IRL), proving a matching upper bound up to logarithmic factors. Finally, we outline several open questions in IRL and propose future research directions.

large language model

Title: Compressing Sentence Representation with maximum Coding Rate Reduction. (arXiv:2304.12674v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.12674
Code URL: null
Copy Paste: [[2304.12674] Compressing Sentence Representation with maximum Coding Rate Reduction](http://arxiv.org/abs/2304.12674) #large language model
Summary:
In most natural language inference problems, sentence representation is needed for semantic retrieval tasks. In recent years, pre-trained large language models have been quite effective for computing such representations. These models produce high-dimensional sentence embeddings. An evident performance gap between large and small models exists in practice. Hence, due to space and time hardware limitations, there is a need to attain comparable results when using the smaller model, which is usually a distilled version of the large language model. In this paper, we assess the model distillation of the sentence representation model Sentence-BERT by augmenting the pre-trained distilled model with a projection layer additionally learned on the Maximum Coding Rate Reduction (MCR2)objective, a novel approach developed for general-purpose manifold clustering. We demonstrate that the new language model with reduced complexity and sentence embedding size can achieve comparable results on semantic retrieval benchmarks.

Title: Answering Questions by Meta-Reasoning over Multiple Chains of Thought. (arXiv:2304.13007v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2304.13007
Code URL: https://github.com/oriyor/reasoning-on-cots
Copy Paste: [[2304.13007] Answering Questions by Meta-Reasoning over Multiple Chains of Thought](http://arxiv.org/abs/2304.13007) #large language model
Summary:
Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregating their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers.

Title: Blockchain Large Language Models. (arXiv:2304.12749v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2304.12749
Code URL: null
Copy Paste: [[2304.12749] Blockchain Large Language Models](http://arxiv.org/abs/2304.12749) #large language model
Summary:
This paper presents a dynamic, real-time approach to detecting anomalous blockchain transactions. The proposed tool, TXRANK, generates tracing representations of blockchain activity and trains from scratch a large language model to act as a real-time Intrusion Detection System. Unlike traditional methods, TXRANK is designed to offer an unrestricted search space and does not rely on predefined rules or patterns, enabling it to detect a broader range of anomalies. We demonstrate the effectiveness of TXRANK through its use as an anomaly detection tool for Ethereum transactions. In our experiments, it effectively identifies abnormal transactions among a dataset of 68M transactions and has a batched throughput of 2284 transactions per second on average. Our results show that, TXRANK identifies abnormal transactions by ranking 49 out of 124 attacks among the top-3 most abnormal transactions interacting with their victim contracts. This work makes contributions to the field of blockchain transaction analysis by introducing a custom data encoding compatible with the transformer architecture, a domain-specific tokenization technique, and a tree encoding method specifically crafted for the Ethereum Virtual Machine (EVM) trace representation.

Title: N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models. (arXiv:2304.12918v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12918
Code URL: null
Copy Paste: [[2304.12918] N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models](http://arxiv.org/abs/2304.12918) #large language model
Summary:
Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than current manual methods, that will better scale these methods to Large Language Models (LLMs). We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation. N2G represents a step towards scalable interpretability methods by allowing us to convert neurons in an LLM to interpretable representations of measurable quality.

Title: A Closer Look at Reward Decomposition for High-Level Robotic Explanations. (arXiv:2304.12958v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2304.12958
Code URL: null
Copy Paste: [[2304.12958] A Closer Look at Reward Decomposition for High-Level Robotic Explanations](http://arxiv.org/abs/2304.12958) #large language model
Summary:
Explaining the behavior of intelligent agents such as robots to humans is challenging due to their incomprehensible proprioceptive states, variational intermediate goals, and resultant unpredictability. Moreover, one-step explanations for reinforcement learning agents can be ambiguous as they fail to account for the agent's future behavior at each transition, adding to the complexity of explaining robot actions. By leveraging abstracted actions that map to task-specific primitives, we avoid explanations on the movement level. Our proposed framework combines reward decomposition (RD) with abstracted action spaces into an explainable learning framework, allowing for non-ambiguous and high-level explanations based on object properties in the task. We demonstrate the effectiveness of our framework through quantitative and qualitative analysis of two robot scenarios, showcasing visual and textual explanations, from output artifacts of RD explanation, that are easy for humans to comprehend. Additionally, we demonstrate the versatility of integrating these artifacts with large language models for reasoning and interactive querying.

segmentation

Title: AutoFocusFormer: Image Segmentation off the Grid. (arXiv:2304.12406v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12406
Code URL: null
Copy Paste: [[2304.12406] AutoFocusFormer: Image Segmentation off the Grid](http://arxiv.org/abs/2304.12406) #segmentation
Summary:
Real world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while other areas are scattered with many small objects. Yet, the commonly used successive grid downsampling strategy in convolutional deep networks treats all areas equally. Hence, small objects are represented in very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining more pixels representing small objects during downsampling helps to preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone, which performs adaptive downsampling by learning to retain the most important pixels for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that our AutoFocusFormer (AFF) improves significantly over baseline models of similar sizes.

Title: Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation. (arXiv:2304.12620v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12620
Code URL: null
Copy Paste: [[2304.12620] Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation](http://arxiv.org/abs/2304.12620) #segmentation
Summary:
The Segment Anything Model (SAM) has recently gained popularity in the field of image segmentation. Thanks to its impressive capabilities in all-round segmentation tasks and its prompt-based interface, SAM has sparked intensive discussion within the community. It is even said by many prestigious experts that image segmentation task has been "finished" by SAM. However, medical image segmentation, although an important branch of the image segmentation family, seems not to be included in the scope of Segmenting "Anything". Many individual experiments and recent studies have shown that SAM performs subpar in medical image segmentation. A natural question is how to find the missing piece of the puzzle to extend the strong segmentation capability of SAM to medical image segmentation. In this paper, we present a possible solution by fine-tuning the pretrained SAM model following parameter-efficient fine-tuning paradigm with Adapter. Although this work is still one of a few to transfer the popular NLP technique Adapter to computer vision cases, this simple implementation shows surprisingly good performance on medical image segmentation. A medical image adapted SAM, which we have dubbed Medical SAM Adapter (MSA), shows superior performance on 19 medical image segmentation tasks with various image modalities including CT, MRI, ultrasound image, fundus image, and dermoscopic images. MSA outperforms a wide range of state-of-the-art (SOTA) medical image segmentation methods, such as nnUNet, TransUNet, UNetr, MedSegDiff, and so on. Code will be released at: https://github.com/WuJunde/Medical-SAM-Adapter.

Title: Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medical Segmentation. (arXiv:2304.12637v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12637
Code URL: null
Copy Paste: [[2304.12637] Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medical Segmentation](http://arxiv.org/abs/2304.12637) #segmentation
Summary:
We examine the recent Segment Anything Model (SAM) on medical images, and report both quantitative and qualitative zero-shot segmentation results on nine medical image segmentation benchmarks, covering various imaging modalities, such as optical coherence tomography (OCT), magnetic resonance imaging (MRI), and computed tomography (CT), as well as different applications including dermatology, ophthalmology, and radiology. Our experiments reveal that while SAM demonstrates stunning segmentation performance on images from the general domain, for those out-of-distribution images, e.g., medical images, its zero-shot segmentation performance is still limited. Furthermore, SAM demonstrated varying zero-shot segmentation performance across different unseen medical domains. For example, it had a 0.8704 mean Dice score on segmenting under-bruch's membrane layer of retinal OCT, whereas the segmentation accuracy drops to 0.0688 when segmenting retinal pigment epithelium. For certain structured targets, e.g., blood vessels, the zero-shot segmentation of SAM completely failed, whereas a simple fine-tuning of it with small amount of data could lead to remarkable improvements of the segmentation quality. Our study indicates the versatility of generalist vision foundation models on solving specific tasks in medical imaging, and their great potential to achieve desired performance through fine-turning and eventually tackle the challenges of accessing large diverse medical datasets and the complexity of medical domains.

Title: Change detection needs change information: improving deep 3D point cloud change detection. (arXiv:2304.12639v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.12639
Code URL: null
Copy Paste: [[2304.12639] Change detection needs change information: improving deep 3D point cloud change detection](http://arxiv.org/abs/2304.12639) #segmentation
Summary:
Change detection is an important task to rapidly identify modified areas, in particular when multi-temporal data are concerned. In landscapes with complex geometry such as urban environment, vertical information turn out to be a very useful knowledge not only to highlight changes but also to classify them into different categories. In this paper, we focus on change segmentation directly using raw 3D point clouds (PCs), to avoid any loss of information due to rasterization processes. While deep learning has recently proved its effectiveness for this particular task by encoding the information through Siamese networks, we investigate here the idea of also using change information in early steps of deep networks. To do this, we first propose to provide the Siamese KPConv State-of-The-Art (SoTA) network with hand-crafted features and especially a change-related one. This improves the mean of Intersection over Union (IoU) over classes of change by 4.70\%. Considering that the major improvement was obtained thanks to the change-related feature, we propose three new architectures to address 3D PCs change segmentation: OneConvFusion, Triplet KPConv, and Encoder Fusion SiamKPConv. All the three networks take into account change information in early steps and outperform SoTA methods. In particular, the last network, entitled Encoder Fusion SiamKPConv, overtakes SoTA with more than 5% of mean of IoU over classes of change emphasizing the value of having the network focus on change information for change detection task.

Title: Segment anything, from space?. (arXiv:2304.13000v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13000
Code URL: null
Copy Paste: [[2304.13000] Segment anything, from space?](http://arxiv.org/abs/2304.13000) #segmentation
Summary:
Recently, the first foundation model developed specifically for vision tasks was developed, termed the "Segment Anything Model" (SAM). SAM can segment objects in input imagery based upon cheap input prompts, such as one (or more) points, a bounding box, or a mask. The authors examined the zero-shot image segmentation accuracy of SAM on a large number of vision benchmark tasks and found that SAM usually achieved recognition accuracy similar to, or sometimes exceeding, vision models that had been trained on the target tasks. The impressive generalization of SAM for segmentation has major implications for vision researchers working on natural imagery. In this work, we examine whether SAM's impressive performance extends to overhead imagery problems, and help guide the community's response to its development. We examine SAM's performance on a set of diverse and widely-studied benchmark tasks. We find that SAM does often generalize well to overhead imagery, although it fails in some cases due to the unique characteristics of overhead imagery and the target objects. We report on these unique systematic failure cases for remote sensing imagery that may comprise useful future research for the community. Note that this is a working paper, and it will be updated as additional analysis and results are completed.

Title: Methods and datasets for segmentation of minimally invasive surgical instruments in endoscopic images and videos: A review of the state of the art. (arXiv:2304.13014v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2304.13014
Code URL: null
Copy Paste: [[2304.13014] Methods and datasets for segmentation of minimally invasive surgical instruments in endoscopic images and videos: A review of the state of the art](http://arxiv.org/abs/2304.13014) #segmentation
Summary:
In the field of computer- and robot-assisted minimally invasive surgery, enormous progress has been made in recent years based on the recognition of surgical instruments in endoscopic images. Especially the determination of the position and type of the instruments is of great interest here. Current work involves both spatial and temporal information with the idea, that the prediction of movement of surgical tools over time may improve the quality of final segmentations. The provision of publicly available datasets has recently encouraged the development of new methods, mainly based on deep learning. In this review, we identify datasets used for method development and evaluation, as well as quantify their frequency of use in the literature. We further present an overview of the current state of research regarding the segmentation and tracking of minimally invasive surgical instruments in endoscopic images. The paper focuses on methods that work purely visually without attached markers of any kind on the instruments, taking into account both single-frame segmentation approaches as well as those involving temporal information. A discussion of the reviewed literature is provided, highlighting existing shortcomings and emphasizing available potential for future developments. The publications considered were identified through the platforms Google Scholar, Web of Science, and PubMed. The search terms used were "instrument segmentation", "instrument tracking", "surgical tool segmentation", and "surgical tool tracking" and result in 408 articles published between 2015 and 2022 from which 109 were included using systematic selection criteria.