secure

Title: Deep Learning for Iris Recognition: A Review. (arXiv:2303.08514v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08514
Code URL: null
Copy Paste: [[2303.08514] Deep Learning for Iris Recognition: A Review](http://arxiv.org/abs/2303.08514) #secure
Summary:
Iris recognition is a secure biometric technology known for its stability and privacy. With no two irises being identical and little change throughout a person's lifetime, iris recognition is considered more reliable and less susceptible to external factors than other biometric recognition methods. Unlike traditional machine learning-based iris recognition methods, deep learning technology does not rely on feature engineering and boasts excellent performance. This paper collects 120 relevant papers to summarize the development of iris recognition based on deep learning. We first introduce the background of iris recognition and the motivation and contribution of this survey. Then, we present the common datasets widely used in iris recognition. After that, we summarize the key tasks involved in the process of iris recognition based on deep learning technology, including identification, segmentation, presentation attack detection, and localization. Finally, we discuss the challenges and potential development of iris recognition. This review provides a comprehensive sight of the research of iris recognition based on deep learning.

Title: Efficient and Secure Federated Learning for Financial Applications. (arXiv:2303.08355v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08355
Code URL: null
Copy Paste: [[2303.08355] Efficient and Secure Federated Learning for Financial Applications](http://arxiv.org/abs/2303.08355) #secure
Summary:
The conventional machine learning (ML) and deep learning approaches need to share customers' sensitive information with an external credit bureau to generate a prediction model that opens the door to privacy leakage. This leakage risk makes financial companies face an enormous challenge in their cooperation. Federated learning is a machine learning setting that can protect data privacy, but the high communication cost is often the bottleneck of the federated systems, especially for large neural networks. Limiting the number and size of communications is necessary for the practical training of large neural structures. Gradient sparsification has received increasing attention as a method to reduce communication cost, which only updates significant gradients and accumulates insignificant gradients locally. However, the secure aggregation framework cannot directly use gradient sparsification. This article proposes two sparsification methods to reduce communication cost in federated learning. One is a time-varying hierarchical sparsification method for model parameter update, which solves the problem of maintaining model accuracy after high ratio sparsity. It can significantly reduce the cost of a single communication. The other is to apply the sparsification method to the secure aggregation framework. We sparse the encryption mask matrix to reduce the cost of communication while protecting privacy. Experiments show that under different Non-IID experiment settings, our method can reduce the upload communication cost to about 2.9% to 18.9% of the conventional federated learning algorithm when the sparse rate is 0.01.

security

Title: Real Face Foundation Representation Learning for Generalized Deepfake Detection. (arXiv:2303.08439v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08439
Code URL: null
Copy Paste: [[2303.08439] Real Face Foundation Representation Learning for Generalized Deepfake Detection](http://arxiv.org/abs/2303.08439) #security
Summary:
The emergence of deepfake technologies has become a matter of social concern as they pose threats to individual privacy and public security. It is now of great significance to develop reliable deepfake detectors. However, with numerous face manipulation algorithms present, it is almost impossible to collect sufficient representative fake faces, and it is hard for existing detectors to generalize to all types of manipulation. Therefore, we turn to learn the distribution of real faces, and indirectly identify fake images that deviate from the real face distribution. In this study, we propose Real Face Foundation Representation Learning (RFFR), which aims to learn a general representation from large-scale real face datasets and detect potential artifacts outside the distribution of RFFR. Specifically, we train a model on real face datasets by masked image modeling (MIM), which results in a discrepancy between input faces and the reconstructed ones when applying the model on fake samples. This discrepancy reveals the low-level artifacts not contained in RFFR, making it easier to build a deepfake detector sensitive to all kinds of potential artifacts outside the distribution of RFFR. Extensive experiments demonstrate that our method brings about better generalization performance, as it significantly outperforms the state-of-the-art methods in cross-manipulation evaluations, and has the potential to further improve by introducing extra real faces for training RFFR.

Title: Exploring Large-scale Unlabeled Faces to Enhance Facial Expression Recognition. (arXiv:2303.08617v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08617
Code URL: null
Copy Paste: [[2303.08617] Exploring Large-scale Unlabeled Faces to Enhance Facial Expression Recognition](http://arxiv.org/abs/2303.08617) #security
Summary:
Facial Expression Recognition (FER) is an important task in computer vision and has wide applications in human-computer interaction, intelligent security, emotion analysis, and other fields. However, the limited size of FER datasets limits the generalization ability of expression recognition models, resulting in ineffective model performance. To address this problem, we propose a semi-supervised learning framework that utilizes unlabeled face data to train expression recognition models effectively. Our method uses a dynamic threshold module (\textbf{DTM}) that can adaptively adjust the confidence threshold to fully utilize the face recognition (FR) data to generate pseudo-labels, thus improving the model's ability to model facial expressions. In the ABAW5 EXPR task, our method achieved excellent results on the official validation set.

Title: Compact and Divisible E-Cash with Threshold Issuance. (arXiv:2303.08221v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2303.08221
Code URL: null
Copy Paste: [[2303.08221] Compact and Divisible E-Cash with Threshold Issuance](http://arxiv.org/abs/2303.08221) #security
Summary:
Decentralized, offline, and privacy-preserving e-cash could fulfil the need for both scalable and byzantine fault-resistant payment systems. Existing offline anonymous e-cash schemes are unsuitable for distributed environments due to a central bank. We construct a distributed offline anonymous e-cash scheme, in which the role of the bank is performed by a quorum of authorities, and present its two instantiations. Our first scheme is compact, i.e. the cost of the issuance protocol and the size of a wallet are independent of the number of coins issued, but the cost of payment grows linearly with the number of coins spent. Our second scheme is divisible and thus the cost of payments is also independent of the number of coins spent, but the verification of deposits is more costly. We provide formal security proof of both schemes and compare the efficiency of their implementations.

privacy

Title: Fashion-model pose recommendation and generation using Machine Learning. (arXiv:2303.08660v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08660
Code URL: null
Copy Paste: [[2303.08660] Fashion-model pose recommendation and generation using Machine Learning](http://arxiv.org/abs/2303.08660) #privacy
Summary:
Fashion-model pose is an important attribute in the fashion industry. Creative directors, modeling production houses, and top photographers always look for professional models able to pose. without the skill to correctly pose, their chances of landing professional modeling employment are regrettably quite little. There are occasions when models and photographers are unsure of the best pose to strike while taking photographs. This research concentrates on suggesting the fashion personnel a series of similar images based on the input image. The image is segmented into different parts and similar images are suggested for the user. This was achieved by calculating the color histogram of the input image and applying the same for all the images in the dataset and comparing the histograms. Synthetic images have become popular to avoid privacy concerns and to overcome the high cost of photoshoots. Hence, this paper also extends the work of generating synthetic images from the recommendation engine using styleGAN to an extent.

protect

defense

attack

Title: Model Extraction Attacks on Split Federated Learning. (arXiv:2303.08581v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08581
Code URL: null
Copy Paste: [[2303.08581] Model Extraction Attacks on Split Federated Learning](http://arxiv.org/abs/2303.08581) #attack
Summary:
Federated Learning (FL) is a popular collaborative learning scheme involving multiple clients and a server. FL focuses on protecting clients' data but turns out to be highly vulnerable to Intellectual Property (IP) threats. Since FL periodically collects and distributes the model parameters, a free-rider can download the latest model and thus steal model IP. Split Federated Learning (SFL), a recent variant of FL that supports training with resource-constrained clients, splits the model into two, giving one part of the model to clients (client-side model), and the remaining part to the server (server-side model). Thus SFL prevents model leakage by design. Moreover, by blocking prediction queries, it can be made resistant to advanced IP threats such as traditional Model Extraction (ME) attacks. While SFL is better than FL in terms of providing IP protection, it is still vulnerable. In this paper, we expose the vulnerability of SFL and show how malicious clients can launch ME attacks by querying the gradient information from the server side. We propose five variants of ME attack which differs in the gradient usage as well as in the data assumptions. We show that under practical cases, the proposed ME attacks work exceptionally well for SFL. For instance, when the server-side model has five layers, our proposed ME attack can achieve over 90% accuracy with less than 2% accuracy degradation with VGG-11 on CIFAR-10.

robust

Title: Rotation-Invariant Transformer for Point Cloud Matching. (arXiv:2303.08231v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08231
Code URL: null
Copy Paste: [[2303.08231] Rotation-Invariant Transformer for Point Cloud Matching](http://arxiv.org/abs/2303.08231) #robust
Summary:
The intrinsic rotation invariance lies at the core of matching point clouds with handcrafted descriptors, but it is despised by most of the recent deep matchers. As an alternative, they obtain the rotation invariance extrinsically via data augmentation. However, the continuous SO(3) space can never be covered by the finite number of augmented rotations, resulting in their instability when facing rotations that are rarely seen. To this end, we introduce RoITr, a Rotation-Invariant Transformer to cope with the pose variations in the point cloud matching task. We contribute both on the local and global levels. Starting from the local level, we introduce an attention mechanism embedded with Point Pair Feature (PPF)-based coordinates to describe the pose-invariant geometry, upon which a novel attention-based encoder-decoder is constructed. We further propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism, which significantly improves the feature distinctiveness and makes the model robust with respect to the low overlap. Experiments are conducted on both the rigid and non-rigid public benchmarks, where RoITr outperforms all the state-of-the-art models by a considerable margin in the low-overlapping scenarios. Especially when the rotations are enlarged on the challenging 3DLoMatch benchmark, RoITr surpasses the existing methods by at least 13 and 5 percentage points in terms of the Inlier Ratio and the Registration Recall, respectively.

Title: Improving Adversarial Robustness with Hypersphere Embedding and Angular-based Regularizations. (arXiv:2303.08289v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08289
Code URL: null
Copy Paste: [[2303.08289] Improving Adversarial Robustness with Hypersphere Embedding and Angular-based Regularizations](http://arxiv.org/abs/2303.08289) #robust
Summary:
Adversarial training (AT) methods have been found to be effective against adversarial attacks on deep neural networks. Many variants of AT have been proposed to improve its performance. Pang et al. [1] have recently shown that incorporating hypersphere embedding (HE) into the existing AT procedures enhances robustness. We observe that the existing AT procedures are not designed for the HE framework, and thus fail to adequately learn the angular discriminative information available in the HE framework. In this paper, we propose integrating HE into AT with regularization terms that exploit the rich angular information available in the HE framework. Specifically, our method, termed angular-AT, adds regularization terms to AT that explicitly enforce weight-feature compactness and inter-class separation; all expressed in terms of angular features. Experimental results show that angular-AT further improves adversarial robustness.

Title: Guided Slot Attention for Unsupervised Video Object Segmentation. (arXiv:2303.08314v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08314
Code URL: null
Copy Paste: [[2303.08314] Guided Slot Attention for Unsupervised Video Object Segmentation](http://arxiv.org/abs/2303.08314) #robust
Summary:
Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects make this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground--background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot--template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.

Title: Rethinking Optical Flow from Geometric Matching Consistent Perspective. (arXiv:2303.08384v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08384
Code URL: https://github.com/dqiaole/matchflow
Copy Paste: [[2303.08384] Rethinking Optical Flow from Geometric Matching Consistent Perspective](http://arxiv.org/abs/2303.08384) #robust
Summary:
Optical flow estimation is a challenging problem remaining unsolved. Recent deep learning based optical flow models have achieved considerable success. However, these models often train networks from the scratch on standard optical flow data, which restricts their ability to robustly and geometrically match image features. In this paper, we propose a rethinking to previous optical flow estimation. We particularly leverage Geometric Image Matching (GIM) as a pre-training task for the optical flow estimation (MatchFlow) with better feature representations, as GIM shares some common challenges as optical flow estimation, and with massive labeled real-world data. Thus, matching static scenes helps to learn more fundamental feature correlations of objects and scenes with consistent displacements. Specifically, the proposed MatchFlow model employs a QuadTree attention-based network pre-trained on MegaDepth to extract coarse features for further flow regression. Extensive experiments show that our model has great cross-dataset generalization. Our method achieves 11.5% and 10.1% error reduction from GMA on Sintel clean pass and KITTI test set. At the time of anonymous submission, our MatchFlow(G) enjoys state-of-the-art performance on Sintel clean and final pass compared to published approaches with comparable computation and memory footprint. Codes and models will be released in https://github.com/DQiaole/MatchFlow.

Title: A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval. (arXiv:2303.08398v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08398
Code URL: null
Copy Paste: [[2303.08398] A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval](http://arxiv.org/abs/2303.08398) #robust
Summary:
Content-based image retrieval is the process of retrieving a subset of images from an extensive image gallery based on visual contents, such as color, shape or spatial relations, and texture. In some applications, such as localization, image retrieval is employed as the initial step. In such cases, the accuracy of the top-retrieved images significantly affects the overall system accuracy. The current paper introduces a simple yet efficient image retrieval system with a fewer trainable parameters, which offers acceptable accuracy in top-retrieved images. The proposed method benefits from a dilated residual convolutional neural network with triplet loss. Experimental evaluations show that this model can extract richer information (i.e., high-resolution representations) by enlarging the receptive field, thus improving image retrieval accuracy without increasing the depth or complexity of the model. To enhance the extracted representations' robustness, the current research obtains candidate regions of interest from each feature map and applies Generalized-Mean pooling to the regions. As the choice of triplets in a triplet-based network affects the model training, we employ a triplet online mining method. We test the performance of the proposed method under various configurations on two of the challenging image-retrieval datasets, namely Revisited Paris6k (RPar) and UKBench. The experimental results show an accuracy of 94.54 and 80.23 (mean precision at rank 10) in the RPar medium and hard modes and 3.86 (recall at rank 4) in the UKBench dataset, respectively.

Title: Learning Accurate Template Matching with Differentiable Coarse-to-Fine Correspondence Refinement. (arXiv:2303.08438v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08438
Code URL: https://github.com/zhirui-gao/deep-template-matching
Copy Paste: [[2303.08438] Learning Accurate Template Matching with Differentiable Coarse-to-Fine Correspondence Refinement](http://arxiv.org/abs/2303.08438) #robust
Summary:
Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle the challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network using references and aligned images to obtain sub-pixel level correspondences which are used to give the final geometric transformation. Extensive evaluation shows that our method is significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.

Title: BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection. (arXiv:2303.08498v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08498
Code URL: https://github.com/adlab-autodrive/bevheight
Copy Paste: [[2303.08498] BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection](http://arxiv.org/abs/2303.08498) #robust
Summary:
While most recent autonomous driving system focuses on developing perception methods on ego-vehicle sensors, people tend to overlook an alternative approach to leverage intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that the state-of-the-art vision-centric bird's eye view detection methods have inferior performances on roadside cameras. This is because these methods mainly focus on recovering the depth regarding the camera center, where the depth difference between the car and the ground quickly shrinks while the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight, to address this issue. In essence, instead of predicting the pixel-wise depth, we regress the height to the ground to achieve a distance-agnostic formulation to ease the optimization process of camera-only perception methods. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. The code is available at {\url{https://github.com/ADLab-AutoDrive/BEVHeight}}.

Title: MGA: Medical generalist agent through text-guided knowledge transformation. (arXiv:2303.08562v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08562
Code URL: null
Copy Paste: [[2303.08562] MGA: Medical generalist agent through text-guided knowledge transformation](http://arxiv.org/abs/2303.08562) #robust
Summary:
Multi-modal representation methods have achieved advanced performance in medical applications by extracting more robust features from multi-domain data. However, existing methods usually need to train additional branches for downstream tasks, which may increase the model complexities in clinical applications as well as introduce additional human inductive bias. Besides, very few studies exploit the rich clinical knowledge embedded in clinical daily reports. To this end, we propose a novel medical generalist agent, MGA, that can address three kinds of common clinical tasks via clinical reports knowledge transformation. Unlike the existing methods, MGA can easily adapt to different tasks without specific downstream branches when their corresponding annotations are missing. More importantly, we are the first attempt to use medical professional language guidance as a transmission medium to guide the agent's behavior. The proposed method is implemented on four well-known X-ray open-source datasets, MIMIC-CXR, CheXpert, MIMIC-CXR-JPG, and MIMIC-CXR-MS. Promising results are obtained, which validate the effectiveness of our proposed MGA. Code is available at: https://github.com/SZUHvern/MGA

Title: MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving. (arXiv:2303.08600v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08600
Code URL: https://github.com/jialeli1/lidarseg3d
Copy Paste: [[2303.08600] MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving](http://arxiv.org/abs/2303.08600) #robust
Summary:
LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. The popular LiDAR-only methods severely suffer from inferior segmentation on small and distant objects due to insufficient laser points, while the robust multi-modal solution is under-explored, where we investigate three crucial inherent difficulties: modality heterogeneity, limited sensor field of view intersection, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion GF-Phase, cross-modal feature completion, and semantic-based feature fusion SF-Phase on all visible points. The multi-modal data augmentation is reinvigorated by applying asymmetric transformations on LiDAR point cloud and multi-camera images individually, which benefits the model training with diversified augmentation transformations. MSeg3D achieves state-of-the-art results on nuScenes, Waymo, and SemanticKITTI datasets. Under the malfunctioning multi-camera input and the multi-frame point clouds input, MSeg3D still shows robustness and improves the LiDAR-only baseline. Our code is publicly available at \url{https://github.com/jialeli1/lidarseg3d}.

Title: Improving Fast Auto-Focus with Event Polarity. (arXiv:2303.08611v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08611
Code URL: null
Copy Paste: [[2303.08611] Improving Fast Auto-Focus with Event Polarity](http://arxiv.org/abs/2303.08611) #robust
Summary:
Fast and accurate auto-focus in adverse conditions remains an arduous task. The paper presents a polarity-based event camera auto-focus algorithm featuring high-speed, precise auto-focus in dark, dynamic scenes that conventional frame-based cameras cannot match. Specifically, the symmetrical relationship between the event polarities in focusing is investigated, and the event-based focus evaluation function is proposed based on the principles of the event cameras and the imaging model in the focusing process. Comprehensive experiments on the public EAD dataset show the robustness of the model. Furthermore, precise focus with less than one depth of focus is achieved within 0.004 seconds on our self-built high-speed focusing platform. The dataset and code will be made publicly available.

Title: Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video. (arXiv:2303.08670v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08670
Code URL: null
Copy Paste: [[2303.08670] Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video](http://arxiv.org/abs/2303.08670) #robust
Summary:
Forced alignment refers to a technology that time-aligns a given transcription with a corresponding speech. However, as the forced alignment technologies have developed using speech audio, they might fail in alignment when the input speech audio is noise-corrupted or is not accessible. We focus on that there is another component that the speech can be inferred from, the speech video (i.e., talking face video). Since the drawbacks of audio-based forced alignment can be complemented using the visual information when the audio signal is under poor condition, we try to develop a novel video-based forced alignment method. However, different from audio forced alignment, it is challenging to develop a reliable visual forced alignment technology for the following two reasons: 1) Visual Speech Recognition (VSR) has a much lower performance compared to audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the method typically used for building audio forced alignment cannot be utilized in developing visual forced alignment. In order to alleviate these challenges, in this paper, we propose a new method that is appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio. Moreover, by augmenting the alignment task with anomaly case detection, DVFA can detect mismatches between the input transcription and the input video while performing the alignment. Therefore, we can robustly align the text with the talking face video even if there exist error words in the text. Through extensive experiments, we show the effectiveness of the proposed DVFA not only in the alignment task but also in interpreting the outputs of VSR models.

Title: RefiNeRF: Modelling dynamic neural radiance fields with inconsistent or missing camera parameters. (arXiv:2303.08695v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08695
Code URL: null
Copy Paste: [[2303.08695] RefiNeRF: Modelling dynamic neural radiance fields with inconsistent or missing camera parameters](http://arxiv.org/abs/2303.08695) #robust
Summary:
Novel view synthesis (NVS) is a challenging task in computer vision that involves synthesizing new views of a scene from a limited set of input images. Neural Radiance Fields (NeRF) have emerged as a powerful approach to address this problem, but they require accurate knowledge of camera \textit{intrinsic} and \textit{extrinsic} parameters. Traditionally, structure-from-motion (SfM) and multi-view stereo (MVS) approaches have been used to extract camera parameters, but these methods can be unreliable and may fail in certain cases. In this paper, we propose a novel technique that leverages unposed images from dynamic datasets, such as the NVIDIA dynamic scenes dataset, to learn camera parameters directly from data. Our approach is highly extensible and can be integrated into existing NeRF architectures with minimal modifications. We demonstrate the effectiveness of our method on a variety of static and dynamic scenes and show that it outperforms traditional SfM and MVS approaches. The code for our method is publicly available at \href{https://github.com/redacted/refinerf}{https://github.com/redacted/refinerf}. Our approach offers a promising new direction for improving the accuracy and robustness of NVS using NeRF, and we anticipate that it will be a valuable tool for a wide range of applications in computer vision and graphics.

Title: Attention-likelihood relationship in transformers. (arXiv:2303.08288v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08288
Code URL: null
Copy Paste: [[2303.08288] Attention-likelihood relationship in transformers](http://arxiv.org/abs/2303.08288) #robust
Summary:
We analyze how large language models (LLMs) represent out-of-context words, investigating their reliance on the given context to capture their semantics. Our likelihood-guided text perturbations reveal a correlation between token likelihood and attention values in transformer-based language models. Extensive experiments reveal that unexpected tokens cause the model to attend less to the information coming from themselves to compute their representations, particularly at higher layers. These findings have valuable implications for assessing the robustness of LLMs in real-world scenarios. Fully reproducible codebase at https://github.com/Flegyas/AttentionLikelihood.

Title: PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning. (arXiv:2303.08389v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08389
Code URL: null
Copy Paste: [[2303.08389] PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning](http://arxiv.org/abs/2303.08389) #robust
Summary:
Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning. This paper proposes Perturbation Robust Multi-Lingual CLIPScore(PR-MCS), which exhibits robustness to such perturbations, as a novel reference-free image captioning metric applicable to multiple languages. To achieve perturbation robustness, we fine-tune the text encoder of CLIP with our language-agnostic method to distinguish the perturbed text from the original text. To verify the robustness of PR-MCS, we introduce a new fine-grained evaluation dataset consisting of detailed captions, critical objects, and the relationships between the objects for 3, 000 images in five languages. In our experiments, PR-MCS significantly outperforms baseline metrics in capturing lexical noise of all various perturbation types in all five languages, proving that PR-MCS is highly robust to lexical perturbations.

Title: Systematic design space exploration by learning the explored space using Machine Learning. (arXiv:2303.08249v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08249
Code URL: null
Copy Paste: [[2303.08249] Systematic design space exploration by learning the explored space using Machine Learning](http://arxiv.org/abs/2303.08249) #robust
Summary:
Current practice in parameter space exploration in euclidean space is dominated by randomized sampling or design of experiment methods. The biggest issue with these methods is not keeping track of what part of parameter space has been explored and what has not. In this context, we utilize the geometric learning of explored data space using modern machine learning methods to keep track of already explored regions and samples from the regions that are unexplored. For this purpose, we use a modified version of a robust random-cut forest along with other heuristic-based approaches. We demonstrate our method and its progression in two-dimensional Euclidean space but it can be extended to any dimension since the underlying method is generic.

biometric

steal

extraction

Title: DICNet: Deep Instance-Level Contrastive Network for Double Incomplete Multi-View Multi-Label Classification. (arXiv:2303.08358v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08358
Code URL: null
Copy Paste: [[2303.08358] DICNet: Deep Instance-Level Contrastive Network for Double Incomplete Multi-View Multi-Label Classification](http://arxiv.org/abs/2303.08358) #extraction
Summary:
In recent years, multi-view multi-label learning has aroused extensive research enthusiasm. However, multi-view multi-label data in the real world is commonly incomplete due to the uncertain factors of data collection and manual annotation, which means that not only multi-view features are often missing, and label completeness is also difficult to be satisfied. To deal with the double incomplete multi-view multi-label classification problem, we propose a deep instance-level contrastive network, namely DICNet. Different from conventional methods, our DICNet focuses on leveraging deep neural network to exploit the high-level semantic representations of samples rather than shallow-level features. First, we utilize the stacked autoencoders to build an end-to-end multi-view feature extraction framework to learn the view-specific representations of samples. Furthermore, in order to improve the consensus representation ability, we introduce an incomplete instance-level contrastive learning scheme to guide the encoders to better extract the consensus information of multiple views and use a multi-view weighted fusion module to enhance the discrimination of semantic features. Overall, our DICNet is adept in capturing consistent discriminative representations of multi-view multi-label data and avoiding the negative effects of missing views and missing labels. Extensive experiments performed on five datasets validate that our method outperforms other state-of-the-art methods.

Title: Quality evaluation of point clouds: a novel no-reference approach using transformer-based architecture. (arXiv:2303.08634v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08634
Code URL: null
Copy Paste: [[2303.08634] Quality evaluation of point clouds: a novel no-reference approach using transformer-based architecture](http://arxiv.org/abs/2303.08634) #extraction
Summary:
With the increased interest in immersive experiences, point cloud came to birth and was widely adopted as the first choice to represent 3D media. Besides several distortions that could affect the 3D content spanning from acquisition to rendering, efficient transmission of such volumetric content over traditional communication systems stands at the expense of the delivered perceptual quality. To estimate the magnitude of such degradation, employing quality metrics became an inevitable solution. In this work, we propose a novel deep-based no-reference quality metric that operates directly on the whole point cloud without requiring extensive pre-processing, enabling real-time evaluation over both transmission and rendering levels. To do so, we use a novel model design consisting primarily of cross and self-attention layers, in order to learn the best set of local semantic affinities while keeping the best combination of geometry and color information in multiple levels from basic features extraction to deep representation modeling.

Title: Economical Quaternion Extraction from a Human Skeletal Pose Estimate using 2-D Cameras. (arXiv:2303.08657v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08657
Code URL: null
Copy Paste: [[2303.08657] Economical Quaternion Extraction from a Human Skeletal Pose Estimate using 2-D Cameras](http://arxiv.org/abs/2303.08657) #extraction
Summary:
In this paper, we present a novel algorithm to extract a quaternion from a two dimensional camera frame for estimating a contained human skeletal pose. The problem of pose estimation is usually tackled through the usage of stereo cameras and intertial measurement units for obtaining depth and euclidean distance for measurement of points in 3D space. However, the usage of these devices comes with a high signal processing latency as well as a significant monetary cost. By making use of MediaPipe, a framework for building perception pipelines for human pose estimation, the proposed algorithm extracts a quaternion from a 2-D frame capturing an image of a human object at a sub-fifty millisecond latency while also being capable of deployment at edges with a single camera frame and a generally low computational resource availability, especially for use cases involving last-minute detection and reaction by autonomous robots. The algorithm seeks to bypass the funding barrier and improve accessibility for robotics researchers involved in designing control systems.

Title: Multi-Exposure HDR Composition by Gated Swin Transformer. (arXiv:2303.08704v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08704
Code URL: null
Copy Paste: [[2303.08704] Multi-Exposure HDR Composition by Gated Swin Transformer](http://arxiv.org/abs/2303.08704) #extraction
Summary:
Fusing a sequence of perfectly aligned images captured at various exposures, has shown great potential to approach High Dynamic Range (HDR) imaging by sensors with limited dynamic range. However, in the presence of large motion of scene objects or the camera, mis-alignment is almost inevitable and leads to the notorious ``ghost'' artifacts. Besides, factors such as the noise in the dark region or color saturation in the over-bright region may also fail to fill local image details to the HDR image. This paper provides a novel multi-exposure fusion model based on Swin Transformer. Particularly, we design feature selection gates, which are integrated with the feature extraction layers to detect outliers and block them from HDR image synthesis. To reconstruct the missing local details by well-aligned and properly-exposed regions, we exploit the long distance contextual dependency in the exposure-space pyramid by the self-attention mechanism. Extensive numerical and visual evaluation has been conducted on a variety of benchmark datasets. The experiments show that our model achieves the accuracy on par with current top performing multi-exposure HDR imaging models, while gaining higher efficiency.

Title: Contextualized Medication Information Extraction Using Transformer-based Deep Learning Architectures. (arXiv:2303.08259v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08259
Code URL: null
Copy Paste: [[2303.08259] Contextualized Medication Information Extraction Using Transformer-based Deep Learning Architectures](http://arxiv.org/abs/2303.08259) #extraction
Summary:
Objective: To develop a natural language processing (NLP) system to extract medications and contextual information that help understand drug changes. This project is part of the 2022 n2c2 challenge.

Materials and methods: We developed NLP systems for medication mention extraction, event classification (indicating medication changes discussed or not), and context classification to classify medication changes context into 5 orthogonal dimensions related to drug changes. We explored 6 state-of-the-art pretrained transformer models for the three subtasks, including GatorTron, a large language model pretrained using >90 billion words of text (including >80 billion words from >290 million clinical notes identified at the University of Florida Health). We evaluated our NLP systems using annotated data and evaluation scripts provided by the 2022 n2c2 organizers.

Results:Our GatorTron models achieved the best F1-scores of 0.9828 for medication extraction (ranked 3rd), 0.9379 for event classification (ranked 2nd), and the best micro-average accuracy of 0.9126 for context classification. GatorTron outperformed existing transformer models pretrained using smaller general English text and clinical text corpora, indicating the advantage of large language models.

Conclusion: This study demonstrated the advantage of using large transformer models for contextual medication information extraction from clinical narratives.

Title: Clinical Concept and Relation Extraction Using Prompt-based Machine Reading Comprehension. (arXiv:2303.08262v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08262
Code URL: null
Copy Paste: [[2303.08262] Clinical Concept and Relation Extraction Using Prompt-based Machine Reading Comprehension](http://arxiv.org/abs/2303.08262) #extraction
Summary:
Objective: To develop a natural language processing system that solves both clinical concept extraction and relation extraction in a unified prompt-based machine reading comprehension (MRC) architecture with good generalizability for cross-institution applications.

Methods: We formulate both clinical concept extraction and relation extraction using a unified prompt-based MRC architecture and explore state-of-the-art transformer models. We compare our MRC models with existing deep learning models for concept extraction and end-to-end relation extraction using two benchmark datasets developed by the 2018 National NLP Clinical Challenges (n2c2) challenge (medications and adverse drug events) and the 2022 n2c2 challenge (relations of social determinants of health [SDoH]). We also evaluate the transfer learning ability of the proposed MRC models in a cross-institution setting. We perform error analyses and examine how different prompting strategies affect the performance of MRC models.

Results and Conclusion: The proposed MRC models achieve state-of-the-art performance for clinical concept and relation extraction on the two benchmark datasets, outperforming previous non-MRC transformer models. GatorTron-MRC achieves the best strict and lenient F1-scores for concept extraction, outperforming previous deep learning models on the two datasets by 1%~3% and 0.7%~1.3%, respectively. For end-to-end relation extraction, GatorTron-MRC and BERT-MIMIC-MRC achieve the best F1-scores, outperforming previous deep learning models by 0.9%~2.4% and 10%-11%, respectively. For cross-institution evaluation, GatorTron-MRC outperforms traditional GatorTron by 6.4% and 16% for the two datasets, respectively. The proposed method is better at handling nested/overlapped concepts, extracting relations, and has good portability for cross-institute applications.

Title: A Cross-institutional Evaluation on Breast Cancer Phenotyping NLP Algorithms on Electronic Health Records. (arXiv:2303.08448v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08448
Code URL: null
Copy Paste: [[2303.08448] A Cross-institutional Evaluation on Breast Cancer Phenotyping NLP Algorithms on Electronic Health Records](http://arxiv.org/abs/2303.08448) #extraction
Summary:
Objective: The generalizability of clinical large language models is usually ignored during the model development process. This study evaluated the generalizability of BERT-based clinical NLP models across different clinical settings through a breast cancer phenotype extraction task.

Materials and Methods: Two clinical corpora of breast cancer patients were collected from the electronic health records from the University of Minnesota and the Mayo Clinic, and annotated following the same guideline. We developed three types of NLP models (i.e., conditional random field, bi-directional long short-term memory and CancerBERT) to extract cancer phenotypes from clinical texts. The models were evaluated for their generalizability on different test sets with different learning strategies (model transfer vs. locally trained). The entity coverage score was assessed with their association with the model performances.

Results: We manually annotated 200 and 161 clinical documents at UMN and MC, respectively. The corpora of the two institutes were found to have higher similarity between the target entities than the overall corpora. The CancerBERT models obtained the best performances among the independent test sets from two clinical institutes and the permutation test set. The CancerBERT model developed in one institute and further fine-tuned in another institute achieved reasonable performance compared to the model developed on local data (micro-F1: 0.925 vs 0.932).

Conclusions: The results indicate the CancerBERT model has the best learning ability and generalizability among the three types of clinical NLP models. The generalizability of the models was found to be correlated with the similarity of the target entities between the corpora.

Title: Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!. (arXiv:2303.08559v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08559
Code URL: null
Copy Paste: [[2303.08559] Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!](http://arxiv.org/abs/2303.08559) #extraction
Summary:
Large Language Models (LLMs) have made remarkable strides in various tasks. However, whether they are competitive few-shot solvers for information extraction (IE) tasks and surpass fine-tuned small Pre-trained Language Models (SLMs) remains an open problem. This paper aims to provide a thorough answer to this problem, and moreover, to explore an approach towards effective and economical IE systems that combine the strengths of LLMs and SLMs. Through extensive experiments on eight datasets across three IE tasks, we show that LLMs are not effective few-shot information extractors in general, given their unsatisfactory performance in most settings and the high latency and budget requirements. However, we demonstrate that LLMs can well complement SLMs and effectively solve hard samples that SLMs struggle with. Building on these findings, we propose an adaptive filter-then-rerank paradigm, in which SLMs act as filters and LLMs act as rerankers. By utilizing LLMs to rerank a small portion of difficult samples identified by SLMs, our preliminary system consistently achieves promising improvements (2.1% F1-gain on average) on various IE tasks, with acceptable cost of time and money.

Title: GCRE-GPT: A Generative Model for Comparative Relation Extraction. (arXiv:2303.08601v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2303.08601
Code URL: null
Copy Paste: [[2303.08601] GCRE-GPT: A Generative Model for Comparative Relation Extraction](http://arxiv.org/abs/2303.08601) #extraction
Summary:
Given comparative text, comparative relation extraction aims to extract two targets (\eg two cameras) in comparison and the aspect they are compared for (\eg image quality). The extracted comparative relations form the basis of further opinion analysis.Existing solutions formulate this task as a sequence labeling task, to extract targets and aspects. However, they cannot directly extract comparative relation(s) from text. In this paper, we show that comparative relations can be directly extracted with high accuracy, by generative model. Based on GPT-2, we propose a Generation-based Comparative Relation Extractor (GCRE-GPT). Experiment results show that \modelname achieves state-of-the-art accuracy on two datasets.

membership infer

federate

Title: Visual Prompt Based Personalized Federated Learning. (arXiv:2303.08678v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08678
Code URL: null
Copy Paste: [[2303.08678] Visual Prompt Based Personalized Federated Learning](http://arxiv.org/abs/2303.08678) #federate
Summary:
As a popular paradigm of distributed learning, personalized federated learning (PFL) allows personalized models to improve generalization ability and robustness by utilizing knowledge from all distributed clients. Most existing PFL algorithms tackle personalization in a model-centric way, such as personalized layer partition, model regularization, and model interpolation, which all fail to take into account the data characteristics of distributed clients. In this paper, we propose a novel PFL framework for image classification tasks, dubbed pFedPT, that leverages personalized visual prompts to implicitly represent local data distribution information of clients and provides that information to the aggregation model to help with classification tasks. Specifically, in each round of pFedPT training, each client generates a local personalized prompt related to local data distribution. Then, the local model is trained on the input composed of raw data and a visual prompt to learn the distribution information contained in the prompt. During model testing, the aggregated model obtains prior knowledge of the data distributions based on the prompts, which can be seen as an adaptive fine-tuning of the aggregation model to improve model performances on different clients. Furthermore, the visual prompt can be added as an orthogonal method to implement personalization on the client for existing FL methods to boost their performance. Experiments on the CIFAR10 and CIFAR100 datasets show that pFedPT outperforms several state-of-the-art (SOTA) PFL algorithms by a large margin in various settings.

Title: Optimization Design for Federated Learning in Heterogeneous 6G Networks. (arXiv:2303.08322v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08322
Code URL: null
Copy Paste: [[2303.08322] Optimization Design for Federated Learning in Heterogeneous 6G Networks](http://arxiv.org/abs/2303.08322) #federate
Summary:
With the rapid advancement of 5G networks, billions of smart Internet of Things (IoT) devices along with an enormous amount of data are generated at the network edge. While still at an early age, it is expected that the evolving 6G network will adopt advanced artificial intelligence (AI) technologies to collect, transmit, and learn this valuable data for innovative applications and intelligent services. However, traditional machine learning (ML) approaches require centralizing the training data in the data center or cloud, raising serious user-privacy concerns. Federated learning, as an emerging distributed AI paradigm with privacy-preserving nature, is anticipated to be a key enabler for achieving ubiquitous AI in 6G networks. However, there are several system and statistical heterogeneity challenges for effective and efficient FL implementation in 6G networks. In this article, we investigate the optimization approaches that can effectively address the challenging heterogeneity issues from three aspects: incentive mechanism design, network resource management, and personalized model optimization. We also present some open problems and promising directions for future research.

fair

Title: FairAdaBN: Mitigating unfairness with adaptive batch normalization and its application to dermatological disease classification. (arXiv:2303.08325v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08325
Code URL: null
Copy Paste: [[2303.08325] FairAdaBN: Mitigating unfairness with adaptive batch normalization and its application to dermatological disease classification](http://arxiv.org/abs/2303.08325) #fair
Summary:
Deep learning is becoming increasingly ubiquitous in medical research and applications while involving sensitive information and even critical diagnosis decisions. Researchers observe a significant performance disparity among subgroups with different demographic attributes, which is called model unfairness, and put lots of effort into carefully designing elegant architectures to address unfairness, which poses heavy training burden, brings poor generalization, and reveals the trade-off between model performance and fairness. To tackle these issues, we propose FairAdaBN by making batch normalization adaptive to sensitive attribute. This simple but effective design can be adopted to several classification backbones that are originally unaware of fairness. Additionally, we derive a novel loss function that restrains statistical parity between subgroups on mini-batches, encouraging the model to converge with considerable fairness. In order to evaluate the trade-off between model performance and fairness, we propose a new metric, named Fairness-Accuracy Trade-off Efficiency (FATE), to compute normalized fairness improvement over accuracy drop. Experiments on two dermatological datasets show that our proposed method outperforms other methods on fairness criteria and FATE.

Title: Graph Neural Network Surrogates of Fair Graph Filtering. (arXiv:2303.08157v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08157
Code URL: null
Copy Paste: [[2303.08157] Graph Neural Network Surrogates of Fair Graph Filtering](http://arxiv.org/abs/2303.08157) #fair
Summary:
Graph filters that transform prior node values to posterior scores via edge propagation often support graph mining tasks affecting humans, such as recommendation and ranking. Thus, it is important to make them fair in terms of satisfying statistical parity constraints between groups of nodes (e.g., distribute score mass between genders proportionally to their representation). To achieve this while minimally perturbing the original posteriors, we introduce a filter-aware universal approximation framework for posterior objectives. This defines appropriate graph neural networks trained at runtime to be similar to filters but also locally optimize a large class of objectives, including fairness-aware ones. Experiments on a collection of 8 filters and 5 graphs show that our approach performs equally well or better than alternatives in meeting parity constraints while preserving the AUC of score-based community member recommendation and creating minimal utility loss in prior diffusion.

Title: DualFair: Fair Representation Learning at Both Group and Individual Levels via Contrastive Self-supervision. (arXiv:2303.08403v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08403
Code URL: https://github.com/sungwon-han/dualfair
Copy Paste: [[2303.08403] DualFair: Fair Representation Learning at Both Group and Individual Levels via Contrastive Self-supervision](http://arxiv.org/abs/2303.08403) #fair
Summary:
Algorithmic fairness has become an important machine learning problem, especially for mission-critical Web applications. This work presents a self-supervised model, called DualFair, that can debias sensitive attributes like gender and race from learned representations. Unlike existing models that target a single type of fairness, our model jointly optimizes for two fairness criteria - group fairness and counterfactual fairness - and hence makes fairer predictions at both the group and individual levels. Our model uses contrastive loss to generate embeddings that are indistinguishable for each protected group, while forcing the embeddings of counterfactual pairs to be similar. It then uses a self-knowledge distillation method to maintain the quality of representation for the downstream tasks. Extensive analysis over multiple datasets confirms the model's validity and further shows the synergy of jointly addressing two fairness criteria, suggesting the model's potential value in fair intelligent Web applications.

Title: Fair Off-Policy Learning from Observational Data. (arXiv:2303.08516v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08516
Code URL: null
Copy Paste: [[2303.08516] Fair Off-Policy Learning from Observational Data](http://arxiv.org/abs/2303.08516) #fair
Summary:
Businesses and organizations must ensure that their algorithmic decision-making is fair in order to meet legislative, ethical, and societal demands. For example, decision-making in automated hiring must not discriminate with respect to gender or race. To achieve this, prior research has contributed approaches that ensure algorithmic fairness in machine learning predictions, while comparatively little effort has focused on algorithmic fairness in decision models, specifically off-policy learning. In this paper, we propose a novel framework for fair off-policy learning: we learn decision rules from observational data under different notions of fairness, where we explicitly assume that observational data were collected under a different -- potentially biased -- behavioral policy. For this, we first formalize different fairness notions for off-policy learning. We then propose a machine learning approach to learn optimal policies under these fairness notions. Specifically, we reformulate the fairness notions into unconstrained learning objectives that can be estimated from finite samples. Here, we leverage machine learning to minimize the objective constrained on a fair representation of the data, so that the resulting policies satisfy our fairness notions. We further provide theoretical guarantees in form of generalization bounds for the finite-sample version of our framework. We demonstrate the effectiveness of our framework through extensive numerical experiments using both simulated and real-world data. As a result, our work enables algorithmic decision-making in a wide array of practical applications where fairness must ensured.

interpretability

explainability

watermark

diffusion

Title: Decomposed Diffusion Models for High-Quality Video Generation. (arXiv:2303.08320v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08320
Code URL: null
Copy Paste: [[2303.08320] Decomposed Diffusion Models for High-Quality Video Generation](http://arxiv.org/abs/2303.08320) #diffusion
Summary:
A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging due to the high dimensional data space. Previous methods usually adopt a standard diffusion process, where frames in the same video clip are destroyed with independent noises, ignoring the content redundancy and temporal correlation. This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly-learned networks to match the noise decomposition accordingly. Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and well-support text-conditioned video creation.

Title: DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception. (arXiv:2303.08333v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08333
Code URL: https://github.com/jiayuzou2020/diffbev
Copy Paste: [[2303.08333] DiffBEV: Conditional Diffusion Model for Bird's Eye View Perception](http://arxiv.org/abs/2303.08333) #diffusion
Summary:
BEV perception is of great importance in the field of autonomous driving, serving as the cornerstone of planning, controlling, and motion prediction. The quality of the BEV feature highly affects the performance of BEV perception. However, taking the noises in camera parameters and LiDAR scans into consideration, we usually obtain BEV representation with harmful noises. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to utilize the diffusion model to get a better BEV representation. In this work, we propose an end-to-end framework, named DiffBEV, to exploit the potential of diffusion model to generate a more comprehensive BEV representation. To the best of our knowledge, we are the first to apply diffusion model to BEV perception. In practice, we design three types of conditions to guide the training of the diffusion model which denoises the coarse samples and refines the semantic feature in a progressive way. What's more, a cross-attention module is leveraged to fuse the context of BEV feature and the semantic content of conditional diffusion model. DiffBEV achieves a 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the best-performing existing approach. Quantitative and qualitative results on multiple benchmarks demonstrate the effectiveness of DiffBEV in BEV semantic segmentation and 3D object detection tasks. The code will be available soon.

Title: Uncertainty-Aware Pedestrian Trajectory Prediction via Distributional Diffusion. (arXiv:2303.08367v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08367
Code URL: null
Copy Paste: [[2303.08367] Uncertainty-Aware Pedestrian Trajectory Prediction via Distributional Diffusion](http://arxiv.org/abs/2303.08367) #diffusion
Summary:
Tremendous efforts have been devoted to pedestrian trajectory prediction using generative modeling for accommodating uncertainty and multi-modality in human behaviors. An individual's inherent uncertainty, e.g., change of destination, can be masked by complex patterns resulting from the movements of interacting pedestrians. However, latent variable-based generative models often entangle such uncertainty with complexity, leading to either limited expressivity or overconfident predictions. In this work, we propose to separately model these two factors by implicitly deriving a flexible distribution that describes complex pedestrians' movements, whereas incorporating predictive uncertainty of individuals with explicit density functions over their future locations. More specifically, we present an uncertainty-aware pedestrian trajectory prediction framework, parameterizing sufficient statistics for the distributions of locations that jointly comprise the multi-modal trajectories. We further estimate these parameters of interest by approximating a denoising process that progressively recovers pedestrian movements from noise. Unlike prior studies, we translate the predictive stochasticity to the explicit distribution, making it readily used to generate plausible future trajectories indicating individuals' self-uncertainty. Moreover, our framework is model-agnostic for compatibility with different neural network architectures. We empirically show the performance advantages of our framework on widely-used benchmarks, outperforming state-of-the-art in most scenes even with lighter backbones.

Title: The Devil's Advocate: Shattering the Illusion of Unexploitable Data using Diffusion Models. (arXiv:2303.08500v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08500
Code URL: null
Copy Paste: [[2303.08500] The Devil's Advocate: Shattering the Illusion of Unexploitable Data using Diffusion Models](http://arxiv.org/abs/2303.08500) #diffusion
Summary:
Protecting personal data against the exploitation of machine learning models is of paramount importance. Recently, availability attacks have shown great promise to provide an extra layer of protection against the unauthorized use of data to train neural networks. These methods aim to add imperceptible noise to clean data so that the neural networks cannot extract meaningful patterns from the protected data, claiming that they can make personal data "unexploitable." In this paper, we provide a strong countermeasure against such approaches, showing that unexploitable data might only be an illusion. In particular, we leverage the power of diffusion models and show that a carefully designed denoising process can defuse the ramifications of the data-protecting perturbations. We rigorously analyze our algorithm, and theoretically prove that the amount of required denoising is directly related to the magnitude of the data-protecting perturbations. Our approach, called AVATAR, delivers state-of-the-art performance against a suite of recent availability attacks in various scenarios, outperforming adversarial training. Our findings call for more research into making personal data unexploitable, showing that this goal is far from over.

Title: Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer. (arXiv:2303.08622v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08622
Code URL: null
Copy Paste: [[2303.08622] Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer](http://arxiv.org/abs/2303.08622) #diffusion
Summary:
Diffusion models have shown great promise in text-guided image style transfer, but there is a trade-off between style transformation and content preservation due to their stochastic nature. Existing methods require computationally expensive fine-tuning of diffusion models or additional neural network. To address this, here we propose a zero-shot contrastive loss for diffusion models that doesn't require additional fine-tuning or auxiliary networks. By leveraging patch-wise contrastive loss between generated samples and original image embeddings in the pre-trained diffusion model, our method can generate images with the same semantic content as the source image in a zero-shot manner. Our approach outperforms existing methods while preserving content and requiring no additional training, not only for image style transfer but also for image-to-image translation and manipulation. Our experimental results validate the effectiveness of our proposed method.

Title: ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution. (arXiv:2303.08714v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08714
Code URL: null
Copy Paste: [[2303.08714] ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution](http://arxiv.org/abs/2303.08714) #diffusion
Summary:
Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to the common diffusion-based methods that directly use LR images to guide the noise towards HR space, ResDiff utilizes the CNN's initial prediction to direct the noise towards the residual space between HR space and CNN-predicted space, which not only accelerates the generation process but also acquires superior sample quality. Additionally, a frequency-domain-based loss function for CNN is introduced to facilitate its restoration, and a frequency-domain guided diffusion is designed for DPM on behalf of predicting high-frequency details. The extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.

Title: DiffusionAD: Denoising Diffusion for Anomaly Detection. (arXiv:2303.08730v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08730
Code URL: https://github.com/HuiZhang0812/DiffusionAD-Denoising-Diffusion-for-Anomaly-Detection
Copy Paste: [[2303.08730] DiffusionAD: Denoising Diffusion for Anomaly Detection](http://arxiv.org/abs/2303.08730) #diffusion
Summary:
Anomaly detection is widely applied due to its remarkable effectiveness and efficiency in meeting the needs of real-world industrial manufacturing. We introduce a new pipeline, DiffusionAD, to anomaly detection. We frame anomaly detection as a ``noise-to-norm'' paradigm, in which anomalies are identified as inconsistencies between a query image and its flawless approximation. Our pipeline achieves this by restoring the anomalous regions from the noisy corrupted query image while keeping the normal regions unchanged. DiffusionAD includes a denoising sub-network and a segmentation sub-network, which work together to provide intuitive anomaly detection and localization in an end-to-end manner, without the need for complicated post-processing steps. Remarkably, during inference, this framework delivers satisfactory performance with just one diffusion reverse process step, which is tens to hundreds of times faster than general diffusion methods. Extensive evaluations on standard and challenging benchmarks including VisA and DAGM show that DiffusionAD outperforms current state-of-the-art paradigms, demonstrating the effectiveness and generalizability of the proposed pipeline.

Title: Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion. (arXiv:2303.08767v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2303.08767
Code URL: null
Copy Paste: [[2303.08767] Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion](http://arxiv.org/abs/2303.08767) #diffusion
Summary:
Diffusion models have shown superior performance in image generation and manipulation, but the inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches like DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding by decomposing the CLIP embedding space for personalization and content manipulation. Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.

Title: On the uncertainty analysis of the data-enabled physics-informed neural network for solving neutron diffusion eigenvalue problem. (arXiv:2303.08455v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08455
Code URL: null
Copy Paste: [[2303.08455] On the uncertainty analysis of the data-enabled physics-informed neural network for solving neutron diffusion eigenvalue problem](http://arxiv.org/abs/2303.08455) #diffusion
Summary:
In practical engineering experiments, the data obtained through detectors are inevitably noisy. For the already proposed data-enabled physics-informed neural network (DEPINN) \citep{DEPINN}, we investigate the performance of DEPINN in calculating the neutron diffusion eigenvalue problem from several perspectives when the prior data contain different scales of noise. Further, in order to reduce the effect of noise and improve the utilization of the noisy prior data, we propose innovative interval loss functions and give some rigorous mathematical proofs. The robustness of DEPINN is examined on two typical benchmark problems through a large number of numerical results, and the effectiveness of the proposed interval loss function is demonstrated by comparison. This paper confirms the feasibility of the improved DEPINN for practical engineering applications in nuclear reactor physics.

Title: Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. (arXiv:2303.08797v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2303.08797
Code URL: null
Copy Paste: [[2303.08797] Stochastic Interpolants: A Unifying Framework for Flows and Diffusions](http://arxiv.org/abs/2303.08797) #diffusion
Summary:
We introduce a class of generative models based on the stochastic interpolant framework proposed in Albergo & Vanden-Eijnden (2023) that unifies flow-based and diffusion-based methods. We first show how to construct a broad class of continuous-time stochastic processes whose time-dependent probability density function bridges two arbitrary densities exactly in finite time. These `stochastic interpolants' are built by combining data from the two densities with an additional latent variable, and the specific details of the construction can be leveraged to shape the resulting time-dependent density in a flexible way. We then show that the time-dependent density of the stochastic interpolant satisfies a first-order transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion; upon consideration of the time evolution of an individual sample, this viewpoint immediately leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with a tunable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score of the interpolant density. Remarkably, we show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics; by contrast, we show that generative models based upon a deterministic dynamics must, in addition, control the Fisher divergence between the target and the model. Finally, we construct estimators for the likelihood and the cross-entropy of interpolant-based generative models, and demonstrate that such models recover the Schr\"odinger bridge between the two target densities when explicitly optimizing over the interpolant.