secure

Title: TimeClave: Oblivious In-enclave Time series Processing System. (arXiv:2306.16652v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.16652
Code URL: null
Copy Paste: [[2306.16652] TimeClave: Oblivious In-enclave Time series Processing System](http://arxiv.org/abs/2306.16652) #secure
Summary:
Cloud platforms are widely adopted by many systems, such as time series processing systems, to store and process massive amounts of sensitive time series data. Unfortunately, several incidents have shown that cloud platforms are vulnerable to internal and external attacks that lead to critical data breaches. Adopting cryptographic protocols such as homomorphic encryption and secure multi-party computation adds high computational and network overhead to query operations.

We present TimeClave, a fully oblivious in-enclave time series processing system: TimeClave leverages Intel SGX to support aggregate statistics on time series with minimal memory consumption inside the enclave. To hide the access pattern inside the enclave, we introduce a non-blocking read-optimised ORAM named RoORAM. TimeClave integrates RoORAM to obliviously and securely handle client queries with high performance. With an aggregation time interval of $10s$, $2^{14}$ summarised data blocks and 8 aggregate functions, TimeClave run point query in $0.03ms$ and a range query of 50 intervals in $0.46ms$. Compared to the ORAM baseline, TimeClave achieves lower query latency by up to $2.5\times$ and up to $2\times$ throughput, with up to 22K queries per second.

security

Title: Towards Grammatical Tagging for the Legal Language of Cybersecurity. (arXiv:2306.17042v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.17042
Code URL: null
Copy Paste: [[2306.17042] Towards Grammatical Tagging for the Legal Language of Cybersecurity](http://arxiv.org/abs/2306.17042) #security
Summary:
Legal language can be understood as the language typically used by those engaged in the legal profession and, as such, it may come both in spoken or written form. Recent legislation on cybersecurity obviously uses legal language in writing, thus inheriting all its interpretative complications due to the typical abundance of cases and sub-cases as well as to the general richness in detail. This paper faces the challenge of the essential interpretation of the legal language of cybersecurity, namely of the extraction of the essential Parts of Speech (POS) from the legal documents concerning cybersecurity. The challenge is overcome by our methodology for POS tagging of legal language. It leverages state-of-the-art open-source tools for Natural Language Processing (NLP) as well as manual analysis to validate the outcomes of the tools. As a result, the methodology is automated and, arguably, general for any legal language following minor tailoring of the preprocessing step. It is demonstrated over the most relevant EU legislation on cybersecurity, namely on the NIS 2 directive, producing the first, albeit essential, structured interpretation of such a relevant document. Moreover, our findings indicate that tools such as SpaCy and ClausIE reach their limits over the legal language of the NIS 2.

Title: BLEND: Efficient and blended IoT data storage and communication with application layer security. (arXiv:2306.16540v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.16540
Code URL: null
Copy Paste: [[2306.16540] BLEND: Efficient and blended IoT data storage and communication with application layer security](http://arxiv.org/abs/2306.16540) #security
Summary:
Many IoT use cases demand both secure storage and secure communication. Resource-constrained devices cannot afford having one set of crypto protocols for storage and another for communication. Lightweight application layer security standards are being developed for IoT communication. Extending these protocols for secure storage can significantly reduce communication latency and local processing.

We present BLEND, combining secure storage and communication by storing IoT data as pre-computed encrypted network packets. Unlike local methods, BLEND not only eliminates separate crypto for secure storage needs, but also eliminates a need for real-time crypto operations, reducing the communication latency significantly. Our evaluation shows that compared with a local solution, BLEND reduces send latency from 630 microseconds to 110 microseconds per packet. BLEND enables PKI based key management while being sufficiently lightweight for IoT. BLEND doesn't need modifications to communication standards used when extended for secure storage, and can therefore preserve underlying protocols' security guarantees.

Title: Blockchain in Oil and Gas Supply Chain: A Literature Review from User Security and Privacy Perspective. (arXiv:2306.16576v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.16576
Code URL: null
Copy Paste: [[2306.16576] Blockchain in Oil and Gas Supply Chain: A Literature Review from User Security and Privacy Perspective](http://arxiv.org/abs/2306.16576) #security
Summary:
Blockchain's influence extends beyond finance, impacting diverse sectors such as real estate, oil and gas, and education. This extensive reach stems from blockchain's intrinsic ability to reliably manage digital transactions and supply chains. Within the oil and gas sector, the merger of blockchain with supply chain management and data handling is a notable trend. The supply chain encompasses several operations: extraction, transportation, trading, and distribution of resources. Unfortunately, the current supply chain structure misses critical features such as transparency, traceability, flexible trading, and secure data storage - all of which blockchain can provide. Nevertheless, it is essential to investigate blockchain's security and privacy in the oil and gas industry. Such scrutiny enables the smooth, secure, and usable execution of transactions. For this purpose, we reviewed 124 peer-reviewed academic publications, conducting an in-depth analysis of 21 among them. We classified the articles by their relevance to various phases of the supply chain flow: upstream, midstream, downstream, and data management. Despite blockchain's potential to address existing security and privacy voids in the supply chain, there is a significant lack of practical implementation of blockchain integration in oil and gas operations. This deficiency substantially challenges the transition from conventional methods to a blockchain-centric approach.

Title: A Survey on Enterprise Network Security: Asset Behavioral Monitoring and Distributed Attack Detection. (arXiv:2306.16675v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.16675
Code URL: null
Copy Paste: [[2306.16675] A Survey on Enterprise Network Security: Asset Behavioral Monitoring and Distributed Attack Detection](http://arxiv.org/abs/2306.16675) #security
Summary:
Enterprise networks that host valuable assets and services are popular and frequent targets of distributed network attacks. In order to cope with the ever-increasing threats, industrial and research communities develop systems and methods to monitor the behaviors of their assets and protect them from critical attacks. In this paper, we systematically survey related research articles and industrial systems to highlight the current status of this arms race in enterprise network security. First, we discuss the taxonomy of distributed network attacks on enterprise assets, including distributed denial-of-service (DDoS) and reconnaissance attacks. Second, we review existing methods in monitoring and classifying network behavior of enterprise hosts to verify their benign activities and isolate potential anomalies. Third, state-of-the-art detection methods for distributed network attacks sourced from external attackers are elaborated, highlighting their merits and bottlenecks. Fourth, as programmable networks and machine learning (ML) techniques are increasingly becoming adopted by the community, their current applications in network security are discussed. Finally, we highlight several research gaps on enterprise network security to inspire future research.

Title: SWAT: A System-Wide Approach to Tunable Leakage Mitigation in Encrypted Data Stores. (arXiv:2306.16851v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.16851
Code URL: null
Copy Paste: [[2306.16851] SWAT: A System-Wide Approach to Tunable Leakage Mitigation in Encrypted Data Stores](http://arxiv.org/abs/2306.16851) #security
Summary:
Numerous studies have underscored the significant privacy risks associated with various leakage patterns in encrypted data stores. Most existing systems that conceal leakage either (1) incur substantial overheads, (2) focus on specific subsets of leakage patterns, or (3) apply the same security notion across various workloads, thereby impeding the attainment of fine-tuned privacy-efficiency trade-offs. In light of various detrimental leakage patterns, this paper starts with an investigation into which specific leakage patterns require our focus respectively in the contexts of key-value, range-query, and dynamic workloads. Subsequently, we introduce new security notions tailored to the specific privacy requirements of these workloads. Accordingly, we present, SWAT, an efficient construction that progressively enables these workloads, while provably mitigating system-wide leakage via a suite of algorithms with tunable privacy-efficiency trade-offs. We conducted extensive experiments and compiled a detailed result analysis, showing the efficiency of our solution. SWAT is about $10.6\times$ slower than an encryption-only data store that reveals various leakage patterns and is $31.6\times$ faster than a trivially zero-leakage solution. Meanwhile, the performance of SWAT remains highly competitive compared to other designs that mitigate specific types of leakage.

Title: VibHead: An Authentication Scheme for Smart Headsets through Vibration. (arXiv:2306.17002v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.17002
Code URL: null
Copy Paste: [[2306.17002] VibHead: An Authentication Scheme for Smart Headsets through Vibration](http://arxiv.org/abs/2306.17002) #security
Summary:
Recent years have witnessed the fast penetration of Virtual Reality (VR) and Augmented Reality (AR) systems into our daily life, the security and privacy issues of the VR/AR applications have been attracting considerable attention. Most VR/AR systems adopt head-mounted devices (i.e., smart headsets) to interact with users and the devices usually store the users' private data. Hence, authentication schemes are desired for the head-mounted devices. Traditional knowledge-based authentication schemes for general personal devices have been proved vulnerable to shoulder-surfing attacks, especially considering the headsets may block the sight of the users. Although the robustness of the knowledge-based authentication can be improved by designing complicated secret codes in virtual space, this approach induces a compromise of usability. Another choice is to leverage the users' biometrics; however, it either relies on highly advanced equipments which may not always be available in commercial headsets or introduce heavy cognitive load to users.

In this paper, we propose a vibration-based authentication scheme, VibHead, for smart headsets. Since the propagation of vibration signals through human heads presents unique patterns for different individuals, VibHead employs a CNN-based model to classify registered legitimate users based the features extracted from the vibration signals. We also design a two-step authentication scheme where the above user classifiers are utilized to distinguish the legitimate user from illegitimate ones. We implement VibHead on a Microsoft HoloLens equipped with a linear motor and an IMU sensor which are commonly used in off-the-shelf personal smart devices. According to the results of our extensive experiments, with short vibration signals ($\leq 1s$), VibHead has an outstanding authentication accuracy; both FAR and FRR are around 5%.

Title: RowPress: Amplifying Read Disturbance in Modern DRAM Chips. (arXiv:2306.17061v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.17061
Code URL: null
Copy Paste: [[2306.17061] RowPress: Amplifying Read Disturbance in Modern DRAM Chips](http://arxiv.org/abs/2306.17061) #security
Summary:
Memory isolation is critical for system reliability, security, and safety. Unfortunately, read disturbance can break memory isolation in modern DRAM chips. For example, RowHammer is a well-studied read-disturb phenomenon where repeatedly opening and closing (i.e., hammering) a DRAM row many times causes bitflips in physically nearby rows.

This paper experimentally demonstrates and analyzes another widespread read-disturb phenomenon, RowPress, in real DDR4 DRAM chips. RowPress breaks memory isolation by keeping a DRAM row open for a long period of time, which disturbs physically nearby rows enough to cause bitflips. We show that RowPress amplifies DRAM's vulnerability to read-disturb attacks by significantly reducing the number of row activations needed to induce a bitflip by one to two orders of magnitude under realistic conditions. In extreme cases, RowPress induces bitflips in a DRAM row when an adjacent row is activated only once. Our detailed characterization of 164 real DDR4 DRAM chips shows that RowPress 1) affects chips from all three major DRAM manufacturers, 2) gets worse as DRAM technology scales down to smaller node sizes, and 3) affects a different set of DRAM cells from RowHammer and behaves differently from RowHammer as temperature and access pattern changes.

We demonstrate in a real DDR4-based system with RowHammer protection that 1) a user-level program induces bitflips by leveraging RowPress while conventional RowHammer cannot do so, and 2) a memory controller that adaptively keeps the DRAM row open for a longer period of time based on access pattern can facilitate RowPress-based attacks. To prevent bitflips due to RowPress, we describe and evaluate a new methodology that adapts existing RowHammer mitigation techniques to also mitigate RowPress with low additional performance overhead. We open source all our code and data to facilitate future research on RowPress.

Title: ItyFuzz: Snapshot-Based Fuzzer for Smart Contract. (arXiv:2306.17135v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.17135
Code URL: null
Copy Paste: [[2306.17135] ItyFuzz: Snapshot-Based Fuzzer for Smart Contract](http://arxiv.org/abs/2306.17135) #security
Summary:
Smart contracts are critical financial instruments, and their security is of utmost importance. However, smart contract programs are difficult to fuzz due to the persistent blockchain state behind all transactions. Mutating sequences of transactions are complex and often lead to a suboptimal exploration for both input and program spaces. In this paper, we introduce a novel snapshot-based fuzzer ItyFuzz for testing smart contracts. In ItyFuzz, instead of storing sequences of transactions and mutating from them, we snapshot states and singleton transactions. To explore interesting states, ItyFuzz introduces a dataflow waypoint mechanism to identify states with more potential momentum. ItyFuzz also incorporates comparison waypoints to prune the space of states. By maintaining snapshots of the states, ItyFuzz can synthesize concrete exploits like reentrancy attacks quickly. Because ItyFuzz has second-level response time to test a smart contract, it can be used for on-chain testing, which has many benefits compared to local development testing. Finally, we evaluate ItyFuzz on real-world smart contracts and some hacked on-chain DeFi projects. ItyFuzz outperforms existing fuzzers in terms of instructional coverage and can find and generate realistic exploits for on-chain projects quickly.

privacy

Title: milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing. (arXiv:2306.17010v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17010
Code URL: null
Copy Paste: [[2306.17010] milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing](http://arxiv.org/abs/2306.17010) #privacy
Summary:
Approaching the era of ubiquitous computing, human motion sensing plays a crucial role in smart systems for decision making, user interaction, and personalized services. Extensive research has been conducted on human tracking, pose estimation, gesture recognition, and activity recognition, which are predominantly based on cameras in traditional methods. However, the intrusive nature of cameras limits their use in smart home applications. To address this, mmWave radars have gained popularity due to their privacy-friendly features. In this work, we propose \textit{milliFlow}, a novel deep learning method for scene flow estimation as a complementary motion information for mmWave point cloud, serving as an intermediate level of features and directly benefiting downstream human motion sensing tasks. Experimental results demonstrate the superior performance of our method with an average 3D endpoint error of 4.6cm, significantly surpassing the competing approaches. Furthermore, by incorporating scene flow information, we achieve remarkable improvements in human activity recognition, human parsing, and human body part tracking. To foster further research in this area, we provide our codebase and dataset for open access.

Title: Towards Blockchain-Assisted Privacy-Aware Data Sharing For Edge Intelligence: A Smart Healthcare Perspective. (arXiv:2306.16630v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.16630
Code URL: null
Copy Paste: [[2306.16630] Towards Blockchain-Assisted Privacy-Aware Data Sharing For Edge Intelligence: A Smart Healthcare Perspective](http://arxiv.org/abs/2306.16630) #privacy
Summary:
The popularization of intelligent healthcare devices and big data analytics significantly boosts the development of smart healthcare networks (SHNs). To enhance the precision of diagnosis, different participants in SHNs share health data that contains sensitive information. Therefore, the data exchange process raises privacy concerns, especially when the integration of health data from multiple sources (linkage attack) results in further leakage. Linkage attack is a type of dominant attack in the privacy domain, which can leverage various data sources for private data mining. Furthermore, adversaries launch poisoning attacks to falsify the health data, which leads to misdiagnosing or even physical damage. To protect private health data, we propose a personalized differential privacy model based on the trust levels among users. The trust is evaluated by a defined community density, while the corresponding privacy protection level is mapped to controllable randomized noise constrained by differential privacy. To avoid linkage attacks in personalized differential privacy, we designed a noise correlation decoupling mechanism using a Markov stochastic process. In addition, we build the community model on a blockchain, which can mitigate the risk of poisoning attacks during differentially private data transmission over SHNs. To testify the effectiveness and superiority of the proposed approach, we conduct extensive experiments on benchmark datasets.

Title: Honesty is the Best Policy: On the Accuracy of Apple Privacy Labels Compared to Apps' Privacy Policies. (arXiv:2306.17063v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.17063
Code URL: null
Copy Paste: [[2306.17063] Honesty is the Best Policy: On the Accuracy of Apple Privacy Labels Compared to Apps' Privacy Policies](http://arxiv.org/abs/2306.17063) #privacy
Summary:
Apple introduced \textit{privacy labels} in Dec. 2020 as a way for developers to report the privacy behaviors of their apps. While Apple does not validate labels, they do also require developers to provide a privacy policy, which offers an important comparison point. In this paper, we applied the NLP framework of Polisis to extract features of the privacy policy for 515,920 apps on the iOS App Store comparing the output to the privacy labels. We identify discrepancies between the policies and the labels, particularly as it relates to data collected that is linked to users. We find that 287$\pm196$K apps' privacy policies may indicate data collection that is linked to users than what is reported in the privacy labels. More alarming, a large number of (97$\pm30$\%) of the apps that have {\em Data Not Collected} privacy label have a privacy policy that indicates otherwise. We provide insights into potential sources for discrepancies, including the use of templates and confusion around Apple's definitions and requirements. These results suggest that there is still significant work to be done to help developers more accurately labeling their apps. Incorporating a Polisis-like system as a first-order check can help improve the current state and better inform developers when there are possible misapplication of privacy labels.

protect

Title: NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes. (arXiv:2306.16869v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16869
Code URL: https://github.com/ibm/neuralfuse
Copy Paste: [[2306.16869] NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes](http://arxiv.org/abs/2306.16869) #protect
Summary:
Deep neural networks (DNNs) have become ubiquitous in machine learning, but their energy consumption remains a notable issue. Lowering the supply voltage is an effective strategy for reducing energy consumption. However, aggressively scaling down the supply voltage can lead to accuracy degradation due to random bit flips in static random access memory (SRAM) where model parameters are stored. To address this challenge, we introduce NeuralFuse, a novel add-on module that addresses the accuracy-energy tradeoff in low-voltage regimes by learning input transformations to generate error-resistant data representations. NeuralFuse protects DNN accuracy in both nominal and low-voltage scenarios. Moreover, NeuralFuse is easy to implement and can be readily applied to DNNs with limited access, such as non-configurable hardware or remote access to cloud-based APIs. Experimental results demonstrate that, at a 1% bit error rate, NeuralFuse can reduce SRAM memory access energy by up to 24% while improving accuracy by up to 57%. To the best of our knowledge, this is the first model-agnostic approach (i.e., no model retraining) to address low-voltage-induced bit errors. The source code is available at https://github.com/IBM/NeuralFuse.

defense

Title: Defending Black-box Classifiers by Bayesian Boundary Correction. (arXiv:2306.16979v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16979
Code URL: null
Copy Paste: [[2306.16979] Defending Black-box Classifiers by Bayesian Boundary Correction](http://arxiv.org/abs/2306.16979) #defense
Summary:
Classifiers based on deep neural networks have been recently challenged by Adversarial Attack, where the widely existing vulnerability has invoked the research in defending them from potential threats. Given a vulnerable classifier, existing defense methods are mostly white-box and often require re-training the victim under modified loss functions/training regimes. While the model/data/training specifics of the victim are usually unavailable to the user, re-training is unappealing, if not impossible for reasons such as limited computational resources. To this end, we propose a new black-box defense framework. It can turn any pre-trained classifier into a resilient one with little knowledge of the model specifics. This is achieved by new joint Bayesian treatments on the clean data, the adversarial examples and the classifier, for maximizing their joint probability. It is further equipped with a new post-train strategy which keeps the victim intact. We name our framework Bayesian Boundary Correction (BBC). BBC is a general and flexible framework that can easily adapt to different data types. We instantiate BBC for image classification and skeleton-based human activity recognition, for both static and dynamic data. Exhaustive evaluation shows that BBC has superior robustness and can enhance robustness without severely hurting the clean accuracy, compared with existing defense methods.

attack

Title: Towards Optimal Randomized Strategies in Adversarial Example Game. (arXiv:2306.16738v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16738
Code URL: null
Copy Paste: [[2306.16738] Towards Optimal Randomized Strategies in Adversarial Example Game](http://arxiv.org/abs/2306.16738) #attack
Summary:
The vulnerability of deep neural network models to adversarial example attacks is a practical challenge in many artificial intelligence applications. A recent line of work shows that the use of randomization in adversarial training is the key to find optimal strategies against adversarial example attacks. However, in a fully randomized setting where both the defender and the attacker can use randomized strategies, there are no efficient algorithm for finding such an optimal strategy. To fill the gap, we propose the first algorithm of its kind, called FRAT, which models the problem with a new infinite-dimensional continuous-time flow on probability distribution spaces. FRAT maintains a lightweight mixture of models for the defender, with flexibility to efficiently update mixing weights and model parameters at each iteration. Furthermore, FRAT utilizes lightweight sampling subroutines to construct a random strategy for the attacker. We prove that the continuous-time limit of FRAT converges to a mixed Nash equilibria in a zero-sum game formed by a defender and an attacker. Experimental results also demonstrate the efficiency of FRAT on CIFAR-10 and CIFAR-100 datasets.

robust

Title: Does Saliency-Based Training bring Robustness for Deep Neural Networks in Image Classification?. (arXiv:2306.16581v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16581
Code URL: null
Copy Paste: [[2306.16581] Does Saliency-Based Training bring Robustness for Deep Neural Networks in Image Classification?](http://arxiv.org/abs/2306.16581) #robust
Summary:
Deep Neural Networks are powerful tools to understand complex patterns and making decisions. However, their black-box nature impedes a complete understanding of their inner workings. While online saliency-guided training methods try to highlight the prominent features in the model's output to alleviate this problem, it is still ambiguous if the visually explainable features align with robustness of the model against adversarial examples. In this paper, we investigate the saliency trained model's vulnerability to adversarial examples methods. Models are trained using an online saliency-guided training method and evaluated against popular algorithms of adversarial examples. We quantify the robustness and conclude that despite the well-explained visualizations in the model's output, the salient models suffer from the lower performance against adversarial examples attacks.

Title: GuidedMixup: An Efficient Mixup Strategy Guided by Saliency Maps. (arXiv:2306.16612v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16612
Code URL: null
Copy Paste: [[2306.16612] GuidedMixup: An Efficient Mixup Strategy Guided by Saliency Maps](http://arxiv.org/abs/2306.16612) #robust
Summary:
Data augmentation is now an essential part of the image training process, as it effectively prevents overfitting and makes the model more robust against noisy datasets. Recent mixing augmentation strategies have advanced to generate the mixup mask that can enrich the saliency information, which is a supervisory signal. However, these methods incur a significant computational burden to optimize the mixup mask. From this motivation, we propose a novel saliency-aware mixup method, GuidedMixup, which aims to retain the salient regions in mixup images with low computational overhead. We develop an efficient pairing algorithm that pursues to minimize the conflict of salient regions of paired images and achieve rich saliency in mixup images. Moreover, GuidedMixup controls the mixup ratio for each pixel to better preserve the salient region by interpolating two paired images smoothly. The experiments on several datasets demonstrate that GuidedMixup provides a good trade-off between augmentation overhead and generalization performance on classification datasets. In addition, our method shows good performance in experiments with corrupted or reduced datasets.

Title: Group-based Robustness: A General Framework for Customized Robustness in the Real World. (arXiv:2306.16614v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16614
Code URL: null
Copy Paste: [[2306.16614] Group-based Robustness: A General Framework for Customized Robustness in the Real World](http://arxiv.org/abs/2306.16614) #robust
Summary:
Machine-learning models are known to be vulnerable to evasion attacks that perturb model inputs to induce misclassifications. In this work, we identify real-world scenarios where the true threat cannot be assessed accurately by existing attacks. Specifically, we find that conventional metrics measuring targeted and untargeted robustness do not appropriately reflect a model's ability to withstand attacks from one set of source classes to another set of target classes. To address the shortcomings of existing methods, we formally define a new metric, termed group-based robustness, that complements existing metrics and is better-suited for evaluating model performance in certain attack scenarios. We show empirically that group-based robustness allows us to distinguish between models' vulnerability against specific threat models in situations where traditional robustness metrics do not apply. Moreover, to measure group-based robustness efficiently and accurately, we 1) propose two loss functions and 2) identify three new attack strategies. We show empirically that with comparable success rates, finding evasive samples using our new loss functions saves computation by a factor as large as the number of targeted classes, and finding evasive samples using our new attack strategies saves time by up to 99\% compared to brute-force search methods. Finally, we propose a defense method that increases group-based robustness by up to 3.52$\times$.

Title: Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train. (arXiv:2306.16741v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16741
Code URL: https://github.com/med-air/endo-fm
Copy Paste: [[2306.16741] Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train](http://arxiv.org/abs/2306.16741) #robust
Summary:
Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer, which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views via a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset overall consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downtream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art self-supervised pre-training and adapter-based transfer learning methods by a significant margin, such as VCL (3.1% F1 for classification, 4.8% Dice for segmentation, and 5.5% F1 for detection) and ST-Adapter (5.9% F1 for classification, 9.6% Dice for segmentation, and 9.9% F1 for detection). Code, datasets, and models are released at https://github.com/med-air/Endo-FM.

Title: CLIPAG: Towards Generator-Free Text-to-Image Generation. (arXiv:2306.16805v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16805
Code URL: null
Copy Paste: [[2306.16805] CLIPAG: Towards Generator-Free Text-to-Image Generation](http://arxiv.org/abs/2306.16805) #robust
Summary:
Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in robust image classification models, wherein their input gradients align with human perception and pose semantic meanings. While this phenomenon has gained significant research attention, it was solely studied in the context of unimodal vision-only architectures. In this work, we extend the study of PAG to Vision-Language architectures, which form the foundations for diverse image-text tasks and applications. Through an adversarial robustification finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit PAG in contrast to their vanilla counterparts. This work reveals the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we show that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to substantial improvements in vision-language generative applications. Furthermore, leveraging its PAG property, CLIPAG enables text-to-image generation without any generative model, which typically requires huge generators.

Title: The Drunkard's Odometry: Estimating Camera Motion in Deforming Scenes. (arXiv:2306.16917v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16917
Code URL: https://github.com/UZ-SLAMLab/DrunkardsOdometry
Copy Paste: [[2306.16917] The Drunkard's Odometry: Estimating Camera Motion in Deforming Scenes](http://arxiv.org/abs/2306.16917) #robust
Summary:
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume to observe also static scene parts besides deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. Deformable odometry and SLAM pipelines, which tackle the most challenging scenario of exploratory trajectories, suffer from a lack of robustness and proper quantitative evaluation methodologies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings lets us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality. We further present a novel deformable odometry method, dubbed the Drunkard's Odometry, which decomposes optical flow estimates into rigid-body camera motion and non-rigid scene deformations. In order to validate our data, our work contains an evaluation of several baselines as well as a novel tracking error metric which does not require ground truth data. Dataset and code: https://davidrecasens.github.io/TheDrunkard'sOdometry/

Title: Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method. (arXiv:2306.16950v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16950
Code URL: https://github.com/D-ST-Sword/ATD
Copy Paste: [[2306.16950] Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method](http://arxiv.org/abs/2306.16950) #robust
Summary:
Feature alignment is the primary means of fusing multimodal data. We propose a feature alignment method that fully fuses multimodal information, which alternately shifts and expands feature information from different modalities to have a consistent representation in a feature space. The proposed method can robustly capture high-level interactions between features of different modalities, thus significantly improving the performance of multimodal learning. We also show that the proposed method outperforms other popular multimodal schemes on multiple tasks. Experimental evaluation of ETT and MIT-BIH-Arrhythmia, datasets shows that the proposed method achieves state of the art performance.

Title: Integrating Large Pre-trained Models into Multimodal Named Entity Recognition with Evidential Fusion. (arXiv:2306.16991v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16991
Code URL: null
Copy Paste: [[2306.16991] Integrating Large Pre-trained Models into Multimodal Named Entity Recognition with Evidential Fusion](http://arxiv.org/abs/2306.16991) #robust
Summary:
Multimodal Named Entity Recognition (MNER) is a crucial task for information extraction from social media platforms such as Twitter. Most current methods rely on attention weights to extract information from both text and images but are often unreliable and lack interpretability. To address this problem, we propose incorporating uncertainty estimation into the MNER task, producing trustworthy predictions. Our proposed algorithm models the distribution of each modality as a Normal-inverse Gamma distribution, and fuses them into a unified distribution with an evidential fusion mechanism, enabling hierarchical characterization of uncertainties and promotion of prediction accuracy and trustworthiness. Additionally, we explore the potential of pre-trained large foundation models in MNER and propose an efficient fusion approach that leverages their robust feature representations. Experiments on two datasets demonstrate that our proposed method outperforms the baselines and achieves new state-of-the-art performance.

Title: The Importance of Robust Features in Mitigating Catastrophic Forgetting. (arXiv:2306.17091v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17091
Code URL: null
Copy Paste: [[2306.17091] The Importance of Robust Features in Mitigating Catastrophic Forgetting](http://arxiv.org/abs/2306.17091) #robust
Summary:
Continual learning (CL) is an approach to address catastrophic forgetting, which refers to forgetting previously learned knowledge by neural networks when trained on new tasks or data distributions. The adversarial robustness has decomposed features into robust and non-robust types and demonstrated that models trained on robust features significantly enhance adversarial robustness. However, no study has been conducted on the efficacy of robust features from the lens of the CL model in mitigating catastrophic forgetting in CL. In this paper, we introduce the CL robust dataset and train four baseline models on both the standard and CL robust datasets. Our results demonstrate that the CL models trained on the CL robust dataset experienced less catastrophic forgetting of the previously learned tasks than when trained on the standard dataset. Our observations highlight the significance of the features provided to the underlying CL models, showing that CL robust features can alleviate catastrophic forgetting.

Title: Generate Anything Anywhere in Any Scene. (arXiv:2306.17154v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17154
Code URL: null
Copy Paste: [[2306.17154] Generate Anything Anywhere in Any Scene](http://arxiv.org/abs/2306.17154) #robust
Summary:
Text-to-image diffusion models have attracted considerable interest due to their wide applicability across diverse fields. However, challenges persist in creating controllable models for personalized object generation. In this paper, we first identify the entanglement issues in existing personalized generative models, and then propose a straightforward and efficient data augmentation training strategy that guides the diffusion model to focus solely on object identity. By inserting the plug-and-play adapter layers from a pre-trained controllable diffusion model, our model obtains the ability to control the location and size of each generated personalized object. During inference, we propose a regionally-guided sampling technique to maintain the quality and fidelity of the generated images. Our method achieves comparable or superior fidelity for personalized objects, yielding a robust, versatile, and controllable text-to-image diffusion model that is capable of generating realistic and personalized images. Our approach demonstrates significant potential for various applications, such as those in art, entertainment, and advertising design.

Title: CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?. (arXiv:2306.16636v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16636
Code URL: null
Copy Paste: [[2306.16636] CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?](http://arxiv.org/abs/2306.16636) #robust
Summary:
We present the Chinese Elementary School Math Word Problems (CMATH) dataset, comprising 1.7k elementary school-level math word problems with detailed annotations, source from actual Chinese workbooks and exams. This dataset aims to provide a benchmark tool for assessing the following question: to what grade level of elementary school math do the abilities of popular large language models (LLMs) correspond? We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60\%) across all six elementary school grades, while other models falter at different grade levels. Furthermore, we assess the robustness of several top-performing LLMs by augmenting the original problems in the CMATH dataset with distracting information. Our findings reveal that GPT-4 is able to maintains robustness, while other model fail. We anticipate that our study will expose limitations in LLMs' arithmetic and reasoning capabilities, and promote their ongoing development and advancement.

Title: Evaluating Paraphrastic Robustness in Textual Entailment Models. (arXiv:2306.16722v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16722
Code URL: null
Copy Paste: [[2306.16722] Evaluating Paraphrastic Robustness in Textual Entailment Models](http://arxiv.org/abs/2306.16722) #robust
Summary:
We present PaRTE, a collection of 1,126 pairs of Recognizing Textual Entailment (RTE) examples to evaluate whether models are robust to paraphrasing. We posit that if RTE models understand language, their predictions should be consistent across inputs that share the same meaning. We use the evaluation set to determine if RTE models' predictions change when examples are paraphrased. In our experiments, contemporary models change their predictions on 8-16\% of paraphrased examples, indicating that there is still room for improvement.

Title: LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT. (arXiv:2306.17103v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.17103
Code URL: null
Copy Paste: [[2306.17103] LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT](http://arxiv.org/abs/2306.17103) #robust
Summary:
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.

Title: Long-Term Hourly Scenario Generation for Correlated Wind and Solar Power combining Variational Autoencoders with Radial Basis Function Kernels. (arXiv:2306.16427v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16427
Code URL: null
Copy Paste: [[2306.16427] Long-Term Hourly Scenario Generation for Correlated Wind and Solar Power combining Variational Autoencoders with Radial Basis Function Kernels](http://arxiv.org/abs/2306.16427) #robust
Summary:
Accurate generation of realistic future scenarios of renewable energy generation is crucial for long-term planning and operation of electrical systems, especially considering the increasing focus on sustainable energy and the growing penetration of renewable generation in energy matrices. These predictions enable power system operators and energy planners to effectively manage the variability and intermittency associated with renewable generation, allowing for better grid stability, improved energy management, and enhanced decision-making processes. In this paper, we propose an innovative method for generating long-term hourly scenarios for wind and solar power generation, taking into consideration the correlation between these two energy sources. To achieve this, we combine the capabilities of a Variational Autoencoder (VAE) with the additional benefits of incorporating the Radial Basis Function (RBF) kernel in our artificial neural network architecture. By incorporating them, we aim to obtain a latent space with improved regularization properties. To evaluate the effectiveness of our proposed method, we conduct experiments in a representative study scenario, utilizing real-world wind and solar power generation data from the Brazil system. We compare the scenarios generated by our model with the observed data and with other sets of scenarios produced by a conventional VAE architecture. Our experimental results demonstrate that the proposed method can generate long-term hourly scenarios for wind and solar power generation that are highly correlated, accurately capturing the temporal and spatial characteristics of these energy sources. Taking advantage of the benefits of RBF in obtaining a well-regularized latent space, our approach offers improved accuracy and robustness in generating long-term hourly scenarios for renewable energy generation.

Title: Non-Convex Optimizations for Machine Learning with Theoretical Guarantee: Robust Matrix Completion and Neural Network Learning. (arXiv:2306.16557v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16557
Code URL: null
Copy Paste: [[2306.16557] Non-Convex Optimizations for Machine Learning with Theoretical Guarantee: Robust Matrix Completion and Neural Network Learning](http://arxiv.org/abs/2306.16557) #robust
Summary:
Despite the recent development in machine learning, most learning systems are still under the concept of "black box", where the performance cannot be understood and derived. With the rise of safety and privacy concerns in public, designing an explainable learning system has become a new trend in machine learning. In general, many machine learning problems are formulated as minimizing (or maximizing) some loss function. Since real data are most likely generated from non-linear models, the loss function is non-convex in general. Unlike the convex optimization problem, gradient descent algorithms will be trapped in spurious local minima in solving non-convex optimization. Therefore, it is challenging to provide explainable algorithms when studying non-convex optimization problems. In this thesis, two popular non-convex problems are studied: (1) low-rank matrix completion and (2) neural network learning.

Title: Gesture Recognition with mmWave Wi-Fi Access Points: Lessons Learned. (arXiv:2306.17062v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.17062
Code URL: null
Copy Paste: [[2306.17062] Gesture Recognition with mmWave Wi-Fi Access Points: Lessons Learned](http://arxiv.org/abs/2306.17062) #robust
Summary:
In recent years, channel state information (CSI) at sub-6 GHz has been widely exploited for Wi-Fi sensing, particularly for activity and gesture recognition. In this work, we instead explore mmWave (60 GHz) Wi-Fi signals for gesture recognition/pose estimation. Our focus is on the mmWave Wi-Fi signals so that they can be used not only for high data rate communication but also for improved sensing e.g., for extended reality (XR) applications. For this reason, we extract spatial beam signal-to-noise ratios (SNRs) from the periodic beam training employed by IEEE 802.11ad devices. We consider a set of 10 gestures/poses motivated by XR applications. We conduct experiments in two environments and with three people.As a comparison, we also collect CSI from IEEE 802.11ac devices. To extract features from the CSI and the beam SNR, we leverage a deep neural network (DNN). The DNN classifier achieves promising results on the beam SNR task with state-of-the-art 96.7% accuracy in a single environment, even with a limited dataset. We also investigate the robustness of the beam SNR against CSI across different environments. Our experiments reveal that features from the CSI generalize without additional re-training, while those from beam SNRs do not. Therefore, re-training is required in the latter case.

biometric

steal

extraction

Title: Weight Compander: A Simple Weight Reparameterization for Regularization. (arXiv:2306.16993v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16993
Code URL: null
Copy Paste: [[2306.16993] Weight Compander: A Simple Weight Reparameterization for Regularization](http://arxiv.org/abs/2306.16993) #extraction
Summary:
Regularization is a set of techniques that are used to improve the generalization ability of deep neural networks. In this paper, we introduce weight compander (WC), a novel effective method to improve generalization by reparameterizing each weight in deep neural networks using a nonlinear function. It is a general, intuitive, cheap and easy to implement method, which can be combined with various other regularization techniques. Large weights in deep neural networks are a sign of a more complex network that is overfitted to the training data. Moreover, regularized networks tend to have a greater range of weights around zero with fewer weights centered at zero. We introduce a weight reparameterization function which is applied to each weight and implicitly reduces overfitting by restricting the magnitude of the weights while forcing them away from zero at the same time. This leads to a more democratic decision-making in the network. Firstly, individual weights cannot have too much influence in the prediction process due to the restriction of their magnitude. Secondly, more weights are used in the prediction process, since they are forced away from zero during the training. This promotes the extraction of more features from the input data and increases the level of weight redundancy, which makes the network less sensitive to statistical differences between training and test data. We extend our method to learn the hyperparameters of the introduced weight reparameterization function. This avoids hyperparameter search and gives the network the opportunity to align the weight reparameterization with the training progress. We show experimentally that using weight compander in addition to standard regularization methods improves the performance of neural networks.

Title: Unsupervised 3D registration through optimization-guided cyclical self-training. (arXiv:2306.16997v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16997
Code URL: https://github.com/multimodallearning/reg-cyclical-self-train
Copy Paste: [[2306.16997] Unsupervised 3D registration through optimization-guided cyclical self-training](http://arxiv.org/abs/2306.16997) #extraction
Summary:
State-of-the-art deep learning-based registration methods employ three different learning strategies: supervised learning, which requires costly manual annotations, unsupervised learning, which heavily relies on hand-crafted similarity metrics designed by domain experts, or learning from synthetic data, which introduces a domain shift. To overcome the limitations of these strategies, we propose a novel self-supervised learning paradigm for unsupervised registration, relying on self-training. Our idea is based on two key insights. Feature-based differentiable optimizers 1) perform reasonable registration even from random features and 2) stabilize the training of the preceding feature extraction network on noisy labels. Consequently, we propose cyclical self-training, where pseudo labels are initialized as the displacement fields inferred from random features and cyclically updated based on more and more expressive features from the learning feature extractor, yielding a self-reinforcement effect. We evaluate the method for abdomen and lung registration, consistently surpassing metric-based supervision and outperforming diverse state-of-the-art competitors. Source code is available at https://github.com/multimodallearning/reg-cyclical-self-train.

membership infer

federate

Title: Elastically-Constrained Meta-Learner for Federated Learning. (arXiv:2306.16703v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16703
Code URL: null
Copy Paste: [[2306.16703] Elastically-Constrained Meta-Learner for Federated Learning](http://arxiv.org/abs/2306.16703) #federate
Summary:
Federated learning is an approach to collaboratively training machine learning models for multiple parties that prohibit data sharing. One of the challenges in federated learning is non-IID data between clients, as a single model can not fit the data distribution for all clients. Meta-learning, such as Per-FedAvg, is introduced to cope with the challenge. Meta-learning learns shared initial parameters for all clients. Each client employs gradient descent to adapt the initialization to local data distributions quickly to realize model personalization. However, due to non-convex loss function and randomness of sampling update, meta-learning approaches have unstable goals in local adaptation for the same client. This fluctuation in different adaptation directions hinders the convergence in meta-learning. To overcome this challenge, we use the historical local adapted model to restrict the direction of the inner loop and propose an elastic-constrained method. As a result, the current round inner loop keeps historical goals and adapts to better solutions. Experiments show our method boosts meta-learning convergence and improves personalization without additional calculation and communication. Our method achieved SOTA on all metrics in three public datasets.

Title: Momentum Benefits Non-IID Federated Learning Simply and Provably. (arXiv:2306.16504v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16504
Code URL: null
Copy Paste: [[2306.16504] Momentum Benefits Non-IID Federated Learning Simply and Provably](http://arxiv.org/abs/2306.16504) #federate
Summary:
Federated learning is a powerful paradigm for large-scale machine learning, but it faces significant challenges due to unreliable network connections, slow communication, and substantial data heterogeneity across clients. FedAvg and SCAFFOLD are two fundamental algorithms to address these challenges. In particular, FedAvg employs multiple local updates before communicating with a central server, while SCAFFOLD maintains a control variable on each client to compensate for "client drift" in its local updates. Various methods have been proposed in literature to enhance the convergence of these two algorithms, but they either make impractical adjustments to algorithmic structure, or rely on the assumption of bounded data heterogeneity.

This paper explores the utilization of momentum to enhance the performance of FedAvg and SCAFFOLD. When all clients participate in the training process, we demonstrate that incorporating momentum allows FedAvg to converge without relying on the assumption of bounded data heterogeneity even using a constant local learning rate. This is a novel result since existing analyses for FedAvg require bounded data heterogeneity even with diminishing local learning rates. In the case of partial client participation, we show that momentum enables SCAFFOLD to converge provably faster without imposing any additional assumptions. Furthermore, we use momentum to develop new variance-reduced extensions of FedAvg and SCAFFOLD, which exhibit state-of-the-art convergence rates. Our experimental results support all theoretical findings.

fair

Title: A systematic study of the foreground-background imbalance problem in deep learning for object detection. (arXiv:2306.16539v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16539
Code URL: null
Copy Paste: [[2306.16539] A systematic study of the foreground-background imbalance problem in deep learning for object detection](http://arxiv.org/abs/2306.16539) #fair
Summary:
The class imbalance problem in deep learning has been explored in several studies, but there has yet to be a systematic analysis of this phenomenon in object detection. Here, we present comprehensive analyses and experiments of the foreground-background (F-B) imbalance problem in object detection, which is very common and caused by small, infrequent objects of interest. We experimentally study the effects of different aspects of F-B imbalance (object size, number of objects, dataset size, object type) on detection performance. In addition, we also compare 9 leading methods for addressing this problem, including Faster-RCNN, SSD, OHEM, Libra-RCNN, Focal-Loss, GHM, PISA, YOLO-v3, and GFL with a range of datasets from different imaging domains. We conclude that (1) the F-B imbalance can indeed cause a significant drop in detection performance, (2) The detection performance is more affected by F-B imbalance when fewer training data are available, (3) in most cases, decreasing object size leads to larger performance drop than decreasing number of objects, given the same change in the ratio of object pixels to non-object pixels, (6) among all selected methods, Libra-RCNN and PISA demonstrate the best performance in addressing the issue of F-B imbalance. (7) When the training dataset size is large, the choice of method is not impactful (8) Soft-sampling methods, including focal-loss, GHM, and GFL, perform fairly well on average but are relatively unstable.

Title: Improving Fairness in Deepfake Detection. (arXiv:2306.16635v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16635
Code URL: null
Copy Paste: [[2306.16635] Improving Fairness in Deepfake Detection](http://arxiv.org/abs/2306.16635) #fair
Summary:
Despite the development of effective deepfake detection models in recent years, several recent studies have demonstrated that biases in the training data utilized to develop deepfake detection models can lead to unfair performance for demographic groups of different races and/or genders. Such can result in these groups being unfairly targeted or excluded from detection, allowing misclassified deepfakes to manipulate public opinion and erode trust in the model. While these studies have focused on identifying and evaluating the unfairness in deepfake detection, no methods have been developed to address the fairness issue of deepfake detection at the algorithm level. In this work, we make the first attempt to improve deepfake detection fairness by proposing novel loss functions to train fair deepfake detection models in ways that are agnostic or aware of demographic factors. Extensive experiments on four deepfake datasets and five deepfake detectors demonstrate the effectiveness and flexibility of our approach in improving the deepfake detection fairness.

Title: Metric-aligned Sample Selection and Critical Feature Sampling for Oriented Object Detection. (arXiv:2306.16718v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16718
Code URL: null
Copy Paste: [[2306.16718] Metric-aligned Sample Selection and Critical Feature Sampling for Oriented Object Detection](http://arxiv.org/abs/2306.16718) #fair
Summary:
Arbitrary-oriented object detection is a relatively emerging but challenging task. Although remarkable progress has been made, there still remain many unsolved issues due to the large diversity of patterns in orientation, scale, aspect ratio, and visual appearance of objects in aerial images. Most of the existing methods adopt a coarse-grained fixed label assignment strategy and suffer from the inconsistency between the classification score and localization accuracy. First, to align the metric inconsistency between sample selection and regression loss calculation caused by fixed IoU strategy, we introduce affine transformation to evaluate the quality of samples and propose a distance-based label assignment strategy. The proposed metric-aligned selection (MAS) strategy can dynamically select samples according to the shape and rotation characteristic of objects. Second, to further address the inconsistency between classification and localization, we propose a critical feature sampling (CFS) module, which performs localization refinement on the sampling location for classification task to extract critical features accurately. Third, we present a scale-controlled smooth $L_1$ loss (SC-Loss) to adaptively select high quality samples by changing the form of regression loss function based on the statistics of proposals during training. Extensive experiments are conducted on four challenging rotated object detection datasets DOTA, FAIR1M-1.0, HRSC2016, and UCAS-AOD. The results show the state-of-the-art accuracy of the proposed detector.

Title: Learning Fair Classifiers via Min-Max F-divergence Regularization. (arXiv:2306.16552v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16552
Code URL: null
Copy Paste: [[2306.16552] Learning Fair Classifiers via Min-Max F-divergence Regularization](http://arxiv.org/abs/2306.16552) #fair
Summary:
As machine learning (ML) based systems are adopted in domains such as law enforcement, criminal justice, finance, hiring and admissions, ensuring the fairness of ML aided decision-making is becoming increasingly important. In this paper, we focus on the problem of fair classification, and introduce a novel min-max F-divergence regularization framework for learning fair classification models while preserving high accuracy. Our framework consists of two trainable networks, namely, a classifier network and a bias/fairness estimator network, where the fairness is measured using the statistical notion of F-divergence. We show that F-divergence measures possess convexity and differentiability properties, and their variational representation make them widely applicable in practical gradient based training methods. The proposed framework can be readily adapted to multiple sensitive attributes and for high dimensional datasets. We study the F-divergence based training paradigm for two types of group fairness constraints, namely, demographic parity and equalized odds. We present a comprehensive set of experiments for several real-world data sets arising in multiple domains (including COMPAS, Law Admissions, Adult Income, and CelebA datasets). To quantify the fairness-accuracy tradeoff, we introduce the notion of fairness-accuracy receiver operating characteristic (FA-ROC) and a corresponding \textit{low-bias} FA-ROC, which we argue is an appropriate measure to evaluate different classifiers. In comparison to several existing approaches for learning fair classifiers (including pre-processing, post-processing and other regularization methods), we show that the proposed F-divergence based framework achieves state-of-the-art performance with respect to the trade-off between accuracy and fairness.

interpretability

explainability

watermark

diffusion

Title: DiffusionSTR: Diffusion Model for Scene Text Recognition. (arXiv:2306.16707v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16707
Code URL: null
Copy Paste: [[2306.16707] DiffusionSTR: Diffusion Model for Scene Text Recognition](http://arxiv.org/abs/2306.16707) #diffusion
Summary:
This paper presents Diffusion Model for Scene Text Recognition (DiffusionSTR), an end-to-end text recognition framework using diffusion models for recognizing text in the wild. While existing studies have viewed the scene text recognition task as an image-to-text transformation, we rethought it as a text-text one under images in a diffusion model. We show for the first time that the diffusion model can be applied to text recognition. Furthermore, experimental results on publicly available datasets show that the proposed method achieves competitive accuracy compared to state-of-the-art methods.

Title: PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing. (arXiv:2306.16894v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16894
Code URL: null
Copy Paste: [[2306.16894] PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing](http://arxiv.org/abs/2306.16894) #diffusion
Summary:
Diffusion models have showcased their remarkable capability to synthesize diverse and high-quality images, sparking interest in their application for real image editing. However, existing diffusion-based approaches for local image editing often suffer from undesired artifacts due to the pixel-level blending of the noised target images and diffusion latent variables, which lack the necessary semantics for maintaining image consistency. To address these issues, we propose PFB-Diff, a Progressive Feature Blending method for Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly integrates text-guided generated content into the target image through multi-level feature blending. The rich semantics encoded in deep features and the progressive blending scheme from high to low levels ensure semantic coherence and high quality in edited images. Additionally, we introduce an attention masking mechanism in the cross-attention layers to confine the impact of specific words to desired regions, further improving the performance of background editing. PFB-Diff can effectively address various editing tasks, including object/background replacement and object attribute editing. Our method demonstrates its superior performance in terms of image fidelity, editing accuracy, efficiency, and faithfulness to the original image, without the need for fine-tuning or training.

Title: One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. (arXiv:2306.16928v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16928
Code URL: null
Copy Paste: [[2306.16928] One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization](http://arxiv.org/abs/2306.16928) #diffusion
Summary:
Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D inconsistency results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

Title: DreamDiffusion: Generating High-Quality Images from Brain EEG Signals. (arXiv:2306.16934v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16934
Code URL: null
Copy Paste: [[2306.16934] DreamDiffusion: Generating High-Quality Images from Brain EEG Signals](http://arxiv.org/abs/2306.16934) #diffusion
Summary:
This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pre-trained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost ``thoughts-to-image'', with potential applications in neuroscience and computer vision.

Title: Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation. (arXiv:2306.17074v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17074
Code URL: null
Copy Paste: [[2306.17074] Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation](http://arxiv.org/abs/2306.17074) #diffusion
Summary:
One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoints heatmaps by a neural network. Existing methods typically improve the quality of heatmaps by customized architectures, such as high-resolution representation and vision Transformers. In this paper, we propose \textbf{DiffusionPose}, a new scheme that formulates 2D HPE as a keypoints heatmaps generation problem from noised heatmaps. During training, the keypoints are diffused to random distribution by adding noises and the diffusion model learns to recover ground-truth heatmaps from noised heatmaps with respect to conditions constructed by image feature. During inference, the diffusion model generates heatmaps from initialized heatmaps in a progressive denoising way. Moreover, we further explore improving the performance of DiffusionPose with conditions from human structural information. Extensive experiments show the prowess of our DiffusionPose, with improvements of 1.6, 1.2, and 1.2 mAP on widely-used COCO, CrowdPose, and AI Challenge datasets, respectively.

Title: Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation. (arXiv:2306.17115v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17115
Code URL: null
Copy Paste: [[2306.17115] Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation](http://arxiv.org/abs/2306.17115) #diffusion
Summary:
We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing inconsistent results with the conditions because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textural conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.

Title: ID-Pose: Sparse-view Camera Pose Estimation by Inverting Diffusion Models. (arXiv:2306.17140v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17140
Code URL: null
Copy Paste: [[2306.17140] ID-Pose: Sparse-view Camera Pose Estimation by Inverting Diffusion Models](http://arxiv.org/abs/2306.17140) #diffusion
Summary:
Given sparse views of an object, estimating their camera poses is a long-standing and intractable problem. We harness the pre-trained diffusion model of novel views conditioned on viewpoints (Zero-1-to-3). We present ID-Pose which inverses the denoising diffusion process to estimate the relative pose given two input images. ID-Pose adds a noise on one image, and predicts the noise conditioned on the other image and a decision variable for the pose. The prediction error is used as the objective to find the optimal pose with the gradient descent method. ID-Pose can handle more than two images and estimate each of the poses with multiple image pairs from triangular relationships. ID-Pose requires no training and generalizes to real-world images. We conduct experiments using high-quality real-scanned 3D objects, where ID-Pose significantly outperforms state-of-the-art methods.

Title: Filtered-Guided Diffusion: Fast Filter Guidance for Black-Box Diffusion Models. (arXiv:2306.17141v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17141
Code URL: null
Copy Paste: [[2306.17141] Filtered-Guided Diffusion: Fast Filter Guidance for Black-Box Diffusion Models](http://arxiv.org/abs/2306.17141) #diffusion
Summary:
Recent advances in diffusion-based generative models have shown incredible promise for Image-to-Image translation and editing. Most recent work in this space relies on additional training or architecture-specific adjustments to the diffusion process. In this work, we show that much of this low-level control can be achieved without additional training or any access to features of the diffusion model. Our method simply applies a filter to the input of each diffusion step based on the output of the previous step in an adaptive manner. Notably, this approach does not depend on any specific architecture or sampler and can be done without access to internal features of the network, making it easy to combine with other techniques, samplers, and diffusion architectures. Furthermore, it has negligible cost to performance, and allows for more continuous adjustment of guidance strength than other approaches. We show FGD offers a fast and strong baseline that is competitive with recent architecture-dependent approaches. Furthermore, FGD can also be used as a simple add-on to enhance the structural guidance of other state-of-the-art I2I methods. Finally, our derivation of this method helps to understand the impact of self attention, a key component of other recent architecture-specific I2I approaches, in a more architecture-independent way. Project page: https://github.com/jaclyngu/FilteredGuidedDiffusion

Title: SaGess: Sampling Graph Denoising Diffusion Model for Scalable Graph Generation. (arXiv:2306.16827v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16827
Code URL: null
Copy Paste: [[2306.16827] SaGess: Sampling Graph Denoising Diffusion Model for Scalable Graph Generation](http://arxiv.org/abs/2306.16827) #diffusion
Summary:
Over recent years, denoising diffusion generative models have come to be considered as state-of-the-art methods for synthetic data generation, especially in the case of generating images. These approaches have also proved successful in other applications such as tabular and graph data generation. However, due to computational complexity, to this date, the application of these techniques to graph data has been restricted to small graphs, such as those used in molecular modeling. In this paper, we propose SaGess, a discrete denoising diffusion approach, which is able to generate large real-world networks by augmenting a diffusion model (DiGress) with a generalized divide-and-conquer framework. The algorithm is capable of generating larger graphs by sampling a covering of subgraphs of the initial graph in order to train DiGress. SaGess then constructs a synthetic graph using the subgraphs that have been generated by DiGress. We evaluate the quality of the synthetic data sets against several competitor methods by comparing graph statistics between the original and synthetic samples, as well as evaluating the utility of the synthetic data set produced by using it to train a task-driven model, namely link prediction. In our experiments, SaGess, outperforms most of the one-shot state-of-the-art graph generating methods by a significant factor, both on the graph metrics and on the link prediction task.

Title: Diffusion-Jump GNNs: Homophiliation via Learnable Metric Filters. (arXiv:2306.16976v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16976
Code URL: null
Copy Paste: [[2306.16976] Diffusion-Jump GNNs: Homophiliation via Learnable Metric Filters](http://arxiv.org/abs/2306.16976) #diffusion
Summary:
High-order Graph Neural Networks (HO-GNNs) have been developed to infer consistent latent spaces in the heterophilic regime, where the label distribution is not correlated with the graph structure. However, most of the existing HO-GNNs are hop-based, i.e., they rely on the powers of the transition matrix. As a result, these architectures are not fully reactive to the classification loss and the achieved structural filters have static supports. In other words, neither the filters' supports nor their coefficients can be learned with these networks. They are confined, instead, to learn combinations of filters. To address the above concerns, we propose Diffusion-jump GNNs a method relying on asymptotic diffusion distances that operates on jumps. A diffusion-pump generates pairwise distances whose projections determine both the support and coefficients of each structural filter. These filters are called jumps because they explore a wide range of scales in order to find bonds between scattered nodes with the same label. Actually, the full process is controlled by the classification loss. Both the jumps and the diffusion distances react to classification errors (i.e. they are learnable). Homophiliation, i.e., the process of learning piecewise smooth latent spaces in the heterophilic regime, is formulated as a Dirichlet problem: the known labels determine the border nodes and the diffusion-pump ensures a minimal deviation of the semi-supervised grouping from a canonical unsupervised grouping. This triggers the update of both the diffusion distances and, consequently, the jumps in order to minimize the classification error. The Dirichlet formulation has several advantages. It leads to the definition of structural heterophily, a novel measure beyond edge heterophily. It also allows us to investigate links with (learnable) diffusion distances, absorbing random walks and stochastic diffusion.

noise learning

data-free

Title: NaturalInversion: Data-Free Image Synthesis Improving Real-World Consistency. (arXiv:2306.16661v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16661
Code URL: https://github.com/kdst-team/naturalinversion
Copy Paste: [[2306.16661] NaturalInversion: Data-Free Image Synthesis Improving Real-World Consistency](http://arxiv.org/abs/2306.16661) #data-free
Summary:
We introduce NaturalInversion, a novel model inversion-based method to synthesize images that agrees well with the original data distribution without using real data. In NaturalInversion, we propose: (1) a Feature Transfer Pyramid which uses enhanced image prior of the original data by combining the multi-scale feature maps extracted from the pre-trained classifier, (2) a one-to-one approach generative model where only one batch of images are synthesized by one generator to bring the non-linearity to optimization and to ease the overall optimizing process, (3) learnable Adaptive Channel Scaling parameters which are end-to-end trained to scale the output image channel to utilize the original image prior further. With our NaturalInversion, we synthesize images from classifiers trained on CIFAR-10/100 and show that our images are more consistent with original data distribution than prior works by visualization and additional analysis. Furthermore, our synthesized images outperform prior works on various applications such as knowledge distillation and pruning, demonstrating the effectiveness of our proposed method.

transformer

Title: BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models. (arXiv:2306.16678v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16678
Code URL: https://github.com/phuoc-hoan-le/binaryvit
Copy Paste: [[2306.16678] BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models](http://arxiv.org/abs/2306.16678) #transformer
Summary:
With the increasing popularity and the increasing size of vision transformers (ViTs), there has been an increasing interest in making them more efficient and less computationally costly for deployment on edge devices with limited computing resources. Binarization can be used to help reduce the size of ViT models and their computational cost significantly, using popcount operations when the weights and the activations are in binary. However, ViTs suffer a larger performance drop when directly applying convolutional neural network (CNN) binarization methods or existing binarization methods to binarize ViTs compared to CNNs on datasets with a large number of classes such as ImageNet-1k. With extensive analysis, we find that binary vanilla ViTs such as DeiT miss out on a lot of key architectural properties that CNNs have that allow binary CNNs to have much higher representational capability than binary vanilla ViT. Therefore, we propose BinaryViT, in which inspired by the CNN architecture, we include operations from the CNN architecture into a pure ViT architecture to enrich the representational capability of a binary ViT without introducing convolutions. These include an average pooling layer instead of a token pooling layer, a block that contains multiple average pooling branches, an affine transformation right before the addition of each main residual connection, and a pyramid structure. Experimental results on the ImageNet-1k dataset show the effectiveness of these operations that allow a binary pure ViT model to be competitive with previous state-of-the-art (SOTA) binary CNN models.

Title: SaaFormer: Spectral-spatial Axial Aggregation Transformer for Hyperspectral Image Classification. (arXiv:2306.16759v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16759
Code URL: null
Copy Paste: [[2306.16759] SaaFormer: Spectral-spatial Axial Aggregation Transformer for Hyperspectral Image Classification](http://arxiv.org/abs/2306.16759) #transformer
Summary:
Hyperspectral images (HSI) captured from earth observing satellites and aircraft is becoming increasingly important for applications in agriculture, environmental monitoring, mining, etc. Due to the limited available hyperspectral datasets, the pixel-wise random sampling is the most commonly used training-test dataset partition approach, which has significant overlap between samples in training and test datasets. Furthermore, our experimental observations indicates that regions with larger overlap often exhibit higher classification accuracy. Consequently, the pixel-wise random sampling approach poses a risk of data leakage. Thus, we propose a block-wise sampling method to minimize the potential for data leakage. Our experimental findings also confirm the presence of data leakage in models such as 2DCNN. Further, We propose a spectral-spatial axial aggregation transformer model, namely SaaFormer, to address the challenges associated with hyperspectral image classifier that considers HSI as long sequential three-dimensional images. The model comprises two primary components: axial aggregation attention and multi-level spectral-spatial extraction. The axial aggregation attention mechanism effectively exploits the continuity and correlation among spectral bands at each pixel position in hyperspectral images, while aggregating spatial dimension features. This enables SaaFormer to maintain high precision even under block-wise sampling. The multi-level spectral-spatial extraction structure is designed to capture the sensitivity of different material components to specific spectral bands, allowing the model to focus on a broader range of spectral details. The results on six publicly available datasets demonstrate that our model exhibits comparable performance when using random sampling, while significantly outperforming other methods when employing block-wise sampling partition.

Title: MotionTrack: End-to-End Transformer-based Multi-Object Tracing with LiDAR-Camera Fusion. (arXiv:2306.17000v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17000
Code URL: null
Copy Paste: [[2306.17000] MotionTrack: End-to-End Transformer-based Multi-Object Tracing with LiDAR-Camera Fusion](http://arxiv.org/abs/2306.17000) #transformer
Summary:
Multiple Object Tracking (MOT) is crucial to autonomous vehicle perception. End-to-end transformer-based algorithms, which detect and track objects simultaneously, show great potential for the MOT task. However, most existing methods focus on image-based tracking with a single object category. In this paper, we propose an end-to-end transformer-based MOT algorithm (MotionTrack) with multi-modality sensor inputs to track objects with multiple classes. Our objective is to establish a transformer baseline for the MOT in an autonomous driving environment. The proposed algorithm consists of a transformer-based data association (DA) module and a transformer-based query enhancement module to achieve MOT and Multiple Object Detection (MOD) simultaneously. The MotionTrack and its variations achieve better results (AMOTA score at 0.55) on the nuScenes dataset compared with other classical baseline models, such as the AB3DMOT, the CenterTrack, and the probabilistic 3D Kalman filter. In addition, we prove that a modified attention mechanism can be utilized for DA to accomplish the MOT, and aggregate history features to enhance the MOD performance.

Title: Learning Nuclei Representations with Masked Image Modelling. (arXiv:2306.17116v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17116
Code URL: null
Copy Paste: [[2306.17116] Learning Nuclei Representations with Masked Image Modelling](http://arxiv.org/abs/2306.17116) #transformer
Summary:
Masked image modelling (MIM) is a powerful self-supervised representation learning paradigm, whose potential has not been widely demonstrated in medical image analysis. In this work, we show the capacity of MIM to capture rich semantic representations of Haemotoxylin & Eosin (H&E)-stained images at the nuclear level. Inspired by Bidirectional Encoder representation from Image Transformers (BEiT), we split the images into smaller patches and generate corresponding discrete visual tokens. In addition to the regular grid-based patches, typically used in visual Transformers, we introduce patches of individual cell nuclei. We propose positional encoding of the irregular distribution of these structures within an image. We pre-train the model in a self-supervised manner on H&E-stained whole-slide images of diffuse large B-cell lymphoma, where cell nuclei have been segmented. The pre-training objective is to recover the original discrete visual tokens of the masked image on the one hand, and to reconstruct the visual tokens of the masked object instances on the other. Coupling these two pre-training tasks allows us to build powerful, context-aware representations of nuclei. Our model generalizes well and can be fine-tuned on downstream classification tasks, achieving improved cell classification accuracy on PanNuke dataset by more than 5% compared to current instance segmentation methods.

Title: An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training. (arXiv:2306.17165v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17165
Code URL: null
Copy Paste: [[2306.17165] An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training](http://arxiv.org/abs/2306.17165) #transformer
Summary:
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image set with multiple task labels. Such multi-label data sets are rare, small, and expensive. We say heterogeneous to refer to image sets with different task labels, or to combinations of single-task datasets. Few have explored training on such heterogeneous datasets. General-purpose vision models are still dominated by single-task pretraining, and it remains unclear how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes. The challenges lie in managing large intrinsic differences among vision tasks, including data distribution, architectures, task-specific modules, dataset scales, and sampling strategies. To address these challenges, we propose to modify and scale up mixture-of-experts (MoE) vision transformers, so that they can simultaneously learn classification, detection, and segmentation on diverse mainstream vision datasets including ImageNet, COCO, and ADE20K. Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks. Due to its emergent modularity, this general-purpose model decomposes into high-performing components, efficiently adapting to downstream tasks. We can fine-tune it with fewer training parameters, fewer model parameters, and less computation. Additionally, its modularity allows for easy expansion in continual-learning-without-forgetting scenarios. Finally, these functions can be controlled and combined to meet various demands of downstream tasks.

Title: An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs. (arXiv:2306.16601v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.16601
Code URL: https://github.com/intel/intel-extension-for-transformers
Copy Paste: [[2306.16601] An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs](http://arxiv.org/abs/2306.16601) #transformer
Summary:
In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix - dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator on widely-used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's Deepsparse under same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on Github: https://github.com/intel/intel-extension-for-transformers.

Title: A negation detection assessment of GPTs: analysis with the xNot360 dataset. (arXiv:2306.16638v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16638
Code URL: null
Copy Paste: [[2306.16638] A negation detection assessment of GPTs: analysis with the xNot360 dataset](http://arxiv.org/abs/2306.16638) #transformer
Summary:
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension. Our study assesses the negation detection performance of Generative Pre-trained Transformer (GPT) models, specifically GPT-2, GPT-3, GPT-3.5, and GPT-4. We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset. Our approach examines sentence pairs labeled to indicate whether the second sentence negates the first. Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction. The overall proficiency of the GPT models in negation detection remains relatively modest, indicating that this task pushes the boundaries of their natural language understanding capabilities. We not only highlight the constraints of GPT models in handling negation but also emphasize the importance of logical reliability in high-stakes domains such as healthcare, science, and law.

Title: Probabilistic Linguistic Knowledge and Token-level Text Augmentation. (arXiv:2306.16644v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16644
Code URL: null
Copy Paste: [[2306.16644] Probabilistic Linguistic Knowledge and Token-level Text Augmentation](http://arxiv.org/abs/2306.16644) #transformer
Summary:
This paper investigates the effectiveness of token-level text augmentation and the role of probabilistic linguistic knowledge within a linguistically-motivated evaluation context. Two text augmentation programs, REDA and REDA${NG}$, were developed, both implementing five token-level text editing operations: Synonym Replacement (SR), Random Swap (RS), Random Insertion (RI), Random Deletion (RD), and Random Mix (RM). REDA${NG}$ leverages pretrained $n$-gram language models to select the most likely augmented texts from REDA's output. Comprehensive and fine-grained experiments were conducted on a binary question matching classification task in both Chinese and English. The results strongly refute the general effectiveness of the five token-level text augmentation techniques under investigation, whether applied together or separately, and irrespective of various common classification model types used, including transformers. Furthermore, the role of probabilistic linguistic knowledge is found to be minimal.

Title: Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications. (arXiv:2306.16710v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16710
Code URL: null
Copy Paste: [[2306.16710] Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications](http://arxiv.org/abs/2306.16710) #transformer
Summary:
Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children's pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

Title: Leveraging Cross-Utterance Context For ASR Decoding. (arXiv:2306.16903v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16903
Code URL: null
Copy Paste: [[2306.16903] Leveraging Cross-Utterance Context For ASR Decoding](http://arxiv.org/abs/2306.16903) #transformer
Summary:
While external language models (LMs) are often incorporated into the decoding stage of automated speech recognition systems, these models usually operate with limited context. Cross utterance information has been shown to be beneficial during second pass re-scoring, however this limits the hypothesis space based on the local information available to the first pass LM. In this work, we investigate the incorporation of long-context transformer LMs for cross-utterance decoding of acoustic models via beam search, and compare against results from n-best rescoring. Results demonstrate that beam search allows for an improved use of cross-utterance context. When evaluating on the long-format dataset AMI, results show a 0.7\% and 0.3\% absolute reduction on dev and test sets compared to the single-utterance setting, with improvements when including up to 500 tokens of prior context. Evaluations are also provided for Tedlium-1 with less significant improvements of around 0.1\% absolute.

generative

Title: Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering. (arXiv:2306.16713v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16713
Code URL: null
Copy Paste: [[2306.16713] Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering](http://arxiv.org/abs/2306.16713) #generative
Summary:
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). The RETVQA is distinctively different and more challenging than the traditionally-studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our relevance encoder for free-form fluent answer generation. Further, we introduce the largest dataset in this space, namely RETVQA, which has the following salient features: multi-image and retrieval requirement for VQA, metadata-independent questions over a pool of heterogeneous images, expecting a mix of classification-oriented and open-ended generative answers. Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA and also outperforms state-of-the-art methods by 4.9% and 11.8% on the image segment of the publicly available WebQA dataset on the accuracy and fluency metrics, respectively.

Title: MEMD-ABSA: A Multi-Element Multi-Domain Dataset for Aspect-Based Sentiment Analysis. (arXiv:2306.16956v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16956
Code URL: https://github.com/nustm/memd-absa
Copy Paste: [[2306.16956] MEMD-ABSA: A Multi-Element Multi-Domain Dataset for Aspect-Based Sentiment Analysis](http://arxiv.org/abs/2306.16956) #generative
Summary:
Aspect-based sentiment analysis is a long-standing research interest in the field of opinion mining, and in recent years, researchers have gradually shifted their focus from simple ABSA subtasks to end-to-end multi-element ABSA tasks. However, the datasets currently used in the research are limited to individual elements of specific tasks, usually focusing on in-domain settings, ignoring implicit aspects and opinions, and with a small data scale. To address these issues, we propose a large-scale Multi-Element Multi-Domain dataset (MEMD) that covers the four elements across five domains, including nearly 20,000 review sentences and 30,000 quadruples annotated with explicit and implicit aspects and opinions for ABSA research. Meanwhile, we evaluate generative and non-generative baselines on multiple ABSA subtasks under the open domain setting, and the results show that open domain ABSA as well as mining implicit aspects and opinions remain ongoing challenges to be addressed. The datasets are publicly released at \url{https://github.com/NUSTM/MEMD-ABSA}.

Title: Synthetic Demographic Data Generation for Card Fraud Detection Using GANs. (arXiv:2306.17109v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.17109
Code URL: null
Copy Paste: [[2306.17109] Synthetic Demographic Data Generation for Card Fraud Detection Using GANs](http://arxiv.org/abs/2306.17109) #generative
Summary:
Using machine learning models to generate synthetic data has become common in many fields. Technology to generate synthetic transactions that can be used to detect fraud is also growing fast. Generally, this synthetic data contains only information about the transaction, such as the time, place, and amount of money. It does not usually contain the individual user's characteristics (age and gender are occasionally included). Using relatively complex synthetic demographic data may improve the complexity of transaction data features, thus improving the fraud detection performance. Benefiting from developments of machine learning, some deep learning models have potential to perform better than other well-established synthetic data generation methods, such as microsimulation. In this study, we built a deep-learning Generative Adversarial Network (GAN), called DGGAN, which will be used for demographic data generation. Our model generates samples during model training, which we found important to overcame class imbalance issues. This study can help improve the cognition of synthetic data and further explore the application of synthetic data generation in card fraud detection.

large language model

Title: Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023. (arXiv:2306.16545v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16545
Code URL: https://github.com/dandoge/palm
Copy Paste: [[2306.16545] Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023](http://arxiv.org/abs/2306.16545) #large language model
Summary:
We present Palm, a solution to the Long-Term Action Anticipation (LTA) task utilizing vision-language and large language models. Given an input video with annotated action periods, the LTA task aims to predict possible future actions. We hypothesize that an optimal solution should capture the interdependency between past and future actions, and be able to infer future actions based on the structure and dependency encoded in the past actions. Large language models have demonstrated remarkable commonsense-based reasoning ability. Inspired by that, Palm chains an image captioning model and a large language model. It predicts future actions based on frame descriptions and action labels extracted from the input videos. Our method outperforms other participants in the EGO4D LTA challenge and achieves the best performance in terms of action prediction. Our code is available at https://github.com/DanDoge/Palm

Title: LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding. (arXiv:2306.17107v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17107
Code URL: null
Copy Paste: [[2306.17107] LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding](http://arxiv.org/abs/2306.17107) #large language model
Summary:
Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.

Title: Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision. (arXiv:2306.16564v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16564
Code URL: null
Copy Paste: [[2306.16564] Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision](http://arxiv.org/abs/2306.16564) #large language model
Summary:
Large language models (LLMs) have demonstrated remarkable capabilities out of box for a wide range of applications, yet accuracy still remains a major growth area, especially in mission-critical domains such as biomedicine. An effective method to calibrate the confidence level on LLM responses is essential to automatically detect errors and facilitate human-in-the-loop verification. An important source of calibration signals stems from expert-stipulated programmatic supervision, which is often available at low cost but has its own limitations such as noise and coverage. In this paper, we introduce a Pareto optimal self-supervision framework that can leverage available programmatic supervision to systematically calibrate LLM responses by producing a risk score for every response, without any additional manual efforts. This is accomplished by learning a harmonizer model to align LLM output with other available supervision sources, which would assign higher risk scores to more uncertain LLM responses and facilitate error correction. Experiments on standard relation extraction tasks in biomedical and general domains demonstrate the promise of this approach, with our proposed risk scores highly correlated with the real error rate of LLMs. For the most uncertain test instances, dynamic prompting based on our proposed risk scores results in significant accuracy improvement for off-the-shelf LLMs, boosting GPT-3 results past state-of-the-art (SOTA) weak supervision and GPT-4 results past SOTA supervised results on challenging evaluation datasets.

Title: Benchmarking Large Language Model Capabilities for Conditional Generation. (arXiv:2306.16793v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.16793
Code URL: null
Copy Paste: [[2306.16793] Benchmarking Large Language Model Capabilities for Conditional Generation](http://arxiv.org/abs/2306.16793) #large language model
Summary:
Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks--while they can be used to compare systems at a high level--relate to the real world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, input and output language. Our results show that PLMs differ in their applicability to different data regimes and their generalization to multiple languages and inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.

Title: Concept-Oriented Deep Learning with Large Language Models. (arXiv:2306.17089v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.17089
Code URL: null
Copy Paste: [[2306.17089] Concept-Oriented Deep Learning with Large Language Models](http://arxiv.org/abs/2306.17089) #large language model
Summary:
Large Language Models (LLMs) have been successfully used in many natural-language tasks and applications including text generation and AI chatbots. They also are a promising new technology for concept-oriented deep learning (CODL). However, the prerequisite is that LLMs understand concepts and ensure conceptual consistency. We discuss these in this paper, as well as major uses of LLMs for CODL including concept extraction from text, concept graph extraction from text, and concept learning. Human knowledge consists of both symbolic (conceptual) knowledge and embodied (sensory) knowledge. Text-only LLMs, however, can represent only symbolic (conceptual) knowledge. Multimodal LLMs, on the other hand, are capable of representing the full range (conceptual and sensory) of human knowledge. We discuss conceptual understanding in visual-language LLMs, the most important multimodal LLMs, and major uses of them for CODL including concept extraction from image, concept graph extraction from image, and concept learning. While uses of LLMs for CODL are valuable standalone, they are particularly valuable as part of LLM applications such as AI chatbots.

segmentation

Title: Analysis of LiDAR Configurations on Off-road Semantic Segmentation Performance. (arXiv:2306.16551v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16551
Code URL: null
Copy Paste: [[2306.16551] Analysis of LiDAR Configurations on Off-road Semantic Segmentation Performance](http://arxiv.org/abs/2306.16551) #segmentation
Summary:
This paper investigates the impact of LiDAR configuration shifts on the performance of 3D LiDAR point cloud semantic segmentation models, a topic not extensively studied before. We explore the effect of using different LiDAR channels when training and testing a 3D LiDAR point cloud semantic segmentation model, utilizing Cylinder3D for the experiments. A Cylinder3D model is trained and tested on simulated 3D LiDAR point cloud datasets created using the Mississippi State University Autonomous Vehicle Simulator (MAVS) and 32, 64 channel 3D LiDAR point clouds of the RELLIS-3D dataset collected in a real-world off-road environment. Our experimental results demonstrate that sensor and spatial domain shifts significantly impact the performance of LiDAR-based semantic segmentation models. In the absence of spatial domain changes between training and testing, models trained and tested on the same sensor type generally exhibited better performance. Moreover, higher-resolution sensors showed improved performance compared to those with lower-resolution ones. However, results varied when spatial domain changes were present. In some cases, the advantage of a sensor's higher resolution led to better performance both with and without sensor domain shifts. In other instances, the higher resolution resulted in overfitting within a specific domain, causing a lack of generalization capability and decreased performance when tested on data with different sensor configurations.

Title: SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation. (arXiv:2306.16585v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16585
Code URL: null
Copy Paste: [[2306.16585] SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation](http://arxiv.org/abs/2306.16585) #segmentation
Summary:
The availability of real-time semantics greatly improves the core geometric functionality of SLAM systems, enabling numerous robotic and AR/VR applications. We present a new methodology for real-time semantic mapping from RGB-D sequences that combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. When segmenting a new frame we perform latent feature re-projection from previous frames based on differentiable rendering. Fusing re-projected feature maps from previous frames with current-frame features greatly improves image segmentation quality, compared to a baseline that processes images independently. For 3D map processing, we propose a novel geometric quasi-planar over-segmentation method that groups 3D map elements likely to belong to the same semantic classes, relying on surface normals. We also describe a novel neural network design for lightweight semantic map post-processing. Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems and matches the performance of 3D convolutional networks on three real indoor datasets, while working in real-time. Moreover, it shows better cross-sensor generalization abilities compared to 3D CNNs, enabling training and inference with different depth sensors. Code and data will be released on project page: this http URL

Title: The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot. (arXiv:2306.16623v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16623
Code URL: null
Copy Paste: [[2306.16623] The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot](http://arxiv.org/abs/2306.16623) #segmentation
Summary:
Segmentation is an essential step for remote sensing image processing. This study aims to advance the application of the Segment Anything Model (SAM), an innovative image segmentation model by Meta AI, in the field of remote sensing image analysis. SAM is known for its exceptional generalization capabilities and zero-shot learning, making it a promising approach to processing aerial and orbital images from diverse geographical contexts. Our exploration involved testing SAM across multi-scale datasets using various input prompts, such as bounding boxes, individual points, and text descriptors. To enhance the model's performance, we implemented a novel automated technique that combines a text-prompt-derived general example with one-shot training. This adjustment resulted in an improvement in accuracy, underscoring SAM's potential for deployment in remote sensing imagery and reducing the need for manual annotation. Despite the limitations encountered with lower spatial resolution images, SAM exhibits promising adaptability to remote sensing data analysis. We recommend future research to enhance the model's proficiency through integration with supplementary fine-tuning techniques and other networks. Furthermore, we provide the open-source code of our modifications on online repositories, encouraging further and broader adaptations of SAM to the remote sensing domain.

Title: Learning from Synthetic Human Group Activities. (arXiv:2306.16772v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16772
Code URL: null
Copy Paste: [[2306.16772] Learning from Synthetic Human Group Activities](http://arxiv.org/abs/2306.16772) #segmentation
Summary:
The understanding of complex human interactions and group activities has garnered attention in human-centric computer vision. However, the advancement of the related tasks is hindered due to the difficulty of obtaining large-scale labeled real-world datasets. To mitigate the issue, we propose M3Act, a multi-view multi-group multi-person human atomic action and group activity data generator. Powered by the Unity engine, M3Act contains simulation-ready 3D scenes and human assets, configurable lighting and camera systems, highly parameterized modular group activities, and a large degree of domain randomization during the data generation process. Our data generator is capable of generating large-scale datasets of human activities with multiple viewpoints, modalities (RGB images, 2D poses, 3D motions), and high-quality annotations for individual persons and multi-person groups (2D bounding boxes, instance segmentation masks, individual actions and group activity categories). Using M3Act, we perform synthetic data pre-training for 2D skeleton-based group activity recognition and RGB-based multi-person pose tracking. The results indicate that learning from our synthetic datasets largely improves the model performances on real-world datasets, with the highest gain of 5.59% and 7.32% respectively in group and person recognition accuracy on CAD2, as well as an improvement of 6.63 in MOTP on HiEve. Pre-training with our synthetic data also leads to faster model convergence on downstream tasks (up to 6.8% faster). Moreover, M3Act opens new research problems for 3D group activity generation. We release M3Act3D, an 87.6-hour 3D motion dataset of human activities with larger group sizes and higher complexity of inter-person interactions than previous multi-person datasets. We define multiple metrics and propose a competitive baseline for the novel task.

Title: MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset. (arXiv:2306.16925v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.16925
Code URL: https://github.com/openmedlab/mis-fm
Copy Paste: [[2306.16925] MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset](http://arxiv.org/abs/2306.16925) #segmentation
Summary:
Pretraining with large-scale 3D volumes has a potential for improving the segmentation performance on a target medical image dataset where the training images and annotations are limited. Due to the high cost of acquiring pixel-level segmentation annotations on the large-scale pretraining dataset, pretraining with unannotated images is highly desirable. In this work, we propose a novel self-supervised learning strategy named Volume Fusion (VF) for pretraining 3D segmentation models. It fuses several random patches from a foreground sub-volume to a background sub-volume based on a predefined set of discrete fusion coefficients, and forces the model to predict the fusion coefficient of each voxel, which is formulated as a self-supervised segmentation task without manual annotations. Additionally, we propose a novel network architecture based on parallel convolution and transformer blocks that is suitable to be transferred to different downstream segmentation tasks with various scales of organs and lesions. The proposed model was pretrained with 110k unannotated 3D CT volumes, and experiments with different downstream segmentation targets including head and neck organs, thoracic/abdominal organs showed that our pretrained model largely outperformed training from scratch and several state-of-the-art self-supervised training methods and segmentation models. The code and pretrained model are available at https://github.com/openmedlab/MIS-FM.

Title: Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization. (arXiv:2306.17075v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.17075
Code URL: https://github.com/laiyingxin2/dadf
Copy Paste: [[2306.17075] Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization](http://arxiv.org/abs/2306.17075) #segmentation
Summary:
The rapid advancements in computer vision have stimulated remarkable progress in face forgery techniques, capturing the dedicated attention of researchers committed to detecting forgeries and precisely localizing manipulated areas. Nonetheless, with limited fine-grained pixel-wise supervision labels, deepfake detection models perform unsatisfactorily on precise forgery detection and localization. To address this challenge, we introduce the well-trained vision segmentation foundation model, i.e., Segment Anything Model (SAM) in face forgery detection and localization. Based on SAM, we propose the Detect Any Deepfakes (DADF) framework with the Multiscale Adapter, which can capture short- and long-range forgery contexts for efficient fine-tuning. Moreover, to better identify forged traces and augment the model's sensitivity towards forgery regions, Reconstruction Guided Attention (RGA) module is proposed. The proposed framework seamlessly integrates end-to-end forgery localization and detection optimization. Extensive experiments on three benchmark datasets demonstrate the superiority of our approach for both forgery detection and localization. The codes will be released soon at https://github.com/laiyingxin2/DADF.