secure

Title: Information Flow Control in Machine Learning through Modular Model Architecture. (arXiv:2306.03235v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03235
Code URL: null
Copy Paste: [[2306.03235] Information Flow Control in Machine Learning through Modular Model Architecture](http://arxiv.org/abs/2306.03235) #secure
Summary:
In today's machine learning (ML) models, any part of the training data can affect its output. This lack of control for information flow from training data to model output is a major obstacle in training models on sensitive data when access control only allows individual users to access a subset of data. To enable secure machine learning for access controlled data, we propose the notion of information flow control for machine learning, and develop a secure Transformer-based language model based on the Mixture-of-Experts (MoE) architecture. The secure MoE architecture controls information flow by limiting the influence of training data from each security domain to a single expert module, and only enabling a subset of experts at inference time based on an access control policy. The evaluation using a large corpus of text data shows that the proposed MoE architecture has minimal (1.9%) performance overhead and can significantly improve model accuracy (up to 37%) by enabling training on access-controlled data.
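The inference-time step described above (enabling only the experts a user's access policy permits) can be pictured with a small sketch. This is an illustration only, not the paper's implementation: the abstract does not specify the gating function, so the softmax gate, the function name `masked_gate`, and the domain labels are all assumptions.

```python
import math

def masked_gate(logits, expert_domains, allowed_domains):
    """Softmax over experts, restricted to those the user may access.

    Hypothetical helper: experts whose security domain is not in
    `allowed_domains` get weight exactly 0, so they cannot influence the output.
    """
    allowed = [d in allowed_domains for d in expert_domains]
    if not any(allowed):
        raise ValueError("access policy permits no expert")
    # Subtract the max permitted logit for numerical stability.
    m = max(l for l, ok in zip(logits, allowed) if ok)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Example: the user may read the "public" and "finance" domains but not "hr".
weights = masked_gate([2.0, 1.0, 0.5],
                      ["public", "hr", "finance"],
                      {"public", "finance"})
```

Masking before normalisation keeps the remaining weights summing to 1, so the output is a mixture of permitted experts only.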

Title: Correlated Pseudorandomness from the Hardness of Quasi-Abelian Decoding. (arXiv:2306.03488v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03488
Code URL: null
Copy Paste: [[2306.03488] Correlated Pseudorandomness from the Hardness of Quasi-Abelian Decoding](http://arxiv.org/abs/2306.03488) #secure
Summary:
Secure computation often benefits from the use of correlated randomness to achieve fast, non-cryptographic online protocols. A recent paradigm put forth by Boyle $\textit{et al.}$ (CCS 2018, Crypto 2019) showed how pseudorandom correlation generators (PCGs) can be used to generate large amounts of useful forms of correlated (pseudo)randomness, using minimal interaction followed solely by local computation, yielding silent secure two-party computation protocols (protocols where the preprocessing phase requires almost no communication). An additional property called programmability makes it possible to extend this to N-party protocols. However, known constructions of programmable PCGs can only produce OLEs over large fields, and rely on the rather new splittable Ring-LPN assumption.

In this work, we overcome both limitations. To this end, we introduce the quasi-abelian syndrome decoding problem (QA-SD), a family of assumptions which generalises the well-established quasi-cyclic syndrome decoding assumption. Building upon QA-SD, we construct new programmable PCGs for OLEs over any field $\mathbb{F}_q$ with $q>2$. Our analysis also sheds light on the security of the ring-LPN assumption used in Boyle $\textit{et al.}$ (Crypto 2020). Using our new PCGs, we obtain the first efficient N-party silent secure computation protocols for computing general arithmetic circuits over $\mathbb{F}_q$ for any $q>2$.

Title: A Practical Framework for Storing and Searching Encrypted Data on Cloud Storage. (arXiv:2306.03547v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03547
Code URL: null
Copy Paste: [[2306.03547] A Practical Framework for Storing and Searching Encrypted Data on Cloud Storage](http://arxiv.org/abs/2306.03547) #secure
Summary:
Security has become a significant concern with the increased popularity of cloud storage services, because data stored there is vulnerable to access by third parties. Security is one of the major hurdles for users when data that resides in local storage is outsourced to a cloud server. This has given rise to concerns about data confidentiality, even after the data is deleted from cloud storage. A serious problem also arises when encrypted data needs to be shared with more people than the data owner initially designated. In addition, searching over encrypted data is a fundamental issue in cloud storage and represents a significant challenge.

Searchable encryption (SE) allows a cloud server to conduct a search over encrypted data on behalf of the data users without learning the underlying plaintexts. While many academic SE schemes show provable security, they usually expose some query information, making them less practical, weak in usability, and challenging to deploy. Also, sharing encrypted data with other authorized users requires providing each document's secret key, an approach with many limitations due to the difficulty of key management and distribution.

We have designed the system using the existing cryptographic approaches, ensuring the search on encrypted data over the cloud. The primary focus of our proposed model is to ensure user privacy and security through a less computationally intensive, user-friendly system with a trusted third party entity. To demonstrate our proposed model, we have implemented a web application called CryptoSearch as an overlay system on top of a well-known cloud storage domain. It exhibits secure search on encrypted data with no compromise to the user-friendliness and the scheme's functional performance in real-world applications.

Title: mdTLS: How to Make middlebox-aware TLS more efficient?. (arXiv:2306.03573v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03573
Code URL: null
Copy Paste: [[2306.03573] mdTLS: How to Make middlebox-aware TLS more efficient?](http://arxiv.org/abs/2306.03573) #secure
Summary:
As data transmission over the TLS protocol becomes increasingly common in IT systems, more middleboxes are deployed in networks. These middleboxes have several advantages; however, they have become targets of cyber-attacks. Many researchers have proposed revised versions of the TLS protocol to make them secure, but their approaches had some limitations. In this paper, we propose a middlebox-delegated TLS (mdTLS) protocol to improve performance based on the middlebox-aware TLS (maTLS), one of the most secure TLS protocols. We found that the computational complexity of mdTLS is about half that of maTLS. Furthermore, we formally verified that our proposal meets newly defined security goals as well as those verified by maTLS. All of the formal models and lemmas are publicly available at the following URL: https://github.com/HackProof/mdTLS.

Title: TALUS: Reinforcing TEE Confidentiality with Cryptographic Coprocessors (Technical Report). (arXiv:2306.03643v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03643
Code URL: null
Copy Paste: [[2306.03643] TALUS: Reinforcing TEE Confidentiality with Cryptographic Coprocessors (Technical Report)](http://arxiv.org/abs/2306.03643) #secure
Summary:
Platforms are nowadays typically equipped with trusted execution environments (TEEs), such as Intel SGX and ARM TrustZone. However, recent microarchitectural attacks on TEEs repeatedly broke their confidentiality guarantees, including the leakage of long-term cryptographic secrets. These systems are typically also equipped with a cryptographic coprocessor, such as a TPM or Google Titan. These coprocessors offer a unique set of security features focused on safeguarding cryptographic secrets. Still, despite their simultaneous availability, the integration between these technologies is practically nonexistent, which prevents them from benefitting from each other's strengths. In this paper, we propose TALUS, a general design and a set of three main requirements for a secure symbiosis between TEEs and cryptographic coprocessors. We implement a proof-of-concept of TALUS based on Intel SGX and a hardware TPM. We show that with TALUS, the long-term secrets used in the SGX life cycle can be moved to the TPM. We demonstrate that our design is robust even in the presence of transient execution attacks, preventing an entire class of attacks due to the reduced attack surface on the shared hardware.

security

Title: Security Knowledge-Guided Fuzzing of Deep Learning Libraries. (arXiv:2306.03269v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03269
Code URL: null
Copy Paste: [[2306.03269] Security Knowledge-Guided Fuzzing of Deep Learning Libraries](http://arxiv.org/abs/2306.03269) #security
Summary:
Many Deep Learning (DL) fuzzers have been proposed in the literature. However, most of them focus only on the high-level APIs used by users, which leaves a large number of APIs used by library developers untested. Additionally, they use general input generation rules, such as random value generation and boundary-input generation, which are ineffective at generating DL-specific malformed inputs.

To fill this gap, we first conduct an empirical study regarding root cause analysis on 447 historical security vulnerabilities of two of the most popular DL libraries, i.e., PyTorch and TensorFlow, for characterizing and understanding their malicious inputs. As a result, we categorize 18 rules regarding the construction of malicious inputs, which we believe can be used to generate effective malformed inputs for testing DL libraries. We further design and implement Orion, a new fuzzer that tests DL libraries by utilizing our malformed input generation rules mined from real-world deep learning security vulnerabilities. Specifically, Orion first collects API invocation code from various sources such as API documentation, source code, developer tests, and publicly available repositories on GitHub. Then Orion instruments these code snippets to dynamically trace execution information for each API such as parameters' types, shapes, and values. Finally, Orion combines the malformed input generation rules and the dynamic execution information to create inputs to test DL libraries.

Our evaluation on TensorFlow and PyTorch shows that Orion reports 143 bugs, 68 of which were previously unknown. Among the 68 new bugs, 58 have been fixed or confirmed by developers after we reported them, and the rest are awaiting confirmation. Compared to the state-of-the-art DL fuzzers (i.e., FreeFuzz and DocTer), Orion detects 21% and 34% more bugs respectively.

privacy

Title: OptimShare: A Unified Framework for Privacy Preserving Data Sharing -- Towards the Practical Utility of Data with Privacy. (arXiv:2306.03379v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03379
Code URL: null
Copy Paste: [[2306.03379] OptimShare: A Unified Framework for Privacy Preserving Data Sharing -- Towards the Practical Utility of Data with Privacy](http://arxiv.org/abs/2306.03379) #privacy
Summary:
Tabular data sharing serves as a common method for data exchange. However, sharing sensitive information without adequate privacy protection can compromise individual privacy. Thus, ensuring privacy-preserving data sharing is crucial. Differential privacy (DP) is regarded as the gold standard in data privacy. Despite this, current DP methods tend to generate privacy-preserving tabular datasets that often suffer from limited practical utility due to heavy perturbation and disregard for the tables' utility dynamics. Besides, there has not been much research on selective attribute release, particularly in the context of controlled partially perturbed data sharing. This has significant implications for scenarios such as cross-agency data sharing in real-world situations. We introduce OptimShare: a utility-focused, multi-criteria solution designed to perturb input datasets selectively optimized for specific real-world applications. OptimShare combines the principles of differential privacy, fuzzy logic, and probability theory to establish an integrated tool for privacy-preserving data sharing. Empirical assessments confirm that OptimShare successfully strikes a balance between better data utility and robust privacy, effectively serving various real-world problem scenarios.

Title: Machine Unlearning: A Survey. (arXiv:2306.03558v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03558
Code URL: null
Copy Paste: [[2306.03558] Machine Unlearning: A Survey](http://arxiv.org/abs/2306.03558) #privacy
Summary:
Machine learning has attracted widespread attention and evolved into an enabling technology for a wide range of highly successful applications, such as intelligent computer vision, speech recognition, medical diagnosis, and more. Yet a special need has arisen where, due to privacy, usability, and/or the right to be forgotten, information about some specific samples needs to be removed from a model, called machine unlearning. This emerging technology has drawn significant interest from both academics and industry due to its innovation and practicality. At the same time, this ambitious problem has led to numerous research efforts aimed at confronting its challenges. To the best of our knowledge, no study has analyzed this complex topic or compared the feasibility of existing unlearning solutions in different kinds of scenarios. Accordingly, with this survey, we aim to capture the key concepts of unlearning techniques. The existing solutions are classified and summarized based on their characteristics within an up-to-date and comprehensive review of each category's advantages and limitations. The survey concludes by highlighting some of the outstanding issues with unlearning techniques, along with some feasible directions for new research opportunities.

Title: Origin-Destination Network Generation via Gravity-Guided GAN. (arXiv:2306.03390v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03390
Code URL: null
Copy Paste: [[2306.03390] Origin-Destination Network Generation via Gravity-Guided GAN](http://arxiv.org/abs/2306.03390) #privacy
Summary:
Origin-destination (OD) flow, which contains valuable population mobility information including direction and volume, is critical in many urban applications, such as urban planning and transportation management. However, OD data is not always easy to access due to high costs or privacy concerns. Therefore, we must consider generating OD flows through mathematical models. Existing works utilize physical laws or machine learning (ML) models to build the association between urban structures and OD flows, but these two kinds of methods suffer from over-simplicity and poor generalization ability, respectively. In this paper, we propose to adopt the physics-informed ML paradigm, which couples physics-based scientific knowledge with data-driven ML methods, to construct a model named Origin-Destination Generation Networks (ODGN) for better population mobility modeling by leveraging the complementary strengths of physics and ML methods. Specifically, we first build a Multi-view Graph Attention Network (MGAT) to capture the urban features of every region and then use a gravity-guided predictor to obtain the OD flow between every two regions. Furthermore, we use a conditional GAN training strategy and design a sequence-based discriminator to consider the overall topological features of OD as a network. Extensive experiments on real-world datasets demonstrate the superiority of our proposed method compared with baselines.
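For readers unfamiliar with the physics prior mentioned above, the classical gravity model estimates the flow between two regions as proportional to their "masses" (for example, populations) and inversely proportional to a power of their distance. The sketch below shows that classical formula only; the abstract does not specify the form of ODGN's gravity-guided predictor, and the function names, parameters, and default exponents here are assumptions chosen for illustration.

```python
def gravity_flow(mass_i, mass_j, distance, G=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    """Classical gravity model: G * m_i^alpha * m_j^beta / d^gamma.

    All parameter defaults are illustrative, not taken from the paper.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return G * (mass_i ** alpha) * (mass_j ** beta) / (distance ** gamma)

def od_matrix(masses, distances):
    """Build an OD matrix for all region pairs; the diagonal is set to 0."""
    n = len(masses)
    return [
        [0.0 if i == j else gravity_flow(masses[i], masses[j], distances[i][j])
         for j in range(n)]
        for i in range(n)
    ]

# Two regions with populations 100 and 50, 5 distance units apart.
flows = od_matrix([100, 50], [[0, 5], [5, 0]])
```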

protect

Title: Protecting the Intellectual Property of Diffusion Models by the Watermark Diffusion Process. (arXiv:2306.03436v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03436
Code URL: null
Copy Paste: [[2306.03436] Protecting the Intellectual Property of Diffusion Models by the Watermark Diffusion Process](http://arxiv.org/abs/2306.03436) #protect
Summary:
Diffusion models have emerged as state-of-the-art deep generative architectures with the increasing demands for generation tasks. Training large diffusion models for good performance requires high resource costs, making them valuable intellectual properties to protect. However, most of the existing ownership solutions, including watermarking, mainly focus on discriminative models. This paper proposes WDM, a novel watermarking method for diffusion models, including watermark embedding, extraction, and verification. WDM embeds the watermark data through training or fine-tuning the diffusion model to learn a Watermark Diffusion Process (WDP), different from the standard diffusion process for the task data. The embedded watermark can be extracted by sampling using the shared reverse noise from the learned WDP without degrading performance on the original task. We also provide theoretical foundations and analysis of the proposed method by connecting the WDP to the diffusion process with a modified Gaussian kernel. Extensive experiments are conducted to demonstrate its effectiveness and robustness against various attacks.

defense

Title: A Survey on Federated Learning Poisoning Attacks and Defenses. (arXiv:2306.03397v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03397
Code URL: null
Copy Paste: [[2306.03397] A Survey on Federated Learning Poisoning Attacks and Defenses](http://arxiv.org/abs/2306.03397) #defense
Summary:
As one kind of distributed machine learning technique, federated learning enables multiple clients to build a model across decentralized data collaboratively without explicitly aggregating the data. Due to its ability to break data silos, federated learning has received increasing attention in many fields, including finance, healthcare, and education. However, the invisibility of clients' training data and the local training process result in some security issues. Recently, many works have been proposed to research the security attacks and defenses in federated learning, but there has been no special survey on poisoning attacks on federated learning and the corresponding defenses. In this paper, we investigate the most advanced schemes of federated learning poisoning attacks and defenses and point out the future directions in these areas.

Title: Adversarial Attacks and Defenses for Semantic Communication in Vehicular Metaverses. (arXiv:2306.03528v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03528
Code URL: null
Copy Paste: [[2306.03528] Adversarial Attacks and Defenses for Semantic Communication in Vehicular Metaverses](http://arxiv.org/abs/2306.03528) #defense
Summary:
For vehicular metaverses, one of the ultimate user-centric goals is to optimize the immersive experience and Quality of Service (QoS) for users on board. Semantic Communication (SemCom) has been introduced as a revolutionary paradigm that significantly eases communication resource pressure for vehicular metaverse applications to achieve this goal. SemCom enables high-quality and ultra-efficient vehicular communication, even with explosively increasing data traffic among vehicles. In this article, we propose a hierarchical SemCom-enabled vehicular metaverses framework consisting of the global metaverse, local metaverses, SemCom module, and resource pool. The global and local metaverses are brand-new concepts from the metaverse's distribution standpoint. Considering the QoS of users, this article explores the potential security vulnerabilities of the proposed framework. To that end, this study highlights a specific security risk to the framework's SemCom module and offers a viable defense solution, thereby encouraging community researchers to focus more on vehicular metaverse security. Finally, we provide an overview of the open issues of secure SemCom in the vehicular metaverses, notably pointing out potential future research directions.

attack

Title: Adversarial alignment: Breaking the trade-off between the strength of an attack and its relevance to human perception. (arXiv:2306.03229v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03229
Code URL: null
Copy Paste: [[2306.03229] Adversarial alignment: Breaking the trade-off between the strength of an attack and its relevance to human perception](http://arxiv.org/abs/2306.03229) #attack
Summary:
Deep neural networks (DNNs) are known to have a fundamental sensitivity to adversarial attacks, perturbations of the input that are imperceptible to humans yet powerful enough to change the visual decision of a model. Adversarial attacks have long been considered the "Achilles' heel" of deep learning, which may eventually force a shift in modeling paradigms. Nevertheless, the formidable capabilities of modern large-scale DNNs have somewhat eclipsed these early concerns. Do adversarial attacks continue to pose a threat to DNNs?

Here, we investigate how the robustness of DNNs to adversarial attacks has evolved as their accuracy on ImageNet has continued to improve. We measure adversarial robustness in two different ways: First, we measure the smallest adversarial attack needed to cause a model to change its object categorization decision. Second, we measure how aligned successful attacks are with the features that humans find diagnostic for object recognition. We find that adversarial attacks are inducing bigger and more easily detectable changes to image pixels as DNNs grow better on ImageNet, but these attacks are also becoming less aligned with features that humans find diagnostic for recognition. To better understand the source of this trade-off, we turn to the neural harmonizer, a DNN training routine that encourages models to leverage the same features as humans to solve tasks. Harmonized DNNs achieve the best of both worlds and experience attacks that are detectable and affect features that humans find diagnostic for recognition, meaning that attacks on these models are more likely to be rendered ineffective by inducing similar effects on human perception. Our findings suggest that the sensitivity of DNNs to adversarial attacks can be mitigated by DNN scale, data scale, and training routines that align models with biological intelligence.

Title: An Open Patch Generator based Fingerprint Presentation Attack Detection using Generative Adversarial Network. (arXiv:2306.03577v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03577
Code URL: null
Copy Paste: [[2306.03577] An Open Patch Generator based Fingerprint Presentation Attack Detection using Generative Adversarial Network](http://arxiv.org/abs/2306.03577) #attack
Summary:
The low-cost, user-friendly, and convenient nature of Automatic Fingerprint Recognition Systems (AFRS) makes them suitable for a wide range of applications. This spreading use of AFRS also makes them vulnerable to various security threats. Presentation Attack (PA) or spoofing is one of the threats which is caused by presenting a spoof of a genuine fingerprint to the sensor of AFRS. Fingerprint Presentation Attack Detection (FPAD) is a countermeasure intended to protect AFRS against fake or spoof fingerprints created using various fabrication materials. In this paper, we have proposed a Convolutional Neural Network (CNN) based technique that uses a Generative Adversarial Network (GAN) to augment the dataset with spoof samples generated from the proposed Open Patch Generator (OPG). This OPG is capable of generating realistic fingerprint samples which have no resemblance to the existing spoof fingerprint samples generated with other materials. The augmented dataset is fed to the DenseNet classifier which helps in increasing the performance of the Presentation Attack Detection (PAD) module for the various real-world attacks possible with unknown spoof materials. Experimental evaluations of the proposed approach are carried out on the Liveness Detection (LivDet) 2015, 2017, and 2019 competition databases. An overall accuracy of 96.20%, 94.97%, and 92.90% has been achieved on the LivDet 2015, 2017, and 2019 databases, respectively under the LivDet protocol scenarios. The performance of the proposed PAD model is also validated in the cross-material and cross-sensor attack paradigm which further exhibits its capability to be used under real-world attack scenarios.

Title: Greedy-Mine: A Profitable Mining Attack Strategy in Bitcoin-NG. (arXiv:2306.03540v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2306.03540
Code URL: null
Copy Paste: [[2306.03540] Greedy-Mine: A Profitable Mining Attack Strategy in Bitcoin-NG](http://arxiv.org/abs/2306.03540) #attack
Summary:
Bitcoin-NG is an extensible blockchain protocol based on the same trust model as Bitcoin. It divides each epoch into one Key-Block and multiple Micro-Blocks, effectively improving transaction processing capacity. Bitcoin-NG adopts a special incentive mechanism (i.e., the transaction fees in each epoch are split between the current and next leader) to maintain its security. However, there are some limitations to the existing incentive analysis of Bitcoin-NG in recent works. First, the incentive division method of Bitcoin-NG only accounts for some specific mining attack strategies of the adversary, while ignoring more stubborn attack strategies. Second, once adversaries find a whale transaction, they will deviate from the honest mining strategy to obtain extra reward. In this paper, we address these two limitations. First, we propose a novel mining strategy named the Greedy-Mine attack. Then, we formulate a Markov Decision Process (MDP) model to analyze the competition between honest miners and adversaries. Furthermore, we analyze the extra reward of adversaries and summarize the range of mining power proportion required for adversaries to launch Greedy-Mine and obtain extra returns. Finally, we make a backward-compatible progressive modification to the Bitcoin-NG protocol that would raise the threshold of propagation factor from 0 to 1. Meanwhile, we derive the winning condition of adversaries when adopting Greedy-Mine, compared with honest mining. Simulation and experimental results indicate that Bitcoin-NG is not incentive compatible and is vulnerable to the Greedy-Mine attack.

robust

Title: A Robust Likelihood Model for Novelty Detection. (arXiv:2306.03331v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03331
Code URL: null
Copy Paste: [[2306.03331] A Robust Likelihood Model for Novelty Detection](http://arxiv.org/abs/2306.03331) #robust
Summary:
Current approaches to novelty or anomaly detection are based on deep neural networks. Despite their effectiveness, neural networks are also vulnerable to imperceptible deformations of the input data. This is a serious issue in critical applications, or when data alterations are generated by an adversarial attack. While this is a known problem that has been studied in recent years for the case of supervised learning, the case of novelty detection has received very limited attention. Indeed, in this latter setting the learning is typically unsupervised because outlier data is not available during training, and new approaches for this case need to be investigated. We propose a new prior that aims at learning a robust likelihood for the novelty test, as a defense against attacks. We also integrate the same prior with a state-of-the-art novelty detection approach. Because of the geometric properties of that approach, the resulting robust training is computationally very efficient. An initial evaluation of the method indicates that it is effective at improving performance with respect to the standard models in the absence and presence of attacks.

Title: A Unified Framework to Super-Resolve Face Images of Varied Low Resolutions. (arXiv:2306.03380v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03380
Code URL: null
Copy Paste: [[2306.03380] A Unified Framework to Super-Resolve Face Images of Varied Low Resolutions](http://arxiv.org/abs/2306.03380) #robust
Summary:
Existing face image super-resolution (FSR) algorithms usually train a specific model for a specific low input resolution for optimal results. By contrast, we explore in this work a unified framework that is trained once and then used to super-resolve input face images of varied low resolutions. For that purpose, we propose a novel neural network architecture that is composed of three anchor auto-encoders, one feature weight regressor and a final image decoder. The three anchor auto-encoders are meant for optimal FSR at three pre-defined low input resolutions, called anchor resolutions. An input face image of an arbitrary low resolution is first up-scaled to the target resolution by bi-cubic interpolation and then fed to the three auto-encoders in parallel. The three encoded anchor features are then fused with weights determined by the feature weight regressor. Finally, the fused feature is sent to the final image decoder to derive the super-resolution result. As shown by experiments, the proposed algorithm achieves robust and state-of-the-art performance over a wide range of low input resolutions with a single framework. Code and models will be made available after the publication of this work.
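The fusion step described above can be pictured as a weighted sum of the three anchor features. The sketch below is a plain-Python illustration under that assumption; the abstract does not state the exact fusion operator, and the name `fuse_features` and the flat-list feature representation are hypothetical.

```python
def fuse_features(anchor_features, weights):
    """Fuse equally sized feature vectors by a weighted sum.

    `anchor_features` is a list of feature vectors (one per anchor
    auto-encoder); `weights` holds one scalar per vector, as a stand-in
    for the output of the feature weight regressor.
    """
    if len(anchor_features) != len(weights):
        raise ValueError("need exactly one weight per anchor feature")
    length = len(anchor_features[0])
    if any(len(f) != length for f in anchor_features):
        raise ValueError("all anchor features must have the same size")
    return [
        sum(w * f[i] for w, f in zip(weights, anchor_features))
        for i in range(length)
    ]

# Three toy anchor features of size 2, with weights that sum to 1.
fused = fuse_features([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, 0.25, 0.25])
```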

Title: Revisiting the Trade-off between Accuracy and Robustness via Weight Distribution of Filters. (arXiv:2306.03430v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03430
Code URL: null
Copy Paste: [[2306.03430] Revisiting the Trade-off between Accuracy and Robustness via Weight Distribution of Filters](http://arxiv.org/abs/2306.03430) #robust
Summary:
Adversarial attacks have been proven to be potential threats to Deep Neural Networks (DNNs), and many methods have been proposed to defend against them. However, enhancing robustness causes the clean accuracy to decline to a certain extent, implying that a trade-off exists between accuracy and robustness. In this paper, we first empirically find an obvious distinction between standard and robust models in the filters' weight distribution of the same architecture, and then theoretically explain this phenomenon in terms of gradient regularization, which shows that this difference is an intrinsic property of DNNs, and thus it is difficult for a static network architecture to improve accuracy and robustness at the same time. Second, based on this observation, we propose a sample-wise dynamic network architecture named Adversarial Weight-Varied Network (AW-Net), which focuses on dealing with clean and adversarial examples with a "divide and rule" weight strategy. The AW-Net dynamically adjusts the network's weights based on regulation signals generated by an adversarial detector, which is directly influenced by the input sample. Benefiting from the dynamic network architecture, clean and adversarial examples can be processed with different network weights, which provides the potential to enhance accuracy and robustness simultaneously. A series of experiments demonstrate that our AW-Net is architecture-friendly to handle both clean and adversarial examples and can achieve better trade-off performance than state-of-the-art robust models.

Title: Explaining and Adapting Graph Conditional Shift. (arXiv:2306.03256v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03256
Code URL: null
Copy Paste: [[2306.03256] Explaining and Adapting Graph Conditional Shift](http://arxiv.org/abs/2306.03256) #robust
Summary:
Graph Neural Networks (GNNs) have shown remarkable performance on graph-structured data. However, recent empirical studies suggest that GNNs are very susceptible to distribution shift. There is still significant ambiguity about why graph-based models seem more vulnerable to these shifts. In this work we provide a thorough theoretical analysis on it by quantifying the magnitude of conditional shift between the input features and the output label. Our findings show that both graph heterophily and model architecture exacerbate conditional shifts, leading to performance degradation. To address this, we propose an approach that involves estimating and minimizing the conditional shift for unsupervised domain adaptation on graphs. In our controlled synthetic experiments, our algorithm demonstrates robustness towards distribution shift, resulting in up to 10% absolute ROC AUC improvement versus the second-best algorithm. Furthermore, comprehensive experiments on both node classification and graph classification show its robust performance under various distribution shifts.

Title: Survival Instinct in Offline Reinforcement Learning. (arXiv:2306.03286v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03286
Code URL: null
Copy Paste: [[2306.03286] Survival Instinct in Offline Reinforcement Learning](http://arxiv.org/abs/2306.03286) #robust
Summary:
We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain bias implicit in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. Formally, given a reward class -- which may not even contain the true reward -- we identify conditions on the training data distribution that enable offline RL to learn a near-optimal and safe policy from any reward within the class. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is "nudged" to learn a desirable behavior with imperfect reward but purposely biased data coverage.

Title: On Pitfalls of Test-Time Adaptation. (arXiv:2306.03536v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03536
Code URL: https://github.com/lins-lab/ttab
Copy Paste: [[2306.03536] On Pitfalls of Test-Time Adaptation](http://arxiv.org/abs/2306.03536) #robust
Summary:
Test-Time Adaptation (TTA) has recently emerged as a promising approach for tackling the robustness challenge under distribution shifts. However, the lack of consistent settings and systematic studies in prior literature hinders thorough assessments of existing methods. To address this issue, we present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols. Through extensive experiments, our benchmark reveals three common pitfalls in prior efforts. First, selecting appropriate hyper-parameters, especially for model selection, is exceedingly difficult due to online batch dependency. Second, the effectiveness of TTA varies greatly depending on the quality and properties of the model being adapted. Third, even under optimal algorithmic conditions, none of the existing methods are capable of addressing all common types of distribution shifts. Our findings underscore the need for future research in the field to conduct rigorous evaluations on a broader set of models and shifts, and to re-examine the assumptions behind the empirical success of TTA. Our code is available at https://github.com/lins-lab/ttab.

Title: Zero-shot Preference Learning for Offline RL via Optimal Transport. (arXiv:2306.03615v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03615
Code URL: null
Copy Paste: [[2306.03615] Zero-shot Preference Learning for Offline RL via Optimal Transport](http://arxiv.org/abs/2306.03615) #robust
Summary:
Preference-based Reinforcement Learning (PbRL) has demonstrated remarkable efficacy in aligning rewards with human intentions. However, a significant challenge lies in the need for substantial human labels, which are costly and time-consuming to obtain. Additionally, the expensive preference data obtained from prior tasks is not typically reusable for subsequent task learning, leading to extensive labeling for each new task. In this paper, we propose a novel zero-shot preference-based RL algorithm that leverages labeled preference data from source tasks to infer labels for target tasks, eliminating the requirement for human queries. Our approach utilizes Gromov-Wasserstein distance to align trajectory distributions between source and target tasks. The solved optimal transport matrix serves as a correspondence between trajectories of two tasks, making it possible to identify corresponding trajectory pairs between tasks and transfer the preference labels. However, learning directly from inferred labels that contain a fraction of noisy labels will result in an inaccurate reward function, subsequently affecting policy performance. To this end, we introduce Robust Preference Transformer, which models the rewards as Gaussian distributions and incorporates reward uncertainty in addition to reward mean. The empirical results on robotic manipulation tasks of Meta-World and Robomimic show that our method has strong capabilities of transferring preferences between tasks and learns reward functions from noisy labels robustly. Furthermore, we reveal that our method attains near-oracle performance with a small proportion of scripted labels.

biometric

steal

extraction

Title: ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images. (arXiv:2306.03287v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03287
Code URL: null
Copy Paste: [[2306.03287] ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images](http://arxiv.org/abs/2306.03287) #extraction
Summary:
Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extraction from Visually-Rich Document images (SVRD). We set up two tracks for SVRD including Track 1: HUST-CELL and Track 2: Baidu-FEST, where HUST-CELL aims to evaluate the end-to-end performance of Complex Entity Linking and Labeling, and Baidu-FEST focuses on evaluating the performance and generalization of Zero-shot / Few-shot Structured Text extraction from an end-to-end perspective. Compared to the current document benchmarks, our two competition tracks greatly enrich the scenarios and contain more than 50 types of visually-rich document images (mainly from actual enterprise applications). The competition opened on 30th December, 2022 and closed on 24th March, 2023. There were 35 participants and 91 valid submissions for Track 1, and 15 participants and 26 valid submissions for Track 2. In this report we present the motivation, competition datasets, task definition, evaluation protocol, and submission summaries. According to the performance of the submissions, we believe there is still a large gap to the expected information extraction performance for complex and zero-shot scenarios. It is hoped that this competition will attract many researchers in the field of CV and NLP, and bring some new thoughts to the field of Document AI.

Title: Joint Event Extraction via Structural Semantic Matching. (arXiv:2306.03469v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03469
Code URL: null
Copy Paste: [[2306.03469] Joint Event Extraction via Structural Semantic Matching](http://arxiv.org/abs/2306.03469) #extraction
Summary:
Event Extraction (EE) is one of the essential tasks in information extraction, which aims to detect event mentions from text and find the corresponding argument roles. The EE task can be abstracted as a process of matching the semantic definitions and argument structures of event types with the target text. This paper encodes the semantic features of event types and makes structural matching with target text. Specifically, Semantic Type Embedding (STE) and Dynamic Structure Encoder (DSE) modules are proposed. Also, the Joint Structural Semantic Matching (JSSM) model is built to jointly perform event detection and argument extraction tasks through a bidirectional attention layer. The experimental results on the ACE2005 dataset indicate that our model achieves a significant performance improvement

Title: Dance Generation by Sound Symbolic Words. (arXiv:2306.03646v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03646
Code URL: null
Copy Paste: [[2306.03646] Dance Generation by Sound Symbolic Words](http://arxiv.org/abs/2306.03646) #extraction
Summary:
This study introduces a novel approach to generate dance motions using onomatopoeia as input, with the aim of enhancing creativity and diversity in dance generation. Unlike text and music, onomatopoeia conveys rhythm and meaning through abstract word expressions without constraints on expression and without need for specialized knowledge. We adapt the AI Choreographer framework and employ the Sakamoto system, a feature extraction method for onomatopoeia focusing on phonemes and syllables. Additionally, we present a new dataset of 40 onomatopoeia-dance motion pairs collected through a user survey. Our results demonstrate that the proposed method enables more intuitive dance generation and can create dance motions using sound-symbolic words from a variety of languages, including those without onomatopoeia. This highlights the potential for diverse dance creation across different languages and cultures, accessible to a wider audience. Qualitative samples from our model can be found at: https://sites.google.com/view/onomatopoeia-dance/home/.

membership infer

federate

Title: Confidence-based federated distillation for vision-based lane-centering. (arXiv:2306.03222v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03222
Code URL: null
Copy Paste: [[2306.03222] Confidence-based federated distillation for vision-based lane-centering](http://arxiv.org/abs/2306.03222) #federate
Summary:
A fundamental challenge of autonomous driving is maintaining the vehicle in the center of the lane by adjusting the steering angle. Recent advances leverage deep neural networks to predict steering decisions directly from images captured by the car cameras. Machine learning-based steering angle prediction needs to consider the vehicle's limitation in uploading large amounts of potentially private data for model training. Federated learning can address these constraints by enabling multiple vehicles to collaboratively train a global model without sharing their private data, but it is difficult to achieve good accuracy as the data distribution is often non-i.i.d. across the vehicles. This paper presents a new confidence-based federated distillation method to improve the performance of federated learning for steering angle prediction. Specifically, it proposes the novel use of entropy to determine the predictive confidence of each local model, and then selects the most confident local model as the teacher to guide the learning of the global model. A comprehensive evaluation of vision-based lane centering shows that the proposed approach can outperform FedAvg and FedDF by 11.3% and 9%, respectively.
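The entropy-based teacher selection described above can be illustrated with a minimal sketch. It assumes that each local model outputs a probability distribution over discretized steering-angle bins for the same input; the binning, the function names, and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_teacher(local_predictions):
    """Return the index of the local model with the lowest predictive entropy.

    local_predictions: one probability vector per local model, all computed
    on the same input sample (assumed setup, for illustration only).
    """
    entropies = [entropy(p) for p in local_predictions]
    return min(range(len(entropies)), key=entropies.__getitem__)


# Hypothetical predictions of three local models over three steering bins.
predictions = [
    [0.40, 0.35, 0.25],  # uncertain model
    [0.90, 0.05, 0.05],  # confident model
    [0.60, 0.30, 0.10],
]
teacher = select_teacher(predictions)
```

Lower entropy is treated here as higher confidence, so the second model would be chosen as the teacher for this sample.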

Title: Improving Accelerated Federated Learning with Compression and Importance Sampling. (arXiv:2306.03240v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03240
Code URL: null
Copy Paste: [[2306.03240] Improving Accelerated Federated Learning with Compression and Importance Sampling](http://arxiv.org/abs/2306.03240) #federate
Summary:
Federated Learning is a collaborative training framework that leverages heterogeneous data distributed across a vast number of clients. Since it is practically infeasible to request and process all clients during the aggregation step, partial participation must be supported. In this setting, the communication between the server and clients poses a major bottleneck. To reduce communication loads, there are two main approaches: compression and local steps. Recent work by Mishchenko et al. [2022] introduced the new ProxSkip method, which achieves an accelerated rate using the local steps technique. Follow-up works successfully combined local steps acceleration with partial participation [Grudzień et al., 2023; Condat et al., 2023] and gradient compression [Condat et al., 2022]. In this paper, we finally present a complete method for Federated Learning that incorporates all necessary ingredients: Local Training, Compression, and Partial Participation. We obtain state-of-the-art convergence guarantees in the considered setting. Moreover, we analyze the general sampling framework for partial participation and derive an importance sampling scheme, which leads to even better performance. We experimentally demonstrate the advantages of the proposed method in practice.

Title: A Lightweight Method for Tackling Unknown Participation Probabilities in Federated Averaging. (arXiv:2306.03401v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03401
Code URL: null
Copy Paste: [[2306.03401] A Lightweight Method for Tackling Unknown Participation Probabilities in Federated Averaging](http://arxiv.org/abs/2306.03401) #federate
Summary:
In federated learning (FL), clients usually have diverse participation probabilities that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works aiming at addressing this problem are usually based on global variance reduction, which requires a substantial amount of additional memory in a multiplicative factor equal to the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation probabilities, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need of finding optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation probabilities are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the probabilities of client participation. We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods.
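To make the idea of history-based aggregation weights concrete, here is a minimal sketch. It is not the FedAU algorithm itself: it simply assumes that a client's weight is the inverse of its empirical participation frequency, capped at a hypothetical `max_weight`, and that the server adds the weighted client updates averaged over the total number of clients. All names and numbers are illustrative assumptions.

```python
def aggregation_weights(history, max_weight=10.0):
    """Estimate one aggregation weight per client from its participation history.

    history: dict mapping client id -> list of 0/1 flags, one per past round.
    The weight is rounds / participations (inverse empirical frequency),
    capped at max_weight. Both choices are assumptions of this sketch.
    """
    weights = {}
    for client, flags in history.items():
        joined = sum(flags)
        if joined == 0:
            weights[client] = max_weight  # never seen so far: use the cap
        else:
            weights[client] = min(len(flags) / joined, max_weight)
    return weights


def aggregate(global_model, updates, weights, num_clients):
    """Apply weighted updates from the clients that participated this round."""
    new_model = list(global_model)
    for client, delta in updates.items():
        for i, d in enumerate(delta):
            new_model[i] += weights[client] * d / num_clients
    return new_model


# Hypothetical history: "a" always joins, "b" joins rarely, "c" never joined.
history = {"a": [1, 1, 1, 1], "b": [1, 0, 0, 0], "c": [0, 0, 0, 0]}
weights = aggregation_weights(history)
model = aggregate([0.0, 0.0], {"a": [1.0, 2.0], "b": [1.0, 0.0]}, weights, 3)
```

Rarely participating clients receive larger weights, which compensates for their under-representation across rounds; the cap keeps a single rare client from dominating an update.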

Title: Masked Autoencoders are Efficient Continual Federated Learners. (arXiv:2306.03542v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03542
Code URL: https://github.com/ml-research/confedmade
Copy Paste: [[2306.03542] Masked Autoencoders are Efficient Continual Federated Learners](http://arxiv.org/abs/2306.03542) #federate
Summary:
Machine learning is typically framed from a perspective of i.i.d., and more importantly, isolated data. In parts, federated learning lifts this assumption, as it sets out to solve the real-world challenge of collaboratively learning a shared model from data distributed across clients. However, motivated primarily by privacy and computational constraints, the fact that data may change, distributions drift, or even tasks advance individually on clients, is seldom taken into account. The field of continual learning addresses this separate challenge and first steps have recently been taken to leverage synergies in distributed supervised settings, in which several clients learn to solve changing classification tasks over time without forgetting previously seen ones. Motivated by these prior works, we posit that such federated continual learning should be grounded in unsupervised learning of representations that are shared across clients; in the loose spirit of how humans can indirectly leverage others' experience without exposure to a specific task. For this purpose, we demonstrate that masked autoencoders for distribution estimation are particularly amenable to this setup. Specifically, their masking strategy can be seamlessly integrated with task attention mechanisms to enable selective knowledge transfer between clients. We empirically corroborate the latter statement through several continual federated scenarios on both image and binary datasets.

Title: Personalization Disentanglement for Federated Learning. (arXiv:2306.03570v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03570
Code URL: null
Copy Paste: [[2306.03570] Personalization Disentanglement for Federated Learning](http://arxiv.org/abs/2306.03570) #federate
Summary:
Personalized federated learning (PFL) jointly trains a variety of local models by balancing knowledge sharing across clients against model personalization per client. This paper addresses PFL by explicitly disentangling latent representations into two parts that capture the shared knowledge and the client-specific personalization, which leads to more reliable and effective PFL. The disentanglement is achieved by a novel Federated Dual Variational Autoencoder (FedDVA), which employs two encoders to infer the two types of representations. FedDVA can produce a better understanding of the trade-off between global knowledge sharing and local personalization in PFL. Moreover, it can be integrated with existing FL methods and turn them into personalized models for heterogeneous downstream tasks. Extensive experiments validate the advantages brought by disentanglement and show that models trained with disentangled representations substantially outperform vanilla methods.

Title: Avoid Adversarial Adaption in Federated Learning by Multi-Metric Investigations. (arXiv:2306.03600v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03600
Code URL: null
Copy Paste: [[2306.03600] Avoid Adversarial Adaption in Federated Learning by Multi-Metric Investigations](http://arxiv.org/abs/2306.03600) #federate
Summary:
Federated Learning (FL) trains machine learning models on data distributed across multiple devices, avoiding data transfer to a central location. This improves privacy, reduces communication costs, and enhances model performance. However, FL is prone to poisoning attacks, which can be untargeted aiming to reduce the model performance, or targeted, so-called backdoors, which add adversarial behavior that can be triggered with appropriately crafted inputs. Striving for stealthiness, backdoor attacks are harder to deal with.

Mitigation techniques against poisoning attacks rely on monitoring certain metrics and filtering malicious model updates. However, previous works did not consider real-world adversaries and data distributions. To support our statement, we define a new notion of strong adaptive adversaries that can simultaneously adapt to multiple objectives and demonstrate through extensive tests that existing defense methods can be circumvented in this adversary model. We also demonstrate that existing defenses have limited effectiveness when no assumptions are made about underlying data distributions.

To address realistic scenarios and adversary models, we propose Metric-Cascades (MESAS), a new defense that leverages multiple detection metrics simultaneously for the filtering of poisoned model updates. This approach forces adaptive attackers into a heavy multi-objective optimization problem, and our evaluation with nine backdoors and three datasets shows that even our strong adaptive attacker cannot evade MESAS's detection. We show that MESAS outperforms existing defenses in distinguishing backdoors from distortions originating from different data distributions within and across the clients. Overall, MESAS is the first defense that is robust against strong adaptive adversaries and is effective in real-world data scenarios while introducing a low overhead of 24.37s on average.

fair

Title: Fair Patient Model: Mitigating Bias in the Patient Representation Learned from the Electronic Health Records. (arXiv:2306.03179v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03179
Code URL: null
Copy Paste: [[2306.03179] Fair Patient Model: Mitigating Bias in the Patient Representation Learned from the Electronic Health Records](http://arxiv.org/abs/2306.03179) #fair
Summary:
Objective: To pre-train fair and unbiased patient representations from Electronic Health Records (EHRs) using a novel weighted loss function that reduces bias and improves fairness in deep representation learning models.

Methods: We defined a new loss function, called weighted loss function, in the deep representation learning model to balance the importance of different groups of patients and features. We applied the proposed model, called Fair Patient Model (FPM), to a sample of 34,739 patients from the MIMIC-III dataset and learned patient representations for four clinical outcome prediction tasks.

Results: FPM outperformed the baseline models in terms of three fairness metrics: demographic parity, equality of opportunity difference, and equalized odds ratio. FPM also achieved comparable predictive performance with the baselines, with an average accuracy of 0.7912. Feature analysis revealed that FPM captured more information from clinical features than the baselines.

Conclusion: FPM is a novel method to pre-train fair and unbiased patient representations from EHR data using a weighted loss function. The learned representations can be used for various downstream tasks in healthcare and can be extended to other domains where bias and fairness are important.
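Two of the fairness metrics named in the results can be sketched for a binary classifier and a binary protected attribute. This is an illustrative computation under those assumptions, with hypothetical labels and predictions; it is not the evaluation code used for FPM.

```python
def rate(values):
    """Mean of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rate between group 0 and group 1."""
    rate_0 = rate([p for p, g in zip(y_pred, group) if g == 0])
    rate_1 = rate([p for p, g in zip(y_pred, group) if g == 1])
    return abs(rate_0 - rate_1)


def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute gap in true-positive rate between group 0 and group 1."""
    tpr_0 = rate([p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == 1])
    tpr_1 = rate([p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == 1])
    return abs(tpr_0 - tpr_1)


# Hypothetical outcomes for eight patients, four in each group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

dpd = demographic_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
```

In this toy data, group 0 receives positive predictions for 3 of 4 patients and group 1 for 1 of 4, and the true-positive rates are 1.0 and 0.5, so both gaps are 0.5. A value of 0 would indicate parity on the respective metric.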

interpretability

Title: Efficient and Interpretable Compressive Text Summarisation with Unsupervised Dual-Agent Reinforcement Learning. (arXiv:2306.03415v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03415
Code URL: https://github.com/peggypytang/urlcomsum
Copy Paste: [[2306.03415] Efficient and Interpretable Compressive Text Summarisation with Unsupervised Dual-Agent Reinforcement Learning](http://arxiv.org/abs/2306.03415) #interpretability
Summary:
Compressive text summarisation offers a balance between the conciseness issue of extractive summarisation and the factual hallucination issue of abstractive summarisation. However, most existing compressive summarisation methods are supervised, relying on the expensive effort of creating a new training dataset with corresponding compressive summaries. In this paper, we propose an efficient and interpretable compressive summarisation method that utilises unsupervised dual-agent reinforcement learning to optimise a summary's semantic coverage and fluency by simulating human judgment on summarisation quality. Our model consists of an extractor agent and a compressor agent, and both agents have a multi-head attentional pointer-based structure. The extractor agent first chooses salient sentences from a document, and then the compressor agent compresses these extracted sentences by selecting salient words to form a summary without using reference summaries to compute the summary reward. To the best of our knowledge, this is the first work on unsupervised compressive summarisation. Experimental results on three widely used datasets (Newsroom, CNN/DM, and XSum) show that our model achieves promising performance and a significant improvement on Newsroom in terms of the ROUGE metric, as well as interpretability of semantic coverage of summarisation results.

explainability

Title: Expanding Explainability Horizons: A Unified Concept-Based System for Local, Global, and Misclassification Explanations. (arXiv:2306.03531v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03531
Code URL: null
Copy Paste: [[2306.03531] Expanding Explainability Horizons: A Unified Concept-Based System for Local, Global, and Misclassification Explanations](http://arxiv.org/abs/2306.03531) #explainability
Summary:
Explainability of intelligent models has been garnering increasing attention in recent years. Of the various explainability approaches, concept-based techniques are notable for utilizing a set of human-meaningful concepts instead of focusing on individual pixels. However, there is a scarcity of methods that consistently provide both local and global explanations. Moreover, most methods offer no way to explain misclassification cases. To address these challenges, our study follows a straightforward yet effective approach. We propose a unified concept-based system, which inputs a number of super-pixelated images into the networks, allowing them to learn better representations of the target's objects as well as the target's concepts. This method automatically learns, scores, and extracts local and global concepts. Our experiments revealed that, in addition to enhancing performance, the models could provide deeper insights into predictions and elucidate false classifications.

Title: $\textit{WHAT}$, $\textit{WHEN}$, and $\textit{HOW}$ to Ground: Designing User Persona-Aware Conversational Agents for Engaging Dialogue. (arXiv:2306.03361v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03361
Code URL: null
Copy Paste: [[2306.03361] $\textit{WHAT}$, $\textit{WHEN}$, and $\textit{HOW}$ to Ground: Designing User Persona-Aware Conversational Agents for Engaging Dialogue](http://arxiv.org/abs/2306.03361) #explainability
Summary:
This paper presents a method for building a personalized open-domain dialogue system to address the $\textit{WWH}$ ($\textit{WHAT}$, $\textit{WHEN}$, and $\textit{HOW}$) problem for natural response generation in a commercial setting, where personalized dialogue responses are heavily interleaved with casual response turns. The proposed approach involves weighted dataset blending, negative persona information augmentation methods, and the design of personalized conversation datasets to address the challenges of $\textit{WWH}$ in personalized, open-domain dialogue systems. Our work effectively balances dialogue fluency and tendency to ground, while also introducing a response-type label to improve the controllability and explainability of the grounded responses. The combination of these methods leads to more fluent conversations, as evidenced by subjective human evaluations as well as objective evaluations.

watermark

diffusion

Title: DreamSparse: Escaping from Plato's Cave with 2D Diffusion Model Given Sparse Views. (arXiv:2306.03414v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03414
Code URL: null
Copy Paste: [[2306.03414] DreamSparse: Escaping from Plato's Cave with 2D Diffusion Model Given Sparse Views](http://arxiv.org/abs/2306.03414) #diffusion
Summary:
Synthesizing novel view images from a few views is a challenging but practical problem. Existing methods often struggle with producing high-quality results or necessitate per-object optimization in such few-view settings due to the insufficient information provided. In this work, we explore leveraging the strong 2D priors in pre-trained diffusion models for synthesizing novel view images. 2D diffusion models, nevertheless, lack 3D awareness, leading to distorted image synthesis and compromising the identity. To address these problems, we propose DreamSparse, a framework that enables the frozen pre-trained diffusion model to generate geometry and identity-consistent novel view image. Specifically, DreamSparse incorporates a geometry module designed to capture 3D features from sparse views as a 3D prior. Subsequently, a spatial guidance model is introduced to convert these 3D feature maps into spatial information for the generative process. This information is then used to guide the pre-trained diffusion model, enabling it to generate geometrically consistent images without tuning it. Leveraging the strong image priors in the pre-trained diffusion models, DreamSparse is capable of synthesizing high-quality novel views for both object and scene-level images and generalising to open-set images. Experimental results demonstrate that our framework can effectively synthesize novel view images from sparse views and outperforms baselines in both trained and open-set category images. More results can be found on our project page: https://sites.google.com/view/dreamsparse-webpage.

Title: Change Diffusion: Change Detection Map Generation Based on Difference-Feature Guided DDPM. (arXiv:2306.03424v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03424
Code URL: null
Copy Paste: [[2306.03424] Change Diffusion: Change Detection Map Generation Based on Difference-Feature Guided DDPM](http://arxiv.org/abs/2306.03424) #diffusion
Summary:
Deep learning (DL) approaches based on pure CNN or Transformer networks have demonstrated promising results in bitemporal change detection (CD). However, their performance is limited by insufficient contextual information aggregation, as they struggle to fully capture the implicit contextual dependency relationships among feature maps at different levels. Additionally, researchers have utilized pre-trained denoising diffusion probabilistic models (DDPMs) for training lightweight CD classifiers. Nevertheless, training a DDPM to generate intricately detailed, multi-channel remote sensing images requires months of training time and a substantial volume of unlabeled remote sensing datasets, making it significantly more complex than generating a single-channel change map. To overcome these challenges, we propose a novel end-to-end DDPM-based model architecture called change-aware diffusion model (CADM), which can be trained quickly using a limited annotated dataset. Furthermore, we introduce dynamic difference conditional encoding to enhance step-wise regional attention in DDPM for bitemporal images in CD datasets. This method establishes state-adaptive conditions for each sampling step, emphasizing two main innovative points of our model: 1) its end-to-end nature and 2) difference conditional encoding. We evaluate CADM on four remote sensing CD tasks with different ground scenarios, including CDD, WHU, Levier, and GVLM. Experimental results demonstrate that CADM significantly outperforms state-of-the-art methods, indicating the generalization and effectiveness of the proposed model.

Title: DFormer: Diffusion-guided Transformer for Universal Image Segmentation. (arXiv:2306.03437v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03437
Code URL: null
Copy Paste: [[2306.03437] DFormer: Diffusion-guided Transformer for Universal Image Segmentation](http://arxiv.org/abs/2306.03437) #diffusion
Summary:
This paper introduces an approach, named DFormer, for universal image segmentation. The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model. DFormer first adds various levels of Gaussian noise to ground-truth masks, and then learns a model to predict denoised masks from the corrupted masks. Specifically, we take deep pixel-level features along with the noisy masks as inputs to generate mask features and attention masks, employing a diffusion-based decoder to perform mask prediction gradually. At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks. Extensive experiments reveal the merits of our proposed contributions on different image segmentation tasks: panoptic segmentation, instance segmentation, and semantic segmentation. Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val2017 set. Further, DFormer achieves promising semantic segmentation performance, outperforming the recent diffusion-based method by 2.2% on ADE20K val set. Our source code and models will be made publicly available at https://github.com/cp3wan/DFormer
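The step of adding Gaussian noise to ground-truth masks can be sketched with the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear beta schedule, the rescaling of the binary mask to [-1, 1], and all names below are assumptions made for illustration; the abstract does not specify DFormer's actual schedule or mask encoding.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s < t under a linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_mask(mask, t, rng):
    """Corrupt a binary mask (nested lists of 0/1) to noise level t.

    The mask is first rescaled from {0, 1} to {-1, 1} (an assumption of this
    sketch), then mixed with standard Gaussian noise.
    """
    a = alpha_bar(t)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [
        [signal * (2.0 * v - 1.0) + noise * rng.gauss(0.0, 1.0) for v in row]
        for row in mask
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mask = [[0, 1, 1], [0, 0, 1]]
slightly_noisy = noise_mask(mask, t=50, rng=rng)
very_noisy = noise_mask(mask, t=900, rng=rng)
```

At small t the corrupted mask stays close to the ground truth, while at large t it approaches pure Gaussian noise, which is consistent with starting inference from randomly generated masks.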

Title: Optimizing Sampling Patterns for Compressed Sensing MRI with Diffusion Generative Models. (arXiv:2306.03284v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03284
Code URL: https://github.com/sriram-ravula/mri_sampling_diffusion
Copy Paste: [[2306.03284] Optimizing Sampling Patterns for Compressed Sensing MRI with Diffusion Generative Models](http://arxiv.org/abs/2306.03284) #diffusion
Summary:
Diffusion-based generative models have been used as powerful priors for magnetic resonance imaging (MRI) reconstruction. We present a learning method to optimize sub-sampling patterns for compressed sensing multi-coil MRI that leverages pre-trained diffusion generative models. Crucially, during training we use a single-step reconstruction based on the posterior mean estimate given by the diffusion model and the MRI measurement process. Experiments across varying anatomies, acceleration factors, and pattern types show that sampling operators learned with our method lead to competitive, and in the case of 2D patterns, improved reconstructions compared to baseline patterns. Our method requires as few as five training images to learn effective sampling patterns.

Title: Logic Diffusion for Knowledge Graph Reasoning. (arXiv:2306.03515v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03515
Code URL: null
Copy Paste: [[2306.03515] Logic Diffusion for Knowledge Graph Reasoning](http://arxiv.org/abs/2306.03515) #diffusion
Summary:
Most recent works focus on answering first order logical queries to explore the knowledge graph reasoning via multi-hop logic predictions. However, existing reasoning models are limited by the circumscribed logical paradigms of training samples, which leads to weak generalization to unseen logic. To address these issues, we propose a plug-in module called Logic Diffusion (LoD) that discovers unseen queries from surroundings and achieves a dynamic equilibrium between different kinds of patterns. The basic idea of LoD is relation diffusion and sampling of sub-logic by random walking, together with a special training mechanism called gradient adaption. Besides, LoD is accompanied by a novel loss function to further achieve robust logical diffusion when facing noisy data in training or testing sets. Extensive experiments on four public datasets demonstrate that mainstream knowledge graph reasoning models equipped with LoD outperform the state of the art. Moreover, our ablation study proves the general effectiveness of LoD on the noise-rich knowledge graph.

Title: Machine learning in and out of equilibrium. (arXiv:2306.03521v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03521
Code URL: https://github.com/hincz-lab/ml-nonequilibrium
Copy Paste: [[2306.03521] Machine learning in and out of equilibrium](http://arxiv.org/abs/2306.03521) #diffusion
Summary:
The algorithms used to train neural networks, like stochastic gradient descent (SGD), have close parallels to natural processes that navigate a high-dimensional parameter space -- for example protein folding or evolution. Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels in a single, unified framework. We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium, exhibiting persistent currents in the space of network parameters. As in its physical analogues, the current is associated with an entropy production rate for any given training trajectory. The stationary distribution of these rates obeys the integral and detailed fluctuation theorems -- nonequilibrium generalizations of the second law of thermodynamics. We validate these relations in two numerical examples, a nonlinear regression network and MNIST digit classification. While the fluctuation theorems are universal, there are other aspects of the stationary state that are highly sensitive to the training details. Surprisingly, the effective loss landscape and diffusion matrix that determine the shape of the stationary distribution vary depending on the simple choice of minibatching done with or without replacement. We can take advantage of this nonequilibrium sensitivity to engineer an equilibrium stationary state for a particular application: sampling from a posterior distribution of network weights in Bayesian machine learning. We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without replacement minibatching. In an example system where the posterior is exactly known, this SGWORLD algorithm outperforms SGLD, converging to the posterior orders of magnitude faster as a function of the learning rate.
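To make the sampling setup concrete, here is a minimal sketch of plain stochastic gradient Langevin dynamics in which minibatches are drawn without replacement (the data are reshuffled once per epoch). It is not the paper's SGWORLD algorithm, whose details are not given in the abstract. The toy problem (inferring a Gaussian mean with unit observation noise and a zero-mean Gaussian prior) and all names are assumptions chosen because the posterior is known exactly.

```python
import random


def sgld_without_replacement(data, steps, step_size, batch_size, prior_var=100.0, seed=0):
    """Sample the mean of unit-variance Gaussian data with SGLD.

    Minibatches come from a shuffled ordering that is consumed without
    replacement and reshuffled when exhausted.
    """
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    order = []
    samples = []
    for _ in range(steps):
        if len(order) < batch_size:
            order = list(range(n))
            rng.shuffle(order)
        batch = [data[order.pop()] for _ in range(batch_size)]
        # Stochastic gradient of the log posterior: prior term plus rescaled likelihood term.
        grad = -theta / prior_var + (n / batch_size) * sum(x - theta for x in batch)
        # Langevin update: half-step along the gradient plus Gaussian noise of variance step_size.
        theta += 0.5 * step_size * grad + rng.gauss(0.0, step_size ** 0.5)
        samples.append(theta)
    return samples


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
exact_mean = sum(data) / (len(data) + 1.0 / 100.0)  # closed-form posterior mean

samples = sgld_without_replacement(data, steps=5000, step_size=1e-3, batch_size=10)
kept = samples[1000:]                                # discard burn-in
estimate = sum(kept) / len(kept)
```

Comparing `estimate` with `exact_mean` gives a quick sanity check that the chain targets the right posterior; replacing the shuffled ordering with independent random draws turns this into the with-replacement variant the abstract contrasts it with.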

noise learning

data-free

transformer

Title: PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction. (arXiv:2306.03374v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03374
Code URL: null
Copy Paste: [[2306.03374] PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction](http://arxiv.org/abs/2306.03374) #transformer
Summary:
Multi-person motion prediction is a challenging task, especially for real-world scenarios of densely interacted persons. Most previous works have been devoted to studying the case of weak interactions (e.g., hand-shaking), which typically forecast each human pose in isolation. In this paper, we focus on motion prediction for multiple persons with extreme collaborations and attempt to explore the relationships between the highly interactive persons' motion trajectories. Specifically, a novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between the two pose sequences tailored for this situation. Additionally, we introduce and build a proxy entity to bridge the involved persons, which cooperates with our proposed XQA module and subtly controls the bidirectional information flows, acting as a motion intermediary. We then adapt these designs to a Transformer-based architecture and devise a simple yet effective end-to-end framework called proxy-bridged game Transformer (PGformer) for multi-person interactive motion prediction. The effectiveness of our method has been evaluated on the challenging ExPI dataset, which involves highly interactive actions. We show that our PGformer consistently outperforms the state-of-the-art methods in both short- and long-term predictions by a large margin. Besides, our approach can also be compatible with the weakly interacted CMU-Mocap and MuPoTS-3D datasets and achieve encouraging results. Our code will become publicly available upon acceptance.

Title: TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision. (arXiv:2306.03377v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03377
Code URL: null
Copy Paste: [[2306.03377] TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision](http://arxiv.org/abs/2306.03377) #transformer
Summary:
End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods heavily rely on Region-of-Interest (RoI) operations to extract local features and complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. Specifically, using query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an Adaptive Global aGgregation (AGG) module to transfer global features into sequential features for reading arbitrarily-shaped texts, which overcomes the sub-optimization problem of RoI operations. Furthermore, potential corpus information is utilized from weak annotations to full labels through mixed supervision, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. Especially on TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method in terms of 1-NED by 13.2%.

Title: SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation. (arXiv:2306.03403v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03403
Code URL: https://github.com/tencentarc/sgat4pass
Copy Paste: [[2306.03403] SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation](http://arxiv.org/abs/2306.03403) #transformer
Summary:
As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of original $360^{\circ}$ data. Therefore, their performance will drop a lot when inputting panoramic images with the 3D disturbance. To be more robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge. Specifically, a spherical geometry-aware framework is proposed for PASS. It includes three modules, i.e., spherical geometry-aware image projection, spherical deformable patch embedding, and a panorama-aware loss, which takes input images with 3D disturbance into account, adds a spherical geometry-aware constraint on the existing deformable patch embedding, and indicates the pixel density of original $360^{\circ}$ data, respectively. Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude. Our code and supplementary material are available at https://github.com/TencentARC/SGAT4PASS.

Title: Deep neural networks architectures from the perspective of manifold learning. (arXiv:2306.03406v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03406
Code URL: null
Copy Paste: [[2306.03406] Deep neural networks architectures from the perspective of manifold learning](http://arxiv.org/abs/2306.03406) #transformer
Summary:
Despite significant advances in the field of deep learning in applications to various areas, an explanation of the learning process of neural network models remains an important open question. The purpose of this paper is a comprehensive comparison and description of neural network architectures in terms of geometry and topology. We focus on the internal representation of neural networks and on the dynamics of changes in the topology and geometry of a data manifold on different layers. In this paper, we use the concepts of topological data analysis (TDA) and persistent homological fractal dimension. We present a wide range of experiments with various datasets and configurations of convolutional neural network (CNN) architectures and Transformers in CV and NLP tasks. Our work is a contribution to the development of the important field of explainable and interpretable AI within the framework of geometrical deep learning.

Title: SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning. (arXiv:2306.03491v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03491
Code URL: https://github.com/zhishenyang/scientific_figure_captioning_dataset
Copy Paste: [[2306.03491] SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning](http://arxiv.org/abs/2306.03491) #transformer
Summary:
In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating figure caption generation helps move model understandings of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific figure captioning as a knowledge-augmented image captioning task in which models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset (Hsu et al., 2021) to SciCap+, which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. Then, we conduct experiments with the M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serve as additional context knowledge, which significantly boosts the automatic standard image caption evaluation scores compared to the figure-only baselines. Human evaluations further reveal the challenges of generating figure captions that are informative to readers. The code and SciCap+ dataset will be publicly available at https://github.com/ZhishenYang/scientific_figure_captioning_dataset

Title: Efficient Anomaly Detection with Budget Annotation Using Semi-Supervised Residual Transformer. (arXiv:2306.03492v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03492
Code URL: null
Copy Paste: [[2306.03492] Efficient Anomaly Detection with Budget Annotation Using Semi-Supervised Residual Transformer](http://arxiv.org/abs/2306.03492) #transformer
Summary:
Anomaly Detection is challenging as usually only the normal samples are seen during training and the detector needs to discover anomalies on-the-fly. The recently proposed deep-learning-based approaches could somehow alleviate the problem but there is still a long way to go in obtaining an industrial-class anomaly detector for real-world applications. On the other hand, in some particular AD tasks, a few anomalous samples are labeled manually for achieving higher accuracy. However, this performance gain is at the cost of considerable annotation efforts, which can be intractable in many practical scenarios.

In this work, the above two problems are addressed in a unified framework. Firstly, inspired by the success of the patch-matching-based AD algorithms, we train a sliding vision transformer over the residuals generated by a novel position-constrained patch-matching. Secondly, the conventional pixel-wise segmentation problem is cast into a block-wise classification problem. Thus the sliding transformer can attain even higher accuracy with much less annotation labor. Thirdly, to further reduce the labeling cost, we propose to label the anomalous regions using only bounding boxes. The unlabeled regions caused by the weak labels are effectively exploited using a highly-customized semi-supervised learning scheme equipped with two novel data augmentation methods. The proposed method outperforms all the state-of-the-art approaches using all the evaluation metrics in both the unsupervised and supervised scenarios. On the popular MVTec-AD dataset, our SemiREST algorithm obtains the Average Precision (AP) of 81.2% in the unsupervised condition and 84.4% AP for supervised anomaly detection. Surprisingly, with the bounding-box-based semi-supervisions, SemiREST still outperforms the SOTA methods with full supervision (83.8% AP) on MVTec-AD.

Title: Human-Object Interaction Prediction in Videos through Gaze Following. (arXiv:2306.03597v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03597
Code URL: https://github.com/nizhf/hoi-prediction-gaze-transformer
Copy Paste: [[2306.03597] Human-Object Interaction Prediction in Videos through Gaze Following](http://arxiv.org/abs/2306.03597) #transformer
Summary:
Understanding the human-object interactions (HOIs) from a video is essential to fully comprehend a visual scene. This line of research has been addressed by detecting HOIs from images and lately from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos. We propose to leverage human gaze information since people often fixate on an object before interacting with it. These gaze features together with the scene contexts and the visual appearances of human-object pairs are fused through a spatio-temporal transformer. To evaluate the model in the HOI anticipation task in a multi-person scenario, we propose a set of person-wise multi-label metrics. Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life and is currently the largest video HOI dataset. Experimental results in the HOI detection task show that our approach improves on the baseline by a large relative margin of 36.3%. Moreover, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer. Our code is publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer.

generative

Title: GaitGCI: Generative Counterfactual Intervention for Gait Recognition. (arXiv:2306.03428v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03428
Code URL: null
Copy Paste: [[2306.03428] GaitGCI: Generative Counterfactual Intervention for Gait Recognition](http://arxiv.org/abs/2306.03428) #generative
Summary:
Gait is one of the most promising biometrics that aims to identify pedestrians from their walking patterns. However, prevailing methods are susceptible to confounders, resulting in the networks hardly focusing on the regions that reflect effective walking patterns. To address this fundamental problem in gait recognition, we propose a Generative Counterfactual Intervention framework, dubbed GaitGCI, consisting of Counterfactual Intervention Learning (CIL) and Diversity-Constrained Dynamic Convolution (DCDC). CIL eliminates the impacts of confounders by maximizing the likelihood difference between factual/counterfactual attention while DCDC adaptively generates sample-wise factual/counterfactual attention to efficiently perceive the sample-wise properties. With matrix decomposition and diversity constraint, DCDC guarantees the model to be efficient and effective. Extensive experiments indicate that proposed GaitGCI: 1) could effectively focus on the discriminative and interpretable regions that reflect gait pattern; 2) is model-agnostic and could be plugged into existing models to improve performance with nearly no extra cost; 3) efficiently achieves state-of-the-art performance on arbitrary scenarios (in-the-lab and in-the-wild).

Title: SDR-GAIN: A High Real-Time Occluded Pedestrian Pose Completion Method for Autonomous Driving. (arXiv:2306.03538v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03538
Code URL: null
Copy Paste: [[2306.03538] SDR-GAIN: A High Real-Time Occluded Pedestrian Pose Completion Method for Autonomous Driving](http://arxiv.org/abs/2306.03538) #generative
Summary:
To mitigate the challenges arising from partial occlusion in human pose keypoint based pedestrian detection methods, we present a novel pedestrian pose keypoint completion method called the separation and dimensionality reduction-based generative adversarial imputation networks (SDR-GAIN). Firstly, we utilize OpenPose to estimate pedestrian poses in images. Then, we isolate the head and torso keypoints of pedestrians with incomplete keypoints due to occlusion or other factors and perform dimensionality reduction to enhance features and further unify feature distribution. Finally, we introduce two generative models based on the generative adversarial networks (GAN) framework, which incorporate Huber loss, residual structure, and L1 regularization to generate missing parts of the incomplete head and torso pose keypoints of partially occluded pedestrians, resulting in pose completion. Our experiments on MS COCO and JAAD datasets demonstrate that SDR-GAIN outperforms the basic GAIN framework, the interpolation methods PCHIP and MAkima, and the machine learning methods k-NN and MissForest on the pose completion task. In addition, the runtime of SDR-GAIN is approximately 0.4ms, displaying high real-time performance and significant application value in the field of autonomous driving.

Title: shs-nlp at RadSum23: Domain-Adaptive Pre-training of Instruction-tuned LLMs for Radiology Report Impression Generation. (arXiv:2306.03264v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03264
Code URL: null
Copy Paste: [[2306.03264] shs-nlp at RadSum23: Domain-Adaptive Pre-training of Instruction-tuned LLMs for Radiology Report Impression Generation](http://arxiv.org/abs/2306.03264) #generative
Summary:
Instruction-tuned generative large language models (LLMs) like ChatGPT and Bloomz possess excellent generalization abilities, but they face limitations in understanding radiology reports, particularly in the task of generating the IMPRESSIONS section from the FINDINGS section. They tend to generate either verbose or incomplete IMPRESSIONS, mainly due to insufficient exposure to medical text data during training. We present a system which leverages large-scale medical text data for domain-adaptive pre-training of instruction-tuned LLMs to enhance their medical knowledge and performance on specific medical tasks. We show that this system performs better in a zero-shot setting than a number of pretrain-and-finetune adaptation methods on the IMPRESSIONS generation task, and ranks 1st among participating systems in Task 1B: Radiology Report Summarization at the BioNLP 2023 workshop.

Title: A Scalable and Adaptive System to Infer the Industry Sectors of Companies: Prompt + Model Tuning of Generative Language Models. (arXiv:2306.03313v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03313
Code URL: null
Copy Paste: [[2306.03313] A Scalable and Adaptive System to Infer the Industry Sectors of Companies: Prompt + Model Tuning of Generative Language Models](http://arxiv.org/abs/2306.03313) #generative
Summary:
Private Equity (PE) firms operate investment funds by acquiring and managing companies to achieve a high return upon selling. Many PE funds are thematic, meaning investment professionals aim to identify trends by covering as many industry sectors as possible, and picking promising companies within these sectors. So, inferring sectors for companies is critical to the success of thematic PE funds. In this work, we standardize the sector framework and discuss the typical challenges; we then introduce our sector inference system addressing these challenges. Specifically, our system is built on a medium-sized generative language model, finetuned with a prompt + model tuning procedure. The deployed model demonstrates performance superior to the common baselines. The system has been serving many PE professionals for over a year, showing great scalability to data volume and adaptability to any change in sector framework and/or annotation.

Title: Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses. (arXiv:2306.03443v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03443
Code URL: null
Copy Paste: [[2306.03443] Alzheimer Disease Classification through ASR-based Transcriptions: Exploring the Impact of Punctuation and Pauses](http://arxiv.org/abs/2306.03443) #generative
Summary:
Alzheimer's Disease (AD) is the world's leading neurodegenerative disease, which often results in communication difficulties. Analysing speech can serve as a diagnostic tool for identifying the condition. The recent ADReSS challenge provided a dataset for AD classification and highlighted the utility of manual transcriptions. In this study, we used the new state-of-the-art Automatic Speech Recognition (ASR) model Whisper to obtain the transcriptions, which also include automatic punctuation. The classification models achieved test accuracy scores of 0.854 and 0.833 combining the pretrained FastText word embeddings and recurrent neural networks on manual and ASR transcripts respectively. Additionally, we explored the influence of including pause information and punctuation in the transcriptions. We found that punctuation only yielded minor improvements in some cases, whereas pause encoding aided AD classification for both manual and ASR transcriptions across all approaches investigated.

Title: Estimating Conditional Mutual Information for Dynamic Feature Selection. (arXiv:2306.03301v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03301
Code URL: https://github.com/suinleelab/dime
Copy Paste: [[2306.03301] Estimating Conditional Mutual Information for Dynamic Feature Selection](http://arxiv.org/abs/2306.03301) #generative
Summary:
Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into the prediction process. The problem is challenging, however, as it requires both making predictions with arbitrary feature sets and learning a policy to identify the most valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is learning this selection policy, and we design a straightforward new modeling approach that estimates the mutual information in a discriminative rather than generative fashion. Building on our learning approach, we introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform costs between features, incorporating prior information, and exploring modern architectures to handle partial input information. We find that our method provides consistent gains over recent state-of-the-art methods across a variety of datasets.

Title: GSHOT: Few-shot Generative Modeling of Labeled Graphs. (arXiv:2306.03480v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03480
Code URL: null
Copy Paste: [[2306.03480] GSHOT: Few-shot Generative Modeling of Labeled Graphs](http://arxiv.org/abs/2306.03480) #generative
Summary:
Deep graph generative modeling has gained enormous traction in recent years due to its impressive ability to directly learn the underlying hidden graph distribution. Despite their initial success, these techniques, like many existing deep generative methods, require a large number of training samples to learn a good model. Unfortunately, a large number of training samples may not always be available in scenarios such as drug discovery for rare diseases. At the same time, recent advances in few-shot learning have opened the door to applications where available training data is limited. In this work, we introduce the hitherto unexplored paradigm of few-shot graph generative modeling. Towards this, we develop GSHOT, a meta-learning based framework for few-shot labeled graph generative modeling. GSHOT learns to transfer meta-knowledge from similar auxiliary graph datasets. Utilizing these prior experiences, GSHOT quickly adapts to an unseen graph dataset through self-paced fine-tuning. Through extensive experiments on datasets from diverse domains having limited training samples, we establish that GSHOT generates graphs of superior fidelity compared to existing baselines.

large language model

Title: Prompting Large Language Models to Reformulate Queries for Moment Localization. (arXiv:2306.03422v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03422
Code URL: null
Copy Paste: [[2306.03422] Prompting Large Language Models to Reformulate Queries for Moment Localization](http://arxiv.org/abs/2306.03422) #large language model
Summary:
The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query. Since untrimmed video contains highly redundant contents, the quality of the query is crucial for accurately localizing moments, i.e., the query should provide precise information about the target moment so that the localization model can understand what to look for in the videos. However, the natural language queries in current datasets may not be easy to understand for existing models. For example, the Ego4D dataset uses question sentences as the query to describe relatively complex moments. While being natural and straightforward for humans, understanding such question sentences is challenging for mainstream moment localization models like 2D-TAN. Inspired by the recent success of large language models, especially their ability to understand and generate complex natural language contents, in this extended abstract, we make early attempts at reformulating the moment queries into a set of instructions using large language models and making them more friendly to the localization models.

Title: A Static Evaluation of Code Completion by Large Language Models. (arXiv:2306.03203v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03203
Code URL: null
Copy Paste: [[2306.03203] A Static Evaluation of Code Completion by Large Language Models](http://arxiv.org/abs/2306.03203) #large language model
Summary:
Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.
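As an illustration of the kind of AST-based check the abstract describes, the following sketch flags undefined names and unused variables in a code completion using Python's standard `ast` module. It is a deliberately simplified, scope-insensitive approximation written for this digest, not the paper's framework; `static_errors` is a hypothetical helper.

```python
import ast
import builtins


def static_errors(source):
    """Report names read but never bound, and variables assigned but never read.

    Simplification (an assumption of this sketch): all scopes are merged into
    one, so shadowing and ordering of definitions are ignored.
    """
    tree = ast.parse(source)
    stored, defined, used = set(), set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                stored.add(node.id)
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
    known = defined | set(dir(builtins))
    return {
        "undefined_names": sorted(used - known),
        "unused_variables": sorted(stored - used),
    }


completion = """
def area(width, height):
    scale = 2
    return width * heigth
"""
report = static_errors(completion)
```

Here `report` lists the misspelled `heigth` as an undefined name and `scale` as an unused variable, without ever executing the completion. A production linter would additionally track scopes, which this sketch omits.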

Title: NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks. (arXiv:2306.03208v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03208
Code URL: null
Copy Paste: [[2306.03208] NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks](http://arxiv.org/abs/2306.03208) #large language model
Summary:
Finetuning large language models inflates the costs of NLU applications and remains the bottleneck of development cycles. Recent works in computer vision use data pruning to reduce training time. Pruned data selection with static methods is based on a score calculated for each training example prior to finetuning, which involves substantial computational overhead. Moreover, the score may not necessarily be representative of sample importance throughout the entire training duration. We propose to address these issues with a refined version of dynamic data pruning, a curriculum which periodically scores and discards unimportant examples during finetuning. Our method leverages an EL2N metric that we extend to the joint intent and slot classification task, and an initial finetuning phase on the full train set. Our results on the GLUE benchmark and four joint NLU datasets show a better time-accuracy trade-off compared to static methods. Our method preserves full accuracy while training on 50% of the data points and reduces computation time by up to 41%. If we instead tolerate a minor accuracy drop of 1%, we can prune 80% of the training examples for a reduction in finetuning time reaching 66%.
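For readers unfamiliar with EL2N, the score of an example is the L2 norm of the difference between the model's predicted class probabilities and the one-hot label. The sketch below computes it for single-label classification and keeps the highest-scoring fraction of examples. Treating high-scoring examples as the ones to keep is an assumption here, and the paper's extension to joint intent and slot classification is not reproduced; `el2n_score` and `keep_hardest` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n_score(logits, label):
    """L2 norm of (predicted probabilities - one-hot label)."""
    probs = softmax(logits)
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def keep_hardest(examples, keep_fraction):
    """Return sorted indices of the keep_fraction of examples with the highest EL2N."""
    scores = [el2n_score(logits, label) for logits, label in examples]
    n_keep = math.ceil(keep_fraction * len(examples))
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# (logits, label) pairs: confidently right, uncertain, confidently wrong.
examples = [([5.0, 0.0], 0), ([0.0, 0.0], 0), ([0.0, 5.0], 0)]
kept = keep_hardest(examples, 0.5)
```

In a dynamic pruning curriculum, a call like `keep_hardest` would be repeated periodically during finetuning with fresh logits, so the retained subset changes as the model learns.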

Title: Understanding the Effectiveness of Early Weight Averaging for Training Large Language Models. (arXiv:2306.03241v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03241
Code URL: null
Copy Paste: [[2306.03241] Understanding the Effectiveness of Early Weight Averaging for Training Large Language Models](http://arxiv.org/abs/2306.03241) #large language model
Summary:
Training LLMs is expensive, and recent evidence indicates training all the way to convergence is inefficient. In this paper, we investigate the ability of a simple idea, checkpoint averaging along the trajectory of a training run, to improve the quality of models before they have converged. This approach incurs no extra cost during training or inference. Specifically, we analyze the training trajectories of Pythia LLMs with 1 to 12 billion parameters and demonstrate that, particularly during the early to mid stages of training, this idea accelerates convergence and improves both test and zero-shot generalization. Loss spikes are a well-recognized problem in LLM training; in our analysis we encountered two instances of this in the underlying trajectories, and both instances were mitigated by our averaging. For a 6.9B parameter LLM, for example, our early weight averaging recipe can save up to 4200 hours of GPU time, which corresponds to significant savings in cloud compute costs.
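
Checkpoint averaging itself is a small operation: take several saved parameter sets from one run and average them element-wise. The sketch below shows a uniform average over checkpoints represented as plain dictionaries of float lists. The representation and the name `average_checkpoints` are assumptions, and the summary does not say which checkpoints or how many the authors average.

```python
def average_checkpoints(checkpoints):
    """Uniformly average checkpoints given as dicts: parameter name -> list of floats.

    Assumes every checkpoint has the same parameter names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


# Two toy checkpoints taken at different training steps.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_checkpoints([ckpt_a, ckpt_b])
```

Because the average is computed offline from checkpoints that are saved anyway, it adds no work to the training loop, and the merged model has the same size as any single checkpoint.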

Title: Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models. (arXiv:2306.03268v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2306.03268
Code URL: null
Copy Paste: [[2306.03268] Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](http://arxiv.org/abs/2306.03268) #large language model
Summary:
Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are expensive both to train and to deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich aligned code and text data are available. We adopt standard practices for pre-training large language models, including using a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $\$187$ and $\$800$ respectively. We compare the performance of our models with both the previous SOTA model trained on SO data exclusively as well as general-purpose BERT models and OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is also often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.

Title: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. (arXiv:2306.03341v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03341
Code URL: https://github.com/likenneth/honest_llama
Copy Paste: [[2306.03341] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model](http://arxiv.org/abs/2306.03341) #large language model
Summary:
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
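
The core operation described above, shifting the activations of a few attention heads along fixed directions, can be sketched in a few lines. This is an illustration only: the function `intervene`, the dictionary layout keyed by `(layer, head)`, and the plain `alpha * direction` update are assumptions. How the directions are found and scaled is not specified in the summary; the linked repository holds the authors' implementation.

```python
def intervene(head_activations, directions, alpha):
    """Shift selected heads' activations along their directions.

    head_activations: dict (layer, head) -> list of floats
    directions:       dict (layer, head) -> list of floats, selected heads only
    alpha:            intervention strength (0 leaves activations unchanged)
    """
    shifted = {}
    for head, activation in head_activations.items():
        direction = directions.get(head)
        if direction is None:
            shifted[head] = list(activation)  # head not selected: copy as-is
        else:
            shifted[head] = [a + alpha * d for a, d in zip(activation, direction)]
    return shifted


activations = {(0, 1): [1.0, 0.0], (0, 2): [0.5, 0.5]}
truthful_directions = {(0, 1): [0.0, 1.0]}  # only one head is selected
result = intervene(activations, truthful_directions, alpha=2.0)
```

Raising `alpha` pushes the selected heads further along their directions, which is the knob the summary refers to when it mentions tuning the intervention strength to balance truthfulness against helpfulness.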

Title: On the Role of Attention in Prompt-tuning. (arXiv:2306.03435v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03435
Code URL: null
Copy Paste: [[2306.03435] On the Role of Attention in Prompt-tuning](http://arxiv.org/abs/2306.03435) #large language model
Summary:
Prompt-tuning is an emerging strategy to adapt large language models (LLMs) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture-models where each input token belongs to a context-relevant or -irrelevant set. We isolate the role of prompt-tuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent and show that it learns the prompt and prediction head with near-optimal sample complexity, and demonstrate how the prompt can provably attend to sparse context-relevant tokens. (3) Assuming a known prompt but an unknown prediction head, we characterize the exact finite sample performance of prompt-attention, which reveals the fundamental performance limits and the precise benefit of the context information. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
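
For intuition about how a learned prompt can single out relevant tokens, the sketch below implements a bare softmax attention in which one prompt vector scores every token and the tokens are pooled by the resulting weights. It is a generic simplification, not the paper's exact prompt-attention model: there are no key or value matrices and no prediction head, and the name `prompt_attention` is hypothetical.

```python
import math


def prompt_attention(tokens, prompt):
    """Softmax attention of one prompt vector over a list of token vectors.

    Returns (attention weights, weighted average of the tokens).
    """
    scores = [sum(x * q for x, q in zip(tok, prompt)) for tok in tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
    return weights, pooled


tokens = [[1.0, 0.0], [0.0, 1.0]]  # first token plays the "relevant" role
uniform_w, uniform_out = prompt_attention(tokens, prompt=[0.0, 0.0])
sharp_w, sharp_out = prompt_attention(tokens, prompt=[10.0, 0.0])
```

With a zero prompt the weights are uniform, while a prompt aligned with the first token concentrates nearly all of the weight on it.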

Title: Large Language Models of Code Fail at Completing Code with Potential Bugs. (arXiv:2306.03438v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2306.03438
Code URL: null
Copy Paste: [[2306.03438] Large Language Models of Code Fail at Completing Code with Potential Bugs](http://arxiv.org/abs/2306.03438) #large language model
Summary:
Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs: anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono on test cases of buggy-HumanEval drop by more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a large gap in post-mitigation performance.
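
A semantics-altering operator change can be produced mechanically by rewriting the syntax tree. The sketch below flips the first eligible operator it finds (`+` and `-` swapped, `==` and `!=` swapped) and leaves the rest of the code alone. The operator pairs, the class and function names, and the "first occurrence" policy are assumptions for illustration, not the construction used for buggy-HumanEval. It requires Python 3.9+ for `ast.unparse`.

```python
import ast


class InjectOperatorBug(ast.NodeTransformer):
    """Flip the first eligible operator encountered, then stop changing things."""

    BINARY = {ast.Add: ast.Sub, ast.Sub: ast.Add}
    COMPARE = {ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}

    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        swap = self.BINARY.get(type(node.op))
        if swap is not None and not self.done:
            node.op = swap()
            self.done = True
        return self.generic_visit(node)

    def visit_Compare(self, node):
        if not self.done:
            for i, op in enumerate(node.ops):
                swap = self.COMPARE.get(type(op))
                if swap is not None:
                    node.ops[i] = swap()
                    self.done = True
                    break
        return self.generic_visit(node)


def inject_bug(source: str) -> str:
    """Return the source with a single operator flipped."""
    tree = InjectOperatorBug().visit(ast.parse(source))
    return ast.unparse(tree)


buggy_context = inject_bug("def add(a, b):\n    return a + b")
```

The rewritten snippet still parses and runs, but `add` now subtracts. That matches the notion of a potential bug in the summary: code that looks plausible as completion context while its meaning has changed.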

segmentation

Title: Zero-Shot 3D Shape Correspondence. (arXiv:2306.03253v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03253
Code URL: null
Copy Paste: [[2306.03253] Zero-Shot 3D Shape Correspondence](http://arxiv.org/abs/2306.03253) #segmentation
Summary:
We propose a novel zero-shot approach to computing correspondences between 3D shapes. Existing approaches mainly focus on isometric and near-isometric shape pairs (e.g., human vs. human), but less attention has been given to strongly non-isometric and inter-class shape matching (e.g., human vs. cow). To this end, we introduce a fully automatic method that exploits the exceptional reasoning capabilities of recent foundation models in language and vision to tackle difficult shape correspondence problems. Our approach comprises multiple stages. First, we classify the 3D shapes in a zero-shot manner by feeding rendered shape views to a language-vision model (e.g., BLIP2) to generate a list of class proposals per shape. These proposals are unified into a single class per shape by employing the reasoning capabilities of ChatGPT. Second, we attempt to segment the two shapes in a zero-shot manner, but in contrast to the co-segmentation problem, we do not require a mutual set of semantic regions. Instead, we propose to exploit the in-context learning capabilities of ChatGPT to generate two different sets of semantic regions for each shape and a semantic mapping between them. This enables our approach to match strongly non-isometric shapes with significant differences in geometric structure. Finally, we employ the generated semantic mapping to produce coarse correspondences that can further be refined by the functional maps framework to produce dense point-to-point maps. Our approach, despite its simplicity, produces highly plausible results in a zero-shot manner, especially between strongly non-isometric shapes.

Title: DVIS: Decoupled Video Instance Segmentation Framework. (arXiv:2306.03413v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03413
Code URL: https://github.com/zhang-tao-whu/DVIS
Copy Paste: [[2306.03413] DVIS: Decoupled Video Instance Segmentation Framework](http://arxiv.org/abs/2306.03413) #segmentation
Summary:
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long videos in the real world, primarily due to two factors. Firstly, offline methods are limited by the tightly-coupled modeling paradigm, which treats all frames equally and disregards the interdependencies between adjacent frames. Consequently, this leads to the introduction of excessive noise during long-term temporal alignment. Secondly, online methods suffer from inadequate utilization of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment outcomes via frame-by-frame association during tracking, and 2) the effective utilization of temporal information predicated on the aforementioned accurate alignment outcomes during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are super light-weight (only 1.69% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11G memory. The code is available at https://github.com/zhang-tao-whu/DVIS.

Title: Instructive Feature Enhancement for Dichotomous Medical Image Segmentation. (arXiv:2306.03497v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03497
Code URL: https://github.com/yezi-66/ife
Copy Paste: [[2306.03497] Instructive Feature Enhancement for Dichotomous Medical Image Segmentation](http://arxiv.org/abs/2306.03497) #segmentation
Summary:
Deep neural networks have been widely applied in dichotomous medical image segmentation (DMIS) of many anatomical structures in several modalities, achieving promising performance. However, existing networks tend to rely on task-specific, heavy and complex designs to improve accuracy. They provide little guidance on which feature channels are more beneficial for segmentation, which may be why the performance and universality of these segmentation models are hindered. In this study, we propose an instructive feature enhancement approach, namely IFE, to adaptively select feature channels with rich texture cues and strong discriminability to enhance raw features based on local curvature or global information entropy criteria. Being plug-and-play and applicable to diverse DMIS tasks, IFE encourages the model to focus on texture-rich features, which are especially important for ambiguous and challenging boundary identification, simultaneously achieving simplicity, universality, and a degree of interpretability. To evaluate the proposed IFE, we constructed the first large-scale DMIS dataset Cosmos55k, which contains 55,023 images from 7 modalities and 26 anatomical structures. Extensive experiments show that IFE can improve the performance of classic segmentation networks across different anatomies and modalities with only slight modifications. Code is available at https://github.com/yezi-66/IFE
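
One way to read the global information entropy criterion is as a ranking of feature channels by the Shannon entropy of their activation histograms. The sketch below implements only that ranking on flattened channels. The bin count, the helper names, and the top-k selection rule are assumptions; it does not cover the curvature criterion or how IFE feeds the selected channels back into the features. The linked repository holds the authors' implementation.

```python
import math
from collections import Counter


def channel_entropy(values, bins=8):
    """Shannon entropy (in bits) of a histogram of one flattened channel."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant channel carries no texture information
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_channels(channels, k, bins=8):
    """Return the indices of the k channels with the highest entropy."""
    scores = [channel_entropy(ch, bins) for ch in channels]
    ranked = sorted(range(len(channels)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


flat = [5.0, 5.0, 5.0, 5.0]      # no variation
textured = [0.0, 1.0, 2.0, 3.0]  # values spread over all bins
chosen = select_channels([flat, textured], k=1, bins=4)
```

The constant channel scores zero and the evenly spread one scores the maximum for four bins, so the textured channel is the one selected.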

Title: Semantic Segmentation on VSPW Dataset through Contrastive Loss and Multi-dataset Training Approach. (arXiv:2306.03508v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2306.03508
Code URL: null
Copy Paste: [[2306.03508] Semantic Segmentation on VSPW Dataset through Contrastive Loss and Multi-dataset Training Approach](http://arxiv.org/abs/2306.03508) #segmentation
Summary:
Video scene parsing incorporates temporal information, which can enhance the consistency and accuracy of predictions compared to image scene parsing. The added temporal dimension enables a more comprehensive understanding of the scene, leading to more reliable results. This paper presents the winning solution of the CVPR 2023 workshop for video semantic segmentation, focusing on enhancing spatial-temporal correlations with contrastive loss. We also explore the influence of multi-dataset training by utilizing a label-mapping technique. The final result aggregates the outputs of the above two models. Our approach achieves 65.95% mIoU on the VSPW dataset, ranking 1st in the VSPW challenge at CVPR 2023.
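
For readers unfamiliar with the reported metric, mIoU is the per-class intersection-over-union averaged over classes. The sketch below computes it on flattened label lists. Skipping classes that appear in neither prediction nor ground truth is one common convention and an assumption here; the challenge's official evaluation may differ in such details.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes for flat lists of class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # ignore classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class from range(num_classes) occurs in the inputs")
    return sum(ious) / len(ious)


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

Here `score` is the average of 1/2 and 2/3, that is 7/12, or roughly 58.3% when expressed the way the summary reports its result.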